Add support for disabling the sorting and list creation for WordNet object relation methods #3240

bryant1410 · 2024-03-16T22:48:57Z

Add support for disabling the sorting and list creation for WordNet object relation methods. It keeps backward compatibility.

See #3193.

…bject relation methods Add support for disabling the sorting and list creation for WordNet object relation methods. It keeps backward compatibility. See nltk#3193.

ekaf · 2024-03-17T11:14:10Z

@bryant1410, how big a speed improvement do you typically observe on your use case, when setting sort and force_list to False?

bryant1410 · 2024-03-17T19:10:00Z

Hey, I did a quick test based on one of the functions I use. The setup is:

pip install git+https://github.com/bryant1410/nltk@patch-5

def _get_sibling_or_cousin_co_hyponym_lemmas(synset):
    for parent in synset.hypernyms():
        for grandparent in parent.hypernyms():
            for uncle_or_parent in grandparent.hyponyms():
                for cousin_or_sibling in uncle_or_parent.hyponyms():
                    if cousin_or_sibling != synset:
                        yield from cousin_or_sibling.lemmas()


def _get_sibling_or_cousin_co_hyponym_lemmas_no_sort(synset):
    for parent in synset.hypernyms(sort=False, force_list=False):
        for grandparent in parent.hypernyms(sort=False, force_list=False):
            for uncle_or_parent in grandparent.hyponyms(sort=False, force_list=False):
                for cousin_or_sibling in uncle_or_parent.hyponyms(sort=False, force_list=False):
                    if cousin_or_sibling != synset:
                        yield from cousin_or_sibling.lemmas()

import timeit
from nltk.corpus import wordnet as wn

synset = wn.synset("man.n.01")

# We warm it up before benchmarking:
list(_get_sibling_or_cousin_co_hyponym_lemmas(synset))

With default values:

>>> timeit.timeit(lambda: list(_get_sibling_or_cousin_co_hyponym_lemmas(synset)), number=100)
0.30820278730243444

With disabling sorting and list creations:

>>> timeit.timeit(lambda: list(_get_sibling_or_cousin_co_hyponym_lemmas_no_sort(synset)), number=100)
0.20816301740705967

About 50% faster for this example.

More examples (in milliseconds):

Synset	Before	After
man.n.01	308	208
woman.n.01	308	207
banana.n.01	62	37
apple.n.01	36	24
orange.n.01	13	9
dog.n.01	15	11
cat.n.01	4	3
calamagrostis.n.01	274	148
adult_male.n.01	309	213
adult.n.01	118	63
girlfriend.n.01	183	123
somebody	159	109

In my real use case example, I call this function multiple times per data loader worker (which, in turn, I have one per CPU core). It's part of a code that bottlenecks (starves) the GPU usage, as I have to compute thousands of these per second per data loader worker.

ekaf · 2024-03-18T11:13:08Z

There are two problems with the above test script.
First, it only shows the speedups attributed to sort=False. Setting force_list to False has no effect on the speed since the output is converted to a list anyway at last.
Second, the test only checks that any cousin_or_sibling found is different from the input synset, while it should also check that they have not been seen yet. For this reason, the list of cousin_or_sibling lemmas of man.n.01 has length 4430, while the corresponding set only has 2085 elements.

bryant1410 · 2024-03-18T15:53:32Z

First, it only shows the speedups attributed to sort=False. Setting force_list to False has no effect on the speed since the output is converted to a list anyway at last.

Note that the list conversion occurs to the final output, not the intermediate ones. The list creations that are avoided are each of the outputs to the hypernyms or hyponyms calls.

Still, we can do a test that avoids the final list. E.g., we can follow this SO post and just consume the iterable the fastest way possible. I did this and observed differences of 1-7ms.

Having said this, I later observed that using force_list=True actually makes it a bit faster (single-digit milliseconds). I looked online and found people discussing that it depends on the use case (e.g., using these functions for _get_sibling_or_cousin_co_hyponym_lemmas vs for a different use case). Still, there are memory benefits, so it's a tradeoff.

Second, the test only checks that any cousin_or_sibling found is different from the input synset, while it should also check that they have not been seen yet. For this reason, the list of cousin_or_sibling lemmas of man.n.01 has length 4430, while the corresponding set only has 2085 elements.

Thanks for catching this! For my use case, it's fine, and I believe it's preferred.

ekaf · 2024-03-19T10:01:23Z

I also observe slightly better timings with force_list=True. The difference is small but consistent, so this additional parameter does not contribute to solving the speed bottleneck from #3193. Still, it may be useful for uses where one wants to consume the output one piece at a time instead of waiting for the whole list.

ekaf · 2024-03-20T11:57:08Z

Testing with more inputs, the speedup from force_list= False stays small, but it displays some randomness, and it is not always positive or negative. It may be questionable whether it is worth the cost of managing this additional parameter:

Input	Unique cousins	sort gain	list gain
man.n.01	2085	25.74%	-0.12%
woman.n.01	2085	25.94%	-1.3%
banana.n.01	1357	59.19%	-0.2%
apple.n.01	210	41.86%	0.27%
orange.n.01	95	33.44%	0.35%
dog.n.01	204	26.96%	1.65%
cat.n.01	104	22.75%	1.54%
calamagrostis.n.01	4824	39.42%	-0.72%
adult_male.n.01	2085	26.61%	1.46%
adult.n.01	971	52.92%	2.5%
girlfriend.n.01	2377	25.42%	1.08%

sort gain is the percentage speedup of setting sort=False, w.r.t. the baseline, and list gain is the additional speedup of also setting force_list=False.

def speedup(t1,t2):
    return round((100*t1/t2)-100, 2)

In these tests, the functions to retrieve second degree cousins use sets, in order to avoid duplicates. This is much more effective (and in itself much quicker than the sort=False optimization) when the ratio of duplicates over unique cousins is high, as in the 'man.n.01' example, where it is 4430 over 2085. But when the ratio is low, as with 'dog.n.01' (212 over 204), using sets actually incurs a 20% runtime penalty.

ekaf

        if relation_symbol not in self._pointers:
            return []

With force_list= False, the return type of _related() becomes unpredictable, because it is an iterator when the relation_symbol is present in self._pointers, but an empty list otherwise, in which case calling next() on the output fails with an Error.

bryant1410 · 2024-03-30T09:21:28Z

It's an iterable regardless.

…

On Sat, Mar 30, 2024, 5:19 AM Eric Kafe ***@***.***> wrote: ***@***.**** requested changes on this pull request. if relation_symbol not in self._pointers: return [] With *force_list= False*, the return type of *_related()* becomes unpredictable, because it is an iterator when the *relation_symbol* is present in *self._pointers*, but an empty list otherwise, in which case calling *next()* on the output fails with an Error. — Reply to this email directly, view it on GitHub <#3240 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA5ZPXJLUDVN25AA2XKSDQTY2ZYQZAVCNFSM6AAAAABEZVFBUGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTSNRZHA3TCOJWGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ekaf · 2024-03-30T09:43:41Z

It's an iterable regardless.

But it is not an iterator: next([]) fails with

TypeError: 'list' object is not an iterator

while next(iter([])) returns

StopIteration

bryant1410 · 2024-03-30T10:10:48Z

You can do `iter` regardless, then `next`. Or a `for` loop directly.

…

On Sat, Mar 30, 2024, 5:44 AM Eric Kafe ***@***.***> wrote: It's an iterable regardless. But it is not an iterator: *next([])* fails with TypeError: 'list' object is not an iterator while *next(iter([]))* returns StopIteration — Reply to this email directly, view it on GitHub <#3240 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA5ZPXOAWCX5IDCLUQESZPDY2Z3NHAVCNFSM6AAAAABEZVFBUGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRXHE4TIMBZGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ekaf · 2024-03-30T10:22:19Z

Wouldn't you prefer that the output of _related(force_list=False) had the same predictable type, no matter whether targets were found or not?

bryant1410 · 2024-03-30T18:59:26Z

For me personally no but I can see how other people would prefer that. If it's always e.g. an iterable type (that has `__iter__`) then it's fine for me. So I have that one guarantee. Which makes sense with something not being forced; it's more relaxed and something is not guaranteed then (that is not a list).

…

On Sat, Mar 30, 2024, 6:22 AM Eric Kafe ***@***.***> wrote: Wouldn't you prefer that the output of _*related(force_list=False)* had the same predictable type, no matter whether targets were found or not? — Reply to this email directly, view it on GitHub <#3240 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA5ZPXN2B5URPVJNH3BNQPDY2Z76DAVCNFSM6AAAAABEZVFBUGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRYGAYDIMRXGU> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Add support for disabling the sorting and list creation for WordNet o…

2dec414

…bject relation methods Add support for disabling the sorting and list creation for WordNet object relation methods. It keeps backward compatibility. See nltk#3193.

github-actions bot added the corpus label Mar 16, 2024

Add a test

b3f81f0

bryant1410 marked this pull request as ready for review March 16, 2024 22:57

ekaf mentioned this pull request Mar 23, 2024

Add support for a sort argument in WordNet methods #3193

Open

ekaf requested changes Mar 30, 2024

View reviewed changes

stevenbird self-assigned this Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for disabling the sorting and list creation for WordNet object relation methods #3240

Add support for disabling the sorting and list creation for WordNet object relation methods #3240

bryant1410 commented Mar 16, 2024

ekaf commented Mar 17, 2024

bryant1410 commented Mar 17, 2024 •

edited

ekaf commented Mar 18, 2024 •

edited

bryant1410 commented Mar 18, 2024

ekaf commented Mar 19, 2024

ekaf commented Mar 20, 2024 •

edited

ekaf left a comment

bryant1410 commented Mar 30, 2024 via email

ekaf commented Mar 30, 2024

bryant1410 commented Mar 30, 2024 via email

ekaf commented Mar 30, 2024

bryant1410 commented Mar 30, 2024 via email

Add support for disabling the sorting and list creation for WordNet object relation methods #3240

Are you sure you want to change the base?

Add support for disabling the sorting and list creation for WordNet object relation methods #3240

Conversation

bryant1410 commented Mar 16, 2024

ekaf commented Mar 17, 2024

bryant1410 commented Mar 17, 2024 • edited

ekaf commented Mar 18, 2024 • edited

bryant1410 commented Mar 18, 2024

ekaf commented Mar 19, 2024

ekaf commented Mar 20, 2024 • edited

ekaf left a comment

Choose a reason for hiding this comment

bryant1410 commented Mar 30, 2024 via email

ekaf commented Mar 30, 2024

bryant1410 commented Mar 30, 2024 via email

ekaf commented Mar 30, 2024

bryant1410 commented Mar 30, 2024 via email

bryant1410 commented Mar 17, 2024 •

edited

ekaf commented Mar 18, 2024 •

edited

ekaf commented Mar 20, 2024 •

edited