New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for a sort
argument in WordNet methods
#3193
Comments
There are two _related() functions: one for lemmas, which does not support a sort argument, and one for synsets, which has sort set to True by default. |
However, it seems that we need to sort the output, because otherwise it would be returned in random order, so no test would be reproducible. Consider the hyponyms of legal action with sort set to False :
This output is in no particular order, and subsequent runs produce a different order. Also, while it is true that some lists may be long (up to 600 elements in a few cases), there is no reason why it would be sorted multiple times, except if the user explicitly repeated the same call, which is not likely to happen anyway, and would not be an NLTK concern. After all, the alphabetical sort produces the nicest presentation order, and especially it is stable, so I'm not sure that this issue needs fixing. |
Tests can be reproducible by taking the output and sorting it: from nltk.corpus import wordnet as wn
ss = wn.synset('legal_action.n.01')
print(sorted((wn.ss2of(x),x) for x in ss.hyponyms()))
My use case involves recursively visiting hypernyms and/or hyponyms of synsets that I won't know prior (they are determined at runtime). I'd need to sort multiple of them multiple times, and I need to do it quickly. See, for example, this function. I can see other people recursively traversing such relations (or similar ones) without needing to have them in memory (creating a temporary list for them) or sorted.
I agree it's nice to see it sorted when looking at examples manually. But it'd also be nice to have it disabled for performance considerations. E.g., |
Thanks @bryant1410 for your suggestion to sort the tests' output. However, this would not be necessary if we keep sort=True as default. |
As you can see in the linked function, I'm already caching its result (among other functions' results I also cache). This function has a |
Thanks @bryant1410! Your linked source script contains the following comment:
And you solve the present issue like this, using nothing else than what is already available in NLTK:
So users have presently two options: use hypernym() if they want sorted output, or use the lower-level _related() if they want more control. This seems to justify closing this issue as "already fixed", especially when very little time is available for review or merging eventual PRs. |
IMHO, what I'm doing is more like a workaround than a solution. The solutions I see for this are:
Besides this, it'd be great if I'd be happy to submit a PR. |
That sounds great, @bryant1410, and you're very welcome to submit a PR! Please note that these functions are called in many doctests all around, and especially in nltk/test/wordnet.doctest. The tests would fail if the output order changed. But you wrote earlier that you will maintain backward compatibility, so your proposal is fine. |
…bject relation methods Add support for disabling the sorting and list creation for WordNet object relation methods. It keeps backward compatibility. See nltk#3193.
Do any of the public methods that call Are there use cases for returning an iterator instead of a list? (E.g. large result sets? How large do they get?) |
My use case would be to avoid constantly creating temporary lists. Each of them is typically very short. I listed each relation size for each English synset and got these stats:
(Though note the most frequent synsets may get larger ones on average.) |
Me too: it breaks backward compatibility, but the current defaults are not optimal. Moreover, returning a generator is also often quicker than a list.
A very plausible use case could be that you want to check if two given synsets are directly connected. Synsets have only a very small number of hypernyms, but they may have many hyponyms. In that case, it is advantageous to stop computing more hyponyms as soon as the required target is found. I tested this effect, using the new parameters available in PR #3240, finding that the speedup is huge (over 10x) when the target is at the start of the list, but still high (58%) in the median position:
Checking 58 hyponyms of woman.n.01:
|
Changing the default output order of the relations would have far-reaching consequences, since it would modify the output of higher order functions like closure and tree, and deeply alter the topology of the relation trees.
This approach allows to postpone the choice of eventually shifting to better defaults later on. |
@ekaf can you please clarify how? Also, do any docstrings of functions that call |
None of these functions has any docstring, so they don't promise anything. Nevertheless, they have returned sorted output since PR #495, back in 2013, so users are accustomed to stability. How can it deeply affect the topology of relation trees? |
Looping over all the noun synsets in WordNet shows some examples of differences in the tree outputs.
With tree(), using the default hypernyms(sort=True):
With tree(), using the new hypernyms(sort=False):
The difference can become bigger with acyclic_tree(), since it discards all the cycles, truncating the output tree, from the place where it detects a cycle:
With acyclic_tree(), using the default hypernyms(sort=True):
With acyclic_tree(), using the new hypernyms(sort=False):
|
What about just adding an optimized public related method, and leaving everything else unchanged?
|
Interesting idea, though it risks making code less readable, having two ways of doing nearly the same thing.
Are you sure? I expect it is the order that related synsets appear in the underlying corpus. I'd be grateful if you wouldn't mind checking. |
@stevenbird, in the Wordnet db, relation targets are ordered by byte offset, which is not a meaningful order.
These sets are unordered, and one can observe that their output order changes arbitrarily. A further optimization is possible: since self._pointers is a defaultdict, there is no need to check that relation_symbol is present in it.
|
I favour returning unsorted results which calling programs can trivially sort, and documenting this in docstrings (and release notes and ChangeLog). Yes, some people will see different behaviour and might need to update their code, which has always happened from time to time with NLTK releases. I think that is preferable to proliferating the access methods. |
Returning unsorted results for all wordnet relations is a big change, that causes many failures in test/wordnet.doctest. Moreover, the number of failures changes randomly between subsequent runs:
And it has also implications outside the wordnet module, with 4 doctests failing in nltk/util.py. Many doctests can be fixed by wrapping the call with sorted(). But with tree() the changes are deeper in the output structure, and require the possibility of setting sort=True in the relation call. I initially found it attractive to change the default setting, but am now leaning toward considering a smoother transition. Another question is what do do with the second part of @bryant1410's proposal: should _related() return a list or an iterator? |
WordNet object methods support a series of methods, such as
hypernyms
,antonyms
, etc., that under the hood use a private method called_related
, that supports an argument calledsort
which by default isTrue
. This argument sorts the output objects by name. For example, see:nltk/nltk/corpus/reader/wordnet.py
Lines 134 to 135 in e2d368e
However, in some cases, we don't need the output to be sorted, and we may be performing these operations multiple times and on long lists, which incurs in considerable penalties because of multiple needless sorting going on under the hood.
Thus, I believe it'd be important that such methods supported a
sort
argument (as_related
does), whose default value isTrue
(to avoid breaking backward compatibility).The text was updated successfully, but these errors were encountered: