Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for a sort argument in WordNet methods #3193

Open
bryant1410 opened this issue Oct 10, 2023 · 20 comments
Open

Add support for a sort argument in WordNet methods #3193

bryant1410 opened this issue Oct 10, 2023 · 20 comments

Comments

@bryant1410
Copy link
Contributor

WordNet object methods support a series of methods, such as hypernyms, antonyms, etc., that under the hood use a private method called _related, that supports an argument called sort which by default is True. This argument sorts the output objects by name. For example, see:

def hypernyms(self):
return self._related("@")

However, in some cases, we don't need the output to be sorted, and we may be performing these operations multiple times and on long lists, which incurs in considerable penalties because of multiple needless sorting going on under the hood.

Thus, I believe it'd be important that such methods supported a sort argument (as _related does), whose default value is True (to avoid breaking backward compatibility).

@ekaf
Copy link
Contributor

ekaf commented Mar 6, 2024

There are two _related() functions: one for lemmas, which does not support a sort argument, and one for synsets, which has sort set to True by default.
But False would be a better default, because most users normally don't want the output to be sorted alphabetically, and those who need it can easily sort it by themselves.
So, even though it breaks backward compatibility, the easiest fix would be to just remove the sort option from the synsets' _related() function.

@ekaf
Copy link
Contributor

ekaf commented Mar 8, 2024

However, it seems that we need to sort the output, because otherwise it would be returned in random order, so no test would be reproducible.

Consider the hyponyms of legal action with sort set to False :

from nltk.corpus import wordnet as wn
ss = wn.synset('legal_action.n.01')
print([(wn.ss2of(x),x) for x in ss.hyponyms()])

[('01184230-n', Synset('civil_action.n.01')), ('01184407-n', Synset('counterclaim.n.01')), ('01198588-n', Synset('test_case.n.01')), ('01184045-n', Synset('antitrust_case.n.01')), ('01198307-n', Synset('prosecution.n.01')), ('01184720-n', Synset('lis_pendens.n.01')), ('01184565-n', Synset('custody_case.n.01'))]

This output is in no particular order, and subsequent runs produce a different order.

Also, while it is true that some lists may be long (up to 600 elements in a few cases), there is no reason why it would be sorted multiple times, except if the user explicitly repeated the same call, which is not likely to happen anyway, and would not be an NLTK concern.

After all, the alphabetical sort produces the nicest presentation order, and especially it is stable, so I'm not sure that this issue needs fixing.

@bryant1410
Copy link
Contributor Author

However, it seems that we need to sort the output, because otherwise it would be returned in random order, so no test would be reproducible.

Consider the hyponyms of legal action with sort set to False :

from nltk.corpus import wordnet as wn
ss = wn.synset('legal_action.n.01')
print([(wn.ss2of(x),x) for x in ss.hyponyms()])

[('01184230-n', Synset('civil_action.n.01')), ('01184407-n', Synset('counterclaim.n.01')), ('01198588-n', Synset('test_case.n.01')), ('01184045-n', Synset('antitrust_case.n.01')), ('01198307-n', Synset('prosecution.n.01')), ('01184720-n', Synset('lis_pendens.n.01')), ('01184565-n', Synset('custody_case.n.01'))]

This output is in no particular order, and subsequent runs produce a different order.

Tests can be reproducible by taking the output and sorting it:

from nltk.corpus import wordnet as wn
ss = wn.synset('legal_action.n.01')
print(sorted((wn.ss2of(x),x) for x in ss.hyponyms()))

Also, while it is true that some lists may be long (up to 600 elements in a few cases), there is no reason why it would be sorted multiple times, except if the user explicitly repeated the same call, which is not likely to happen anyway, and would not be an NLTK concern.

My use case involves recursively visiting hypernyms and/or hyponyms of synsets that I won't know prior (they are determined at runtime). I'd need to sort multiple of them multiple times, and I need to do it quickly. See, for example, this function.

I can see other people recursively traversing such relations (or similar ones) without needing to have them in memory (creating a temporary list for them) or sorted.

After all, the alphabetical sort produces the nicest presentation order, and especially it is stable, so I'm not sure that this issue needs fixing.

I agree it's nice to see it sorted when looking at examples manually. But it'd also be nice to have it disabled for performance considerations. E.g., ls, the Unix utility to list directory contents, returns the output sorted but supports an option (-U) that doesn't sort. This option allows saving compute/time for very large directories, and it also allows saving memory because sorting typically requires making a copy in memory. In other words, it's O(n) in memory usage, unlike being O(1) if you just return them one by one and forget about the past.

@ekaf
Copy link
Contributor

ekaf commented Mar 14, 2024

Thanks @bryant1410 for your suggestion to sort the tests' output. However, this would not be necessary if we keep sort=True as default.
Considering your use case, if it was possible to specify sort=False,
would it then really be computationally cheaper to keep making multiple identical calls, rather than caching (eventually all) the results in memory? Wordnet has around 100,000 synsets, 200,000 lemmas, and a limited number of relations, so everything fits easily inside a manageable python dictionary with O(1) lookup.

@bryant1410
Copy link
Contributor Author

Considering your use case, if it was possible to specify sort=False,
would it then really be computationally cheaper to keep making multiple identical calls, rather than caching (eventually all) the results in memory? Wordnet has around 100,000 synsets, 200,000 lemmas, and a limited number of relations, so everything fits easily inside a manageable python dictionary with O(1) lookup.

As you can see in the linked function, I'm already caching its result (among other functions' results I also cache). This function has a @functools.cache.

@ekaf
Copy link
Contributor

ekaf commented Mar 15, 2024

Thanks @bryant1410! Your linked source script contains the following comment:

# For the following two functions: we use _related instead of the actual nice names to avoid sorting unnecessarily.
# @ = hypernyms
# ~ = hyponyms

And you solve the present issue like this, using nothing else than what is already available in NLTK:

for parent in synset._related("@", sort=False):

So users have presently two options: use hypernym() if they want sorted output, or use the lower-level _related() if they want more control.

This seems to justify closing this issue as "already fixed", especially when very little time is available for review or merging eventual PRs.

@bryant1410
Copy link
Contributor Author

IMHO, what I'm doing is more like a workaround than a solution. _related is a private method, meaning somebody in NLTK could move, remove, or rename it without considering backward compatibility.

The solutions I see for this are:

  • Making this method public and, as such, part of the supported public API.
  • Adding sort to the public methods. I think this one is better.

Besides this, it'd be great if _related avoided creating a list when sort=False. Note that I'm calling this method repeatedly in a short time, and it'd be great if I could avoid incurring extra memory and compute. It could return a generator.

I'd be happy to submit a PR.

@ekaf
Copy link
Contributor

ekaf commented Mar 15, 2024

That sounds great, @bryant1410, and you're very welcome to submit a PR! Please note that these functions are called in many doctests all around, and especially in nltk/test/wordnet.doctest. The tests would fail if the output order changed. But you wrote earlier that you will maintain backward compatibility, so your proposal is fine.

bryant1410 added a commit to bryant1410/nltk that referenced this issue Mar 16, 2024
…bject relation methods

Add support for disabling the sorting and list creation for WordNet object relation methods. It keeps backward compatibility.

See nltk#3193.
@stevenbird
Copy link
Member

Do any of the public methods that call _related() have docstrings that promise a sorted result? I prefer the solution that _related() does not sort by default.

Are there use cases for returning an iterator instead of a list? (E.g. large result sets? How large do they get?)

@bryant1410
Copy link
Contributor Author

Are there use cases for returning an iterator instead of a list? (E.g. large result sets? How large do they get?)

My use case would be to avoid constantly creating temporary lists. Each of them is typically very short. I listed each relation size for each English synset and got these stats:

count    173670.000000
mean          1.643047
std           5.655514
min           1.000000
25%           1.000000
50%           1.000000
75%           1.000000
max         661.000000

(Though note the most frequent synsets may get larger ones on average.)

@ekaf
Copy link
Contributor

ekaf commented Mar 23, 2024

I prefer the solution that _related() does not sort by default.

Me too: it breaks backward compatibility, but the current defaults are not optimal. Moreover, returning a generator is also often quicker than a list.

Are there use cases for returning an iterator instead of a list? (E.g. large result sets? How large do they get?)

A very plausible use case could be that you want to check if two given synsets are directly connected. Synsets have only a very small number of hypernyms, but they may have many hyponyms. In that case, it is advantageous to stop computing more hyponyms as soon as the required target is found. I tested this effect, using the new parameters available in PR #3240, finding that the speedup is huge (over 10x) when the target is at the start of the list, but still high (58%) in the median position:


from nltk.corpus import wordnet as wn
import timeit

def speedup(t1,t2):
    return round((100*t1/t2)-100, 2)

def check_hypo_list(source, target):
    return target in source.hyponyms(sort=False, force_list=True)

def check_hypo_generator(source, target):
    return target in source.hyponyms(sort=False, force_list=False)

def testcase(ss, goal):
    t1 = timeit.timeit(lambda: check_hypo_list(ss, goal), number=100)
    t2 = timeit.timeit(lambda: check_hypo_generator(ss, goal), number=100)
    print(f"| {goal.name()} | {speedup(t1,t2)}% |")

#ss = wn.synset('man.n.01')
ss = wn.synset('woman.n.01')

# Test a randomly ordered list of hyponyms:
hypos = ss.hyponyms(sort=False, force_list=True)
print(f"Checking {len(hypos)} hyponyms of {ss.name()}:\n")
print("| Synset | Speedup |")
print("|--------|---------|")

for goal in hypos:
    testcase(ss, goal)

print("|--------|---------|")

# Retest the median position:
goal = hypos[len(hypos)//2]
testcase(ss, goal)

Checking 58 hyponyms of woman.n.01:

Synset Speedup
nymph.n.03 1055.79%
eyeful.n.01 808.67%
unmarried_woman.n.01 644.24%
nullipara.n.01 546.76%
girlfriend.n.02 496.08%
heroine.n.02 434.9%
nanny.n.01 376.01%
coquette.n.01 332.75%
white_woman.n.01 303.59%
inamorata.n.01 274.54%
old_woman.n.01 248.17%
bridesmaid.n.01 227.85%
lady.n.01 199.07%
widow.n.01 181.02%
maenad.n.01 171.69%
smasher.n.02 151.89%
wonder_woman.n.01 145.24%
girl.n.01 136.59%
amazon.n.01 127.05%
ball-buster.n.01 117.97%
dominatrix.n.01 109.49%
sylph.n.01 102.35%
vestal.n.01 95.0%
donna.n.01 90.53%
divorcee.n.01 85.16%
cat.n.03 78.66%
gravida.n.02 72.16%
dame.n.02 63.75%
wave.n.09 64.57%
girl.n.05 58.12%
girlfriend.n.01 47.55%
cinderella.n.01 49.99%
maenad.n.02 52.94%
mistress.n.01 40.44%
nymphet.n.01 39.26%
matriarch.n.01 35.61%
bluestocking.n.01 34.15%
gold_digger.n.02 31.52%
wac.n.01 29.87%
matron.n.03 25.65%
jilt.n.01 24.7%
yellow_woman.n.01 22.37%
mestiza.n.01 19.15%
broad.n.01 17.45%
ex-wife.n.01 14.9%
shiksa.n.01 13.65%
bachelor_girl.n.01 11.49%
jezebel.n.02 10.26%
enchantress.n.01 8.74%
black_woman.n.01 6.85%
mother_figure.n.01 5.34%
debutante.n.01 3.89%
wife.n.01 2.13%
matriarch.n.02 2.63%
b-girl.n.01 -0.57%
baggage.n.02 -2.09%
geisha.n.01 -3.0%
prostitute.n.01 -5.92%
-------- ---------
girl.n.05 58.55%

@ekaf
Copy link
Contributor

ekaf commented Mar 24, 2024

Changing the default output order of the relations would have far-reaching consequences, since it would modify the output of higher order functions like closure and tree, and deeply alter the topology of the relation trees.
But a "middle of the road" alternative is possible, that combines the advantages of both the old and the optimized approach:
keeping the plain relation functions unchanged, and adding a new set of "quick" functions (quick_hypernyms(), quick_hyponyms(), and so on ...), which call a _related_generator() function:

     def _related_generator(self, relation_symbol):
        if relation_symbol in self._pointers:
            yield from (
                self._wordnet_corpus_reader.synset_from_pos_and_offset(pos, offset) 
                    for pos, offset in self._pointers[relation_symbol]
            )

This approach allows to postpone the choice of eventually shifting to better defaults later on.

@stevenbird
Copy link
Member

stevenbird commented Mar 24, 2024

it would modify the output of higher order functions like closure and tree, and deeply alter the topology of the relation trees.

@ekaf can you please clarify how?

Also, do any docstrings of functions that call _related() promise sorted results?

@ekaf
Copy link
Contributor

ekaf commented Mar 24, 2024

Also, do any docstrings of functions that call _related() promise sorted results?

None of these functions has any docstring, so they don't promise anything. Nevertheless, they have returned sorted output since PR #495, back in 2013, so users are accustomed to stability.
By the way, many of the old comments in #458 are still relevant for the present issue.

How can it deeply affect the topology of relation trees?
Functions like closure and tree repeatedly call a given relation on all the outputs of that relation. So when the output order changes, it also impacts the branch ordering at that level of the tree.
It gets worse when redundant branches are truncated, because the truncations will occur at different places of the tree, depending on which node is traversed first.

@ekaf
Copy link
Contributor

ekaf commented Mar 26, 2024

Looping over all the noun synsets in WordNet shows some examples of differences in the tree outputs.
However no particular difference is ever guaranteed, since the unsorted order is random, and may or may not be identical to the sorted order by pure chance.

from nltk.corpus import wordnet as wn
from pprint import pprint
ss = wn.synset('medoc.n.01')

With tree(), using the default hypernyms(sort=True):

pprint(ss.tree(lambda s: s.hypernyms(), depth=2))

[Synset('medoc.n.01'),
[Synset('bordeaux.n.02'), [Synset('wine.n.01')]],
[Synset('red_wine.n.01'), [Synset('wine.n.01')]]]

With tree(), using the new hypernyms(sort=False):

[Synset('medoc.n.01'),
[Synset('red_wine.n.01'), [Synset('wine.n.01')]],
[Synset('bordeaux.n.02'), [Synset('wine.n.01')]]]

The difference can become bigger with acyclic_tree(), since it discards all the cycles, truncating the output tree, from the place where it detects a cycle:

pprint(ss.acyclic_tree(lambda s: s.hypernyms(), depth=2))

With acyclic_tree(), using the default hypernyms(sort=True):

[Synset('medoc.n.01'),
[Synset('bordeaux.n.02'), [Synset('wine.n.01')]],
[Synset('red_wine.n.01')]]

With acyclic_tree(), using the new hypernyms(sort=False):

[Synset('medoc.n.01'),
[Synset('red_wine.n.01'), [Synset('wine.n.01')]],
[Synset('bordeaux.n.02')]]

@ekaf
Copy link
Contributor

ekaf commented Mar 30, 2024

What about just adding an optimized public related method, and leaving everything else unchanged?

   def related(self, relation_symbol):
        """
        :param relation_symbol: the pointer symbol of a synset relation
        :type relation_symbol: str
        :return: an iterator over the relation targets in random order.

        >>> ss = wn.synset('girlfriend.n.01')
        >>> print(sorted(ss.related("@")))
        [Synset('friend.n.01'), Synset('woman.n.01')]
        """
        if relation_symbol not in self._pointers:
            return iter([])
        return (self._wordnet_corpus_reader.synset_from_pos_and_offset(pos, offset)
                for pos, offset in self._pointers[relation_symbol])


@stevenbird
Copy link
Member

stevenbird commented Mar 30, 2024

What about just adding an optimized public related method, and leaving everything else unchanged?

Interesting idea, though it risks making code less readable, having two ways of doing nearly the same thing.

the unsorted order is random

Are you sure? I expect it is the order that related synsets appear in the underlying corpus. I'd be grateful if you wouldn't mind checking.

@ekaf
Copy link
Contributor

ekaf commented Mar 31, 2024

@stevenbird, in the Wordnet db, relation targets are ordered by byte offset, which is not a meaningful order.
Nltk makes it a set:

436: self._pointers = defaultdict(set)
1648: synset._pointers[symbol].add((pos, offset))

These sets are unordered, and one can observe that their output order changes arbitrarily.

A further optimization is possible: since self._pointers is a defaultdict, there is no need to check that relation_symbol is present in it.
I also suggest making the private _related() a wrapper around the optimized wnrel() (or any other better name):

 
    def _related(self, relation_symbol, sort=True):
        """
        Convert the output of 'wnrel()' to a list, and sort it by default.

        :param relation_symbol: the pointer symbol of a synset relation
        :type relation_symbol: str
        :return: a list of this synset's relation targets

        >>> from nltk.corpus import wordnet as wn
        >>> ss = wn.synset('girlfriend.n.01')
        >>> print(ss._related("@"))
        [Synset('friend.n.01'), Synset('woman.n.01')]
        """
        r = list(self.wnrel(relation_symbol))
        if sort:
            r.sort()
        return r

    def wnrel(self, relation_symbol):
        """
        Get an iterator over the synsets related to 'self'.

        :param relation_symbol: the pointer symbol of a synset relation
        :type relation_symbol: str
        :return: an iterator over this synset's relation targets in random order.

        >>> from nltk.corpus import wordnet as wn
        >>> ss = wn.synset('girlfriend.n.01')
        >>> print(sorted(ss.wnrel("@")))
        [Synset('friend.n.01'), Synset('woman.n.01')]
        """
        return (
            self._wordnet_corpus_reader.synset_from_pos_and_offset(pos, offset)
            for pos, offset in self._pointers[relation_symbol]
        )

@stevenbird
Copy link
Member

I favour returning unsorted results which calling programs can trivially sort, and documenting this in docstrings (and release notes and ChangeLog). Yes, some people will see different behaviour and might need to update their code, which has always happened from time to time with NLTK releases. I think that is preferable to proliferating the access methods.

@ekaf
Copy link
Contributor

ekaf commented Apr 2, 2024

Returning unsorted results for all wordnet relations is a big change, that causes many failures in test/wordnet.doctest. Moreover, the number of failures changes randomly between subsequent runs:

1248c1266
<   15 of 197 in wordnet.doctest
---
>   17 of 197 in wordnet.doctest
1250,1251c1268,1269
< 182 passed and 15 failed.
< ***Test Failed*** 15 failures.
---
> 180 passed and 17 failed.
> ***Test Failed*** 17 failures.

And it has also implications outside the wordnet module, with 4 doctests failing in nltk/util.py.

Many doctests can be fixed by wrapping the call with sorted(). But with tree() the changes are deeper in the output structure, and require the possibility of setting sort=True in the relation call.

I initially found it attractive to change the default setting, but am now leaning toward considering a smoother transition.

Another question is what do do with the second part of @bryant1410's proposal: should _related() return a list or an iterator?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants