Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicate output in acyclic_breadth_first #3245

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

ekaf
Copy link
Contributor

@ekaf ekaf commented Mar 25, 2024

Fix #3244.

Duplicate elimination is a central theme in the literature on the transitive closure of relational operators.

Note that, when the graph is acyclic, duplicate elimination is important only for efficiency, not for termination.

Performance Evaluation of Algorithms for Transitive Closure, Kabler et. al., InformationSystems Vol. 17, No. 5, PP. 415-441, 1992.

With cyclic cases, however, duplicate elimination becomes also crucial for termination, as seen in #2314.

Currently, the closure operation in wordnet.py uses the acyclic_breadth_first() function imported from nltk/util.py, which is supposed to prevent cycles. However, this function presently only detects cyclic children branches, but it does not properly eliminate some duplicates, that are accessible through different parent paths.

Consider for ex. the transitive closure of the hypernyms of 'calamagrostis.n.01', which presently duplicates the nodes that are accessible by two different paths.

With the present PR, the same call does not output duplicates:

from nltk.corpus import wordnet as wn
ss=wn.synset("calamagrostis.n.01")
print(list(ss.closure(lambda s: s.hypernyms())))

[Synset('gramineae.n.01'), Synset('monocot_genus.n.01'), Synset('monocot_family.n.01'), Synset('genus.n.02'), Synset('family.n.06'), Synset('taxonomic_group.n.01'), Synset('biological_group.n.01'), Synset('group.n.01'), Synset('abstraction.n.06'), Synset('entity.n.01')]

@ekaf
Copy link
Contributor Author

ekaf commented Mar 25, 2024

This PR includes #3218, which fixes the problem raised by @rmccampbell, concerning excessive verbosity in the warnings.

@ekaf
Copy link
Contributor Author

ekaf commented Apr 4, 2024

The acyclic_breadth_first() function uses two structures to manage the relation closure: traversed, a set of nodes already seen, and queue, a deque of (child, level) pairs to lookup next.
This picture shows how 'taxonomic_group.n.01', is added to the queue twice, while not yet added to the traversed set. The problem occurs later, when popping duplicates from the queue without checking traversed. So there are two ways of solving the problem, either checking traversed, or preventing duplicates in the queue. This PR implements the first solution, but it might be interesting to consider the second instead.

@ekaf
Copy link
Contributor Author

ekaf commented Apr 4, 2024

Calculating the transitive hypernymic closure of all the 82115 noun synsets in WordNet 3.0 with either solution yields identical closures in both cases, but the solution implemented in this PR is 15% quicker.

@stevenbird stevenbird self-assigned this Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicates in wordnet hypernyms closure
2 participants