Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partition tree.py module into tree package + pickle fix #2863

Merged
merged 12 commits into from Nov 12, 2021

Conversation

tomaarsen
Copy link
Member

@tomaarsen tomaarsen commented Oct 21, 2021

Resolves #2478, resolves #1324

Hello!

Pull request overview

  • Allow unpickling of ParentedTree objects for Python 3.7+. Please see 938a64b to inspect only these changes.
  • Created a doctest to show that pickling now works correctly.
  • Partitioned the large (1800 lines) tree.py top-level module into a tree package.

Details

This PR contains two main changes, which are fairly unrelated beyond their location. As a result, if one change is welcomed, while the other isn't, then I'll happily remove one of the sections.

Allow unpickling of ParentedTree objects for Python 3.7+

For context, this is in response to #2478, which shows that

import pickle
from nltk.tree import ParentedTree

tree = ParentedTree.fromstring('(S (NN x) (NP x) (NN x))')
pickled = pickle.dumps(tree)
tree_2 = pickle.loads(pickled)

print(tree)
print(tree_2)

will throw an error for Python 3.7+. The pickle unpickling behaviour seems very strange, but my understanding of it seems to be that in Python 3.7+ onwards, the __new__ method gets called with some bogus parameters.
Beyond that, the _parent and _label parameters aren't set properly on the child of some parent object (yet), and then _setparent will complain.

The long and short of it is that adding __getnewargs__ as used by pickle (see https://docs.python.org/3.9/library/pickle.html#pickling-class-instances) will specify the correct parameters for __new__ when unpickling, and then checking whether the child even has a _parent attribute further fixes the issue entirely.

I've added a doctest for this under tree.doctest which shows that this now works correctly for all supported Python versions.

You can check these changes in 938a64b. I would recommend doing so, as this PR is kind of large otherwise.

Partitioning the tree.py module into a tree package

I noticed that tree.py was a collection of many Tree related classes, which together made up this really long file. I'm not a particular fan of this - as I prefer different kinds of classes to be in different files. Currently, nltk.tree provides the following classes and methods:

  • nltk.tree:
    • ImmutableProbabilisticTree
    • ImmutableTree
    • ProbabilisticTree
    • Tree
    • bracket_parse
    • sinica_parse
    • ParentedTree
    • MultiParentedTree
    • ImmutableParentedTree
    • ImmutableMultiParentedTree

After this change, these methods and classes arrive from these files:

  • nltk.tree.tree:
    • Tree
  • nltk.tree.immutable
    • ImmutableProbabilisticTree
    • ImmutableTree
    • ImmutableParentedTree
    • ImmutableMultiParentedTree
  • nltk.tree.parented
    • ParentedTree
    • MultiParentedTree
  • nltk.tree.probabilistic
    • ProbabilisticTree
  • nltk.tree.parse
    • bracket_parse
    • sinica_parse

It's important to note that all of these methods and classes can still be imported with either from nltk import ParentedTree or ntk.tree import ParentedTree. This change is purely for code quality and restructuring the codebase a bit, and should not affect users (unless I made a mistake!).

Beyond moving these classes to files, the only real changes I made was to move the import of the classes from nltk.tree.immutable into the _frozen_class method. For example for Tree:

    def _frozen_class(self):
        from nltk.tree.immutable import ImmutableTree

        return ImmutableTree

This is to avoid a circular import.

Discussion

There are two other tree-related files which perhaps ought to be moved into the new tree package: treeprettyprinter.py and treetransforms.py. If we didn't have to worry about people's codebases breaking upon updating, then I would move both of these files into tree, and rename them to prettyprinter.py and transforms.py respectively.
However, doing so would mean that these files need to be imported in a new way, which is a breaking change. Instead of the current way of:

from nltk.treeprettyprinter import TreePrettyPrinter

from nltk.treetransforms import chomsky_normal_form

Users would need to do

from nltk.tree.prettyprinter import TreePrettyPrinter
# or
from nltk.tree import TreePrettyPrinter
# or
from nltk import TreePrettyPrinter

from nltk.tree.transforms import chomsky_normal_form
# or
from nltk.tree import chomsky_normal_form
# or
from nltk import chomsky_normal_form

Perhaps there is a way to move the files while still supporting the old imports? For example by writing some hack into nltk/__init__.py?


Thank you @movabo for letting us know about the bug, and thank you @hitt08 for pointing me to it while I had some spare time to look into it.

  • Tom Aarsen

@tomaarsen tomaarsen added the bug label Oct 21, 2021
@stevenbird stevenbird self-assigned this Oct 22, 2021
@stevenbird
Copy link
Member

FYI @edloper

@stevenbird
Copy link
Member

Perhaps there is a way to move the files while still supporting the old imports? For example by writing some hack into nltk/init.py?

Yes, I like this solution. How about writing pass-through functions and immediately deprecating them.

@tomaarsen
Copy link
Member Author

That works. I'll throw it on my todo list for when I've got some more spare time.

@tomaarsen
Copy link
Member Author

I've expanded this PR to also tackle an issue very similar to #1324.

The bug in question:

from nltk.tree import ParentedTree

ptree = ParentedTree.fromstring("(TOP (S (NP (NNP Bell,)) (NP (NP (DT a) (NN company)) (SBAR (WHNP (WDT which)) (S (VP (VBZ is) (VP (VBN based) (PP (IN in) (NP (NNP LA,)))))))) (VP (VBZ makes) (CC and) (VBZ distributes) (NP (NN computer))) (. products.)))")
ptree.copy(deep=False)

which throws:

Traceback (most recent call last):
  File "[sic]\nltk_2863.py", line 12, in <module>
    ptree.copy(deep=False)
  File "[sic]\nltk\tree.py", line 562, in copy
    return type(self)(self._label, self)
  File "[sic]\nltk\tree.py", line 1277, in __init__
    super().__init__(node, children)
  File "[sic]\nltk\tree.py", line 1047, in __init__
    self._setparent(child, i, dry_run=True)
  File "[sic]\nltk\tree.py", line 1371, in _setparent
    raise ValueError("Can not insert a subtree that already " "has a parent.")
ValueError: Can not insert a subtree that already has a parent.

The actual issue reported in #1324 is about a deepcopy, but I have no issues with a deepcopy.
Furthermore, this error looks very similar to the issue from #2478, which this PR originally set out to resolve.

But upon some investigation, this is actually not really much of an implementation issue - it's a small design oversight.
We're attempting to make a shallow copy of a ParentedTree. For all Tree subclasses, this means that the root node is reconstructed, and the child nodes are used for both the original object and the freshly copied object. But, that means the children have two parents, which is not allowed by ParentedTree.
When trying to make a shallow copy (i.e. import copy; copy.copy(ptree) or ptree.copy(deep=False)), a deep copy is made. Beyond that, a warning is thrown to notify users that a shallow copy isn't supported for a ParentedTree.

My commit 2d5ef12 has added these fixes, and some doctests with documentation on how shallow copying works for different Tree subclasses.


Also, I've now moved treeprettyprinter and treetransform to the tree package, in such a way that old imports will simply call the functions at the new locations.

I'd like to link this PR to #1324 too, so that issue gets closed if this gets merged.

  • Tom Aarsen

@tomaarsen
Copy link
Member Author

I've moved forward with the changes suggested in #2875. I figured this PR would be a good place for it, as this has kind of become a PR that completely overhauls the tree module (now package). For the time being, we can go with the simplest option: no further option for users to modify the looks of the graph. (Note, the old system didn't have that option either)

All that remains is discussing if we want to add it to pip-req.tx. I'm leaning towards simply providing a "smarter" ModuleNotFoundError, as svgling will only be needed in very niche situations. (i.e. when outputting trees in jupyter notebooks)

@tomaarsen tomaarsen linked an issue Nov 7, 2021 that may be closed by this pull request
@stevenbird
Copy link
Member

Looks good to me. I agree about not adding svgling to pip-req.txt.

@stevenbird stevenbird merged commit 68e4e58 into nltk:develop Nov 12, 2021
@stevenbird
Copy link
Member

Thanks for this major reorganisation + many improvements!

@tomaarsen tomaarsen deleted the bugfix/pickle_tree branch November 12, 2021 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants