Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent LazyModule from increasing the size of nltk.__dict__ #3033

Merged
merged 1 commit into from
Sep 3, 2022

Conversation

tomaarsen
Copy link
Member

@tomaarsen tomaarsen commented Aug 22, 2022

Hello!

Pull request overview

  • Fix issue with LazyModule that would increase the size of nltk.__dict__, which may break programs such as pytest --doctest-modules from iterating over all modules.

Details

NLTK uses 5 LazyModule instances:

nltk/nltk/__init__.py

Lines 164 to 168 in 3b72fa2

app = lazyimport.LazyModule("nltk.app", locals(), globals())
chat = lazyimport.LazyModule("nltk.chat", locals(), globals())
corpus = lazyimport.LazyModule("nltk.corpus", locals(), globals())
draw = lazyimport.LazyModule("nltk.draw", locals(), globals())
toolbox = lazyimport.LazyModule("nltk.toolbox", locals(), globals())

When inspecting nltk after importing with import nltk, you can see that nltk.__dict__["toolbox"] refers to LazyModule. Then, after interacting with nltk.toolbox, it will create a new entry in nltk.__dict__, accessible like so: nltk.__dict__["nltk.toolbox"]. This one will point to the actually loaded toolbox module.

>>> import nltk
>>> nltk.__dict__["toolbox"]
<LazyModule 'nltk.nltk.toolbox'>
# This is to be expected

>>> nltk.__dict__["nltk.toolbox"]
Traceback (most recent call last):   
  File "<stdin>", line 1, in <module>
KeyError: 'nltk.toolbox'
# This is also to be expected

>>> nltk.toolbox.hello
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\GitHub\nltk\nltk\lazyimport.py", line 120, in __getattr__
    return getattr(module, name)
AttributeError: module 'nltk.toolbox' has no attribute 'hello'
# This should trigger the loading of nltk.toolbox

>>> nltk.__dict__["toolbox"]      
<LazyModule 'nltk.toolbox'>
# Still refers to the LazyModule. This is wrong, it should point to the real module now.

>>> nltk.__dict__["nltk.toolbox"]
<module 'nltk.toolbox' from 'C:\\GitHub\\nltk\\nltk\\toolbox.py'>
# This entry has been created in nltk.__dict__, but is never directly used (I believe)

Certain uses of nltk.toolbox.bla will then still go through LazyModule's __getattr__, while the LazyModule should have replaced itself completely.

The primary reason that this is an issue is because tools that iterate over nltk.__dict__ will get RuntimeError: dictionary changed size during iteration:

> pytest --doctest-modules .\nltk\
===================================================================================== test session starts =====================================================================================
platform win32 -- Python 3.9.4, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: C:\GitHub\nltk
plugins: cov-2.12.1, forked-1.3.0, mock-3.6.1, xdist-2.4.0
collected 736 items / 2 errors / 734 selected

=========================================================================================== ERRORS ============================================================================================
______________________________________________________________________________ ERROR collecting nltk/__init__.py ______________________________________________________________________________ 
C:\Users\Tom\AppData\Local\Programs\Python\Python39\lib\doctest.py:939: in find
    self._find(tests, obj, name, module, source_lines, globs, {})
.env39\lib\site-packages\_pytest\doctest.py:522: in _find
    doctest.DocTestFinder._find(  # type: ignore
C:\Users\Tom\AppData\Local\Programs\Python\Python39\lib\doctest.py:995: in _find
    for valname, val in obj.__dict__.items():
E   RuntimeError: dictionary changed size during iteration
_______________________________________________________________________ ERROR collecting nltk/corpus/reader/markdown.py _______________________________________________________________________ 
nltk\corpus\reader\markdown.py:5: in <module>
    from mdit_plain.renderer import RendererPlain
E   ModuleNotFoundError: No module named 'mdit_plain'
====================================================================================== warnings summary ======================================================================================= 
nltk\test\unit\test_tokenize.py:23
  C:\GitHub\nltk\nltk\test\unit\test_tokenize.py:23: DeprecationWarning:
  The StanfordTokenizer will be deprecated in version 3.2.5.
  Please use nltk.parse.corenlp.CoreNLPTokenizer instead.'
    seg = StanfordSegmenter()

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=================================================================================== short test summary info =================================================================================== 
ERROR nltk/__init__.py - RuntimeError: dictionary changed size during iteration
ERROR nltk/corpus/reader/markdown.py - ModuleNotFoundError: No module named 'mdit_plain'
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
================================================================================ 1 warning, 2 errors in 28.85s ================================================================================

Changes

After this PR, rather than setting nltk.__dict__["nltk.toolbox"] to be the toolbox module, nltk.__dict__["toolbox"] is overridden to be the toolbox module directly.

The output of the program above is now:

>>> import nltk
>>> nltk.__dict__["toolbox"]
<LazyModule 'nltk.toolbox'>
>>> nltk.__dict__["nltk.toolbox"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'nltk.toolbox'
>>> nltk.toolbox.hello
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\GitHub\nltk\nltk\lazyimport.py", line 121, in __getattr__
    return getattr(module, name)
AttributeError: module 'nltk.toolbox' has no attribute 'hello'
>>> nltk.__dict__["toolbox"]  
<module 'nltk.toolbox' from 'C:\\GitHub\\nltk\\nltk\\toolbox.py'>
>>> nltk.__dict__["nltk.toolbox"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'nltk.toolbox'

Which is my expected behaviour.

Beyond that, pytest --doctest-modules ./nltk works now:

> pytest --doctest-modules .\nltk\        
===================================================================================== test session starts =====================================================================================
platform win32 -- Python 3.9.4, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: C:\GitHub\nltk
plugins: cov-2.12.1, forked-1.3.0, mock-3.6.1, xdist-2.4.0
collected 736 items

nltk\collections.py ....                                                                                                                                                                 [  0%]
nltk\data.py ...                                                                                                                                                                         [  0%]
nltk\decorators.py ..                                                                                                                                                                    [  1%]
nltk\downloader.py s                                                                                                                                                                     [  1%] 
nltk\featstruct.py ...                                                                                                                                                                   [  1%]
nltk\internals.py .......                                                                                                                                                                [  2%]
nltk\probability.py .........                                                                                                                                                            [  3%]
nltk\text.py ..F...                                                                                                                                                                      [  4%]
nltk\tgrep.py .                                                                                                                                                                          [  4%]
nltk\util.py ...............                                                                                                                                                             [  6%]
nltk\wsd.py .                                                                                                                                                                            [  7%]
...
======================================================= 52 failed, 645 passed, 30 skipped, 9 xfailed, 25 warnings in 135.67s (0:02:15) ========================================================

This is a very relevant and important change for #2989, as it will allow us to execute our doctests!

Note also that this change has no further negative effects, i.e. the LazyModule modules are still accessible like always.
I also moved the Markdown corpus reader imports into the __init__, to prevent the test suite from having to import those as dependencies. I'm open to alternatives here, as I'm not sure that's the best solution.

  • Tom Aarsen

Allows pytest --doctest-modules nltk to be executed
@stevenbird stevenbird merged commit 9bf5908 into nltk:develop Sep 3, 2022
@stevenbird
Copy link
Member

Thanks @tomaarsen.
And especially for a fix to such a long-standing and central piece of NLTK, early work by @edloper

@tomaarsen tomaarsen deleted the enhancement/lazyloader_dict branch September 3, 2022 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants