-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve the last failing doctests, add doctests to CI #3054
Conversation
The space was always just filtered away by '_tokenize_words', so it does not affect the tokenizer. However, it should improve the output of debug_decisions, as there's now no longer the useless space in the 'text' field. Beyond that, it better matches the original vision of the method.
I'm not a big fan of writing to files without the 'with open(...) as f' context manager notation, but this will do just fine.
This works when doing pytest --doctest-modules ./nltk/, but not when calling it for exclusively ./nltk/text.py
Also fix a critical bug in MultiListBox that prevents it from being instantiated. I guess nobody has used this at all?
…PATH If not, they are skipped. Sadly this does make the docstrings a bit more confusing
The moment we've been working towards is here!
These can throw LookupError's if the SENNA env variable is defined, like in the CI. Long story short, it's easiest to skip these, and we won't miss out on test coverage, as Senna should already be tested in test/unit/test_senna.py by the Linux workers
@@ -125,4 +125,4 @@ jobs: | |||
- name: Run pytest | |||
shell: bash | |||
run: | | |||
pytest --numprocesses auto -rsx --doctest-modules nltk/test | |||
pytest --numprocesses auto -rsx --doctest-modules nltk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--doctest-modules
was already always on, but it didn't do much, as it only looked in nltk/test
. Looking in nltk
directly was broken until #3033, but it works now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tomaarsen, this should make CI much more robust. I did not expect that it would be ready so soon!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh!
@@ -30,9 +30,9 @@ | |||
Note: Unit tests for this module can be found in test/unit/test_senna.py | |||
|
|||
>>> from nltk.classify import Senna | |||
>>> pipeline = Senna('/usr/share/senna-v3.0', ['pos', 'chk', 'ner']) | |||
>>> pipeline = Senna('/usr/share/senna-v3.0', ['pos', 'chk', 'ner']) # doctest: +SKIP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All Senna(...)
sections are skipped, as they break on the Windows and Mac CI's, because the SENNA
environment variable does exist, but points to a path that doesn't.
@@ -62,7 +62,7 @@ def __init__(self, senna_path, operations, encoding="utf-8"): | |||
self._path = path.normpath(environ["SENNA"]) + sep | |||
exe_file_2 = self.executable(self._path) | |||
if not path.isfile(exe_file_2): | |||
raise OSError( | |||
raise LookupError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that LookupError
is more logical here. We also use LookupError
in other cases where looking up a certain resource fails.
@@ -206,6 +204,8 @@ def concatenated_view(self, reader, fileids, categories): | |||
) | |||
|
|||
def metadata_reader(self, stream): | |||
from yaml import safe_load | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way we don't need to add pyyaml
to the requirements unnecessarily.
>>> master = SpaceWidget(c, 0, 30) | ||
>>> mlb = MultiListbox(master, 5, label_foreground='red') | ||
>>> root = Tk() # doctest: +SKIP | ||
>>> MultiListbox(root, ["Subject", "Sender", "Date"], label_foreground='red').pack() # doctest: +SKIP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code didn't really make sense: I thought master
or root
was always a tkinter.Tk()
instance. Beyond that, it didn't compile I believe.
I also have to skip these, as Tk()
fails on a device without a display.
@@ -752,6 +752,7 @@ def demo(): | |||
>>> parser_std.train([gold_sent],'temp.arcstd.model', verbose=False) | |||
Number of training examples : 1 | |||
Number of valid (projective) examples : 1 | |||
>>> input_file.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporary files were being removed before they were closed, causing exceptions (and leftover files on my device...)
@@ -270,8 +270,7 @@ def findall(self, regexp): | |||
a single token must be surrounded by angle brackets. E.g. | |||
|
|||
>>> from nltk.text import TokenSearcher | |||
>>> print('hack'); from nltk.book import text1, text5, text9 | |||
hack | |||
>>> from nltk.book import text1, text5, text9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hack is unnecessary here. The tests pass like this, too.
text7: Wall Street Journal | ||
text8: Personals Corpus | ||
text9: The Man Who Was Thursday by G . K . Chesterton 1908 | ||
>>> from nltk.book import text4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is a bit tricky. nltk.book
prints some stuff when imported, which is what we were testing for. However, it only prints it the first time it is imported.
if you do pytest --doctest-modules nltk/text.py
, then it imports nltk.book
for the first time here, and then it expects some printed output, but if you do pytest --doctest-modules ./nltk
, then it imports it elsewhere differently (probably in some pytest code that traverses everything under ./nltk
), and then we expect this import to print nothing.
Long story short: the new situation passes on pytest --doctest-modules nltk
(which is the important case), but fails on pytest --doctest-modules nltk/text.py
(which is the unimportant case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
last_space_index = previous_slice.start | ||
prev_word_slice = slice(last_space_index, match.start()) | ||
index_after_last_space = previous_slice.start | ||
prev_word_slice = slice(index_after_last_space, match.start()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method, which is supposed to return a "context" word that comes before a specific token of interest, used to also return the whitespace that preceded that word. I've fixed this now, so it doesn't return the whitespace anymore. It should not affect the tokenization, as that already strip()
-ed away the extra whitespace anyways.
regex | ||
scikit-learn | ||
tqdm | ||
twython |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are (now) required to run the CI due to the doctests.
I'm going to make some small changes to how the CoreNLP tests are skipped. Currently, they are all skipped in the CI, which I also want to change. |
That way, if that file changes, it should cause it to refresh the cache
This looks good!
|
Nice work @tomaarsen! Thanks @ekaf. |
Resolves #2989
Hello!
Pull request overview
requirements-ci.txt
withclick
,python-crfsuite
andtqdm
as they are required for doctests.tkinter
, astkinter.Tk()
requires a display to exist.OSError
withLookupError
when SENNA can't be found.MultiListBox
that prevents it from being instantiated.full 2018-10-05
to the most recent4.5.1
.tools/github_actions/third-party.sh
is updated.check_jar
function which callspytest.skip
if a jar is unavailable, and otherwise does nothing.Details
The changes are primarily very small and few and far between, so I'm going to add comments to the changed code rather than describing the changes in this PR text itself. That way, it's easier to see which comment refers to what piece of code. The only exception is to CoreNLP, which is a somewhat big change:
CoreNLP
I've reworked the doctests in that file to include
server = CoreNLPServer(...)
,server.start()
,server.stop()
orwith CoreNLPServer() as server:
. That way, the tests actually do pass now if the CoreNLP jars are onCLASSPATH
.Thanks to @ekaf and @Madnex for working on the other doctests & compiling the broken ones.