Allow empty string in CFG's + more #2888
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Resolves #1890
Hello!
Pull request overview
nltk.parse.generate
+ doctest.import nltk
.Empty strings as terminals in CFG's
This corresponds with #1890. Here, it shows that the following grammar throws an exception:
fails with
However, this is definitely not a case of an Unterminated string. The crux here is this regex:
nltk/nltk/grammar.py
Line 1320 in 68e4e58
Simply said, it requires either:
"[^"]+"
or'[^']+'
, which requires 1 or more tokens to be inside of quotes. However, I don't believe there is a reason this should be "1 or more". Replacing the+
with*
will solve the above problem.Consequences
As a result, people can now do, e.g.:
now produces
(This previously crashed)
Do note that similar functionality was already possible, but with empty productions, not empty strings as productions:
has always produced:
I've created doctests for these consequences.
Fixed issue with RecursionError in
generate
nltk.parse.generate
can be used with infinite grammars as long asdepth
is provided:produces
However, if it's not provided, the output is this:
This is definitely not the traceback we expect! The issue is here:
nltk/nltk/parse/generate.py
Lines 45 to 52 in 68e4e58
I've resolved this by catching
RecursionError
instead, instead of relying on amessage
of the error object. Now the output is:That's more like what we would expect to see. I've made a doctest for this.
Moved unnecessary PCFG's into the relevant methods
In
nltk.grammar
, these two pcfg's were being made:nltk/nltk/grammar.py
Lines 1534 to 1573 in 68e4e58
These are used elsewhere in
demo
functions, and in doctests. However, they are being compiled whenever a user performsimport nltk
, while a user will never use these variables. As a result, I've copied them into the places that require them. This way, importing nltk will be very slightly faster.Feel free to look at the .doctest files to see some more examples of these changes.