Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow empty string in CFG's + more #2888

Merged
merged 8 commits into from Nov 18, 2021

Conversation

tomaarsen
Copy link
Member

Resolves #1890

Hello!

Pull request overview

  • Allow for empty strings as terminals in CFG's + doctests.
  • Fixed issue with infinite recursion in nltk.parse.generate + doctest.
  • Moved unnecessary PCFG's into the relevant methods, so they don't get created whenever someone does import nltk.

Empty strings as terminals in CFG's

This corresponds with #1890. Here, it shows that the following grammar throws an exception:

from nltk.grammar import CFG
grammar = CFG.fromstring("""
S -> A B
A -> 'a'
B -> 'b' | ''
""")

fails with

ValueError: Unable to parse line 5: B -> 'b' | ''
Unterminated string

However, this is definitely not a case of an Unterminated string. The crux here is this regex:

_TERMINAL_RE = re.compile(r'( "[^"]+" | \'[^\']+\' ) \s*', re.VERBOSE)

Simply said, it requires either: "[^"]+" or '[^']+' , which requires 1 or more tokens to be inside of quotes. However, I don't believe there is a reason this should be "1 or more". Replacing the + with * will solve the above problem.

Consequences

As a result, people can now do, e.g.:

from nltk.grammar import CFG
from nltk.parse.generate import generate
grammar = CFG.fromstring("""
S -> A B
A -> 'a'
# An empty string:
B -> 'b' | ''
""")
print(list(generate(grammar)))

now produces

[['a', 'b'], ['a', '']]

(This previously crashed)

Do note that similar functionality was already possible, but with empty productions, not empty strings as productions:

grammar = CFG.fromstring("""
S -> A B
A -> 'a'
# An empty production:
B -> 'b' |
""")
print(list(generate(grammar)))

has always produced:

[['a', 'b'], ['a']]

I've created doctests for these consequences.

Fixed issue with RecursionError in generate

nltk.parse.generate can be used with infinite grammars as long as depth is provided:

grammar = CFG.fromstring("""
S -> A | A S
A -> "a"
""")
print(list(generate(grammar, depth=10)))

produces

[['a'], ['a', 'a'], ['a', 'a', 'a'], ['a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']]

However, if it's not provided, the output is this:

Traceback (most recent call last):
...
  File "[sic]\nltk\parse\generate.py", line 42, in _generate_all
    for frag1 in _generate_one(grammar, items[0], depth):
  File "[sic]\nltk\parse\generate.py", line 61, in _generate_one
    yield from _generate_all(grammar, prod.rhs(), depth - 1)
  File "[sic]\nltk\parse\generate.py", line 43, in _generate_all
    for frag2 in _generate_all(grammar, items[1:], depth):
  File "[sic]\nltk\parse\generate.py", line 46, in _generate_all
    if _error.message == "maximum recursion depth exceeded":
AttributeError: 'RecursionError' object has no attribute 'message'

This is definitely not the traceback we expect! The issue is here:

except RuntimeError as _error:
if _error.message == "maximum recursion depth exceeded":
# Helpful error message while still showing the recursion stack.
raise RuntimeError(
"The grammar has rule(s) that yield infinite recursion!!"
) from _error
else:
raise

I've resolved this by catching RecursionError instead, instead of relying on a message of the error object. Now the output is:

Traceback (most recent call last):
...
  File "[sic]\nltk\parse\generate.py", line 42, in _generate_all
    for frag1 in _generate_one(grammar, items[0], depth):
  File "[sic]\nltk\parse\generate.py", line 61, in _generate_one
    yield from _generate_all(grammar, prod.rhs(), depth - 1)
  File "[sic]\nltk\parse\generate.py", line 43, in _generate_all
    for frag2 in _generate_all(grammar, items[1:], depth):
RuntimeError: The grammar has rule(s) that yield infinite recursion!

That's more like what we would expect to see. I've made a doctest for this.

Moved unnecessary PCFG's into the relevant methods

In nltk.grammar, these two pcfg's were being made:

nltk/nltk/grammar.py

Lines 1534 to 1573 in 68e4e58

toy_pcfg1 = PCFG.fromstring(
"""
S -> NP VP [1.0]
NP -> Det N [0.5] | NP PP [0.25] | 'John' [0.1] | 'I' [0.15]
Det -> 'the' [0.8] | 'my' [0.2]
N -> 'man' [0.5] | 'telescope' [0.5]
VP -> VP PP [0.1] | V NP [0.7] | V [0.2]
V -> 'ate' [0.35] | 'saw' [0.65]
PP -> P NP [1.0]
P -> 'with' [0.61] | 'under' [0.39]
"""
)
toy_pcfg2 = PCFG.fromstring(
"""
S -> NP VP [1.0]
VP -> V NP [.59]
VP -> V [.40]
VP -> VP PP [.01]
NP -> Det N [.41]
NP -> Name [.28]
NP -> NP PP [.31]
PP -> P NP [1.0]
V -> 'saw' [.21]
V -> 'ate' [.51]
V -> 'ran' [.28]
N -> 'boy' [.11]
N -> 'cookie' [.12]
N -> 'table' [.13]
N -> 'telescope' [.14]
N -> 'hill' [.5]
Name -> 'Jack' [.52]
Name -> 'Bob' [.48]
P -> 'with' [.61]
P -> 'under' [.39]
Det -> 'the' [.41]
Det -> 'a' [.31]
Det -> 'my' [.28]
"""
)

These are used elsewhere in demo functions, and in doctests. However, they are being compiled whenever a user performs import nltk, while a user will never use these variables. As a result, I've copied them into the places that require them. This way, importing nltk will be very slightly faster.


Feel free to look at the .doctest files to see some more examples of these changes.

  • Tom Aarsen

@stevenbird stevenbird merged commit 7fb092a into nltk:develop Nov 18, 2021
@stevenbird
Copy link
Member

@tomaarsen - this is great!

@tomaarsen tomaarsen mentioned this pull request Nov 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CFG.fromstring cannot handle epsilons
2 participants