Allow empty string in CFG's + more #2888

tomaarsen · 2021-11-17T10:33:42Z

Resolves #1890

Hello!

Pull request overview

Allow for empty strings as terminals in CFG's + doctests.
Fixed issue with infinite recursion in nltk.parse.generate + doctest.
Moved unnecessary PCFG's into the relevant methods, so they don't get created whenever someone does import nltk.

Empty strings as terminals in CFG's

This corresponds with #1890. Here, it shows that the following grammar throws an exception:

from nltk.grammar import CFG
grammar = CFG.fromstring("""
S -> A B
A -> 'a'
B -> 'b' | ''
""")

fails with

ValueError: Unable to parse line 5: B -> 'b' | ''
Unterminated string

However, this is definitely not a case of an Unterminated string. The crux here is this regex:

nltk/nltk/grammar.py

Line 1320 in 68e4e58

_TERMINAL_RE = re.compile(r'( "[^"]+" | \'[^\']+\' ) \s*', re.VERBOSE)

Simply said, it requires either: "[^"]+" or '[^']+' , which requires 1 or more tokens to be inside of quotes. However, I don't believe there is a reason this should be "1 or more". Replacing the + with * will solve the above problem.

Consequences

As a result, people can now do, e.g.:

from nltk.grammar import CFG
from nltk.parse.generate import generate
grammar = CFG.fromstring("""
S -> A B
A -> 'a'
# An empty string:
B -> 'b' | ''
""")
print(list(generate(grammar)))

now produces

[['a', 'b'], ['a', '']]

(This previously crashed)

Do note that similar functionality was already possible, but with empty productions, not empty strings as productions:

grammar = CFG.fromstring("""
S -> A B
A -> 'a'
# An empty production:
B -> 'b' |
""")
print(list(generate(grammar)))

has always produced:

[['a', 'b'], ['a']]

I've created doctests for these consequences.

Fixed issue with RecursionError in `generate`

nltk.parse.generate can be used with infinite grammars as long as depth is provided:

grammar = CFG.fromstring("""
S -> A | A S
A -> "a"
""")
print(list(generate(grammar, depth=10)))

produces

[['a'], ['a', 'a'], ['a', 'a', 'a'], ['a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']]

However, if it's not provided, the output is this:

Traceback (most recent call last):
...
  File "[sic]\nltk\parse\generate.py", line 42, in _generate_all
    for frag1 in _generate_one(grammar, items[0], depth):
  File "[sic]\nltk\parse\generate.py", line 61, in _generate_one
    yield from _generate_all(grammar, prod.rhs(), depth - 1)
  File "[sic]\nltk\parse\generate.py", line 43, in _generate_all
    for frag2 in _generate_all(grammar, items[1:], depth):
  File "[sic]\nltk\parse\generate.py", line 46, in _generate_all
    if _error.message == "maximum recursion depth exceeded":
AttributeError: 'RecursionError' object has no attribute 'message'

This is definitely not the traceback we expect! The issue is here:

nltk/nltk/parse/generate.py

Lines 45 to 52 in 68e4e58

    
           except RuntimeError as _error: 
        
               if _error.message == "maximum recursion depth exceeded": 
        
                   # Helpful error message while still showing the recursion stack. 
        
                   raise RuntimeError( 
        
                       "The grammar has rule(s) that yield infinite recursion!!" 
        
                   ) from _error 
        
               else: 
        
                   raise

I've resolved this by catching RecursionError instead, instead of relying on a message of the error object. Now the output is:

Traceback (most recent call last):
...
  File "[sic]\nltk\parse\generate.py", line 42, in _generate_all
    for frag1 in _generate_one(grammar, items[0], depth):
  File "[sic]\nltk\parse\generate.py", line 61, in _generate_one
    yield from _generate_all(grammar, prod.rhs(), depth - 1)
  File "[sic]\nltk\parse\generate.py", line 43, in _generate_all
    for frag2 in _generate_all(grammar, items[1:], depth):
RuntimeError: The grammar has rule(s) that yield infinite recursion!

That's more like what we would expect to see. I've made a doctest for this.

Moved unnecessary PCFG's into the relevant methods

In nltk.grammar, these two pcfg's were being made:

nltk/nltk/grammar.py

Lines 1534 to 1573 in 68e4e58

    
           toy_pcfg1 = PCFG.fromstring( 
        
               """ 
        
               S -> NP VP [1.0] 
        
               NP -> Det N [0.5] | NP PP [0.25] | 'John' [0.1] | 'I' [0.15] 
        
               Det -> 'the' [0.8] | 'my' [0.2] 
        
               N -> 'man' [0.5] | 'telescope' [0.5] 
        
               VP -> VP PP [0.1] | V NP [0.7] | V [0.2] 
        
               V -> 'ate' [0.35] | 'saw' [0.65] 
        
               PP -> P NP [1.0] 
        
               P -> 'with' [0.61] | 'under' [0.39] 
        
               """ 
        
           ) 
        
           toy_pcfg2 = PCFG.fromstring( 
        
               """ 
        
               S    -> NP VP         [1.0] 
        
               VP   -> V NP          [.59] 
        
               VP   -> V             [.40] 
        
               VP   -> VP PP         [.01] 
        
               NP   -> Det N         [.41] 
        
               NP   -> Name          [.28] 
        
               NP   -> NP PP         [.31] 
        
               PP   -> P NP          [1.0] 
        
               V    -> 'saw'         [.21] 
        
               V    -> 'ate'         [.51] 
        
               V    -> 'ran'         [.28] 
        
               N    -> 'boy'         [.11] 
        
               N    -> 'cookie'      [.12] 
        
               N    -> 'table'       [.13] 
        
               N    -> 'telescope'   [.14] 
        
               N    -> 'hill'        [.5] 
        
               Name -> 'Jack'        [.52] 
        
               Name -> 'Bob'         [.48] 
        
               P    -> 'with'        [.61] 
        
               P    -> 'under'       [.39] 
        
               Det  -> 'the'         [.41] 
        
               Det  -> 'a'           [.31] 
        
               Det  -> 'my'          [.28] 
        
               """ 
        
           )

These are used elsewhere in demo functions, and in doctests. However, they are being compiled whenever a user performs import nltk, while a user will never use these variables. As a result, I've copied them into the places that require them. This way, importing nltk will be very slightly faster.

Feel free to look at the .doctest files to see some more examples of these changes.

Tom Aarsen

stevenbird · 2021-11-18T12:47:36Z

@tomaarsen - this is great!

tomaarsen added 7 commits November 17, 2021 10:35

Allow for empty string terminals

fb54a79

Move toy pcfg's into pcfg_demo

cba9e75

Deal with removing toy_pcfg1 and 2 from nltk/grammar.py

6b754b4

Added doctests for empty strings/productions and their consequences

65a28ab

Reworked RecursionError

2baf8b1

Add doctest for generate RecursionError

e8ee10f

Fixed issue in incorrect explanation of doctest

34c1a96

tomaarsen added enhancement parsing labels Nov 17, 2021

Add as contributor

2bece9a

stevenbird merged commit 7fb092a into nltk:develop Nov 18, 2021

tomaarsen mentioned this pull request Nov 20, 2021

Grammar fixes #2884

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow empty string in CFG's + more #2888

Allow empty string in CFG's + more #2888

tomaarsen commented Nov 17, 2021

stevenbird commented Nov 18, 2021

	except RuntimeError as _error:
	if _error.message == "maximum recursion depth exceeded":
	# Helpful error message while still showing the recursion stack.
	raise RuntimeError(
	"The grammar has rule(s) that yield infinite recursion!!"
	) from _error
	else:
	raise

	toy_pcfg1 = PCFG.fromstring(
	"""
	S -> NP VP [1.0]
	NP -> Det N [0.5] \| NP PP [0.25] \| 'John' [0.1] \| 'I' [0.15]
	Det -> 'the' [0.8] \| 'my' [0.2]
	N -> 'man' [0.5] \| 'telescope' [0.5]
	VP -> VP PP [0.1] \| V NP [0.7] \| V [0.2]
	V -> 'ate' [0.35] \| 'saw' [0.65]
	PP -> P NP [1.0]
	P -> 'with' [0.61] \| 'under' [0.39]
	"""
	)

	toy_pcfg2 = PCFG.fromstring(
	"""
	S -> NP VP [1.0]
	VP -> V NP [.59]
	VP -> V [.40]
	VP -> VP PP [.01]
	NP -> Det N [.41]
	NP -> Name [.28]
	NP -> NP PP [.31]
	PP -> P NP [1.0]
	V -> 'saw' [.21]
	V -> 'ate' [.51]
	V -> 'ran' [.28]
	N -> 'boy' [.11]
	N -> 'cookie' [.12]
	N -> 'table' [.13]
	N -> 'telescope' [.14]
	N -> 'hill' [.5]
	Name -> 'Jack' [.52]
	Name -> 'Bob' [.48]
	P -> 'with' [.61]
	P -> 'under' [.39]
	Det -> 'the' [.41]
	Det -> 'a' [.31]
	Det -> 'my' [.28]
	"""
	)

Allow empty string in CFG's + more #2888

Allow empty string in CFG's + more #2888

Conversation

tomaarsen commented Nov 17, 2021

Pull request overview

Empty strings as terminals in CFG's

Consequences

Fixed issue with RecursionError in generate

Moved unnecessary PCFG's into the relevant methods

stevenbird commented Nov 18, 2021

Fixed issue with RecursionError in `generate`