Grammar fixes #2884

stefkauf · 2021-11-14T05:49:46Z

No description provided.

Conversion to Chomsky Normal Form (CNF), implemented as the '.chomsky_normal_form()' method of grammar.CFG instances, was incomplete. In particular, it could not deal with empty productions or mixed productions (with terminals and non-terminals co-occurring on the right-hand side). The new implementation removes those shortcomings. As in the earlier version, the individual steps of the conversion are carried out by calling class methods.
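To illustrate what handling a mixed production involves, here is a minimal sketch of the textbook terminal-lifting step. It is not the code in this PR: productions are plain (lhs, rhs) tuples rather than NLTK Production objects, the function name lift_terminals and the T0, T1, ... naming scheme are hypothetical, and the caller is assumed to say which symbols are terminals.

```python
def lift_terminals(productions, terminals):
    """Replace terminals inside right-hand sides longer than one symbol
    by fresh nonterminals, and add one lexical rule per lifted terminal.

    productions: list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    terminals:   set of symbols to treat as terminals (assumption of this sketch).
    The fresh names T0, T1, ... are assumed not to clash with existing symbols.
    """
    fresh = {}   # terminal -> new nonterminal, in order of first use
    lifted = []
    for lhs, rhs in productions:
        if len(rhs) > 1:
            new_rhs = []
            for sym in rhs:
                if sym in terminals:
                    if sym not in fresh:
                        fresh[sym] = f"T{len(fresh)}"
                    sym = fresh[sym]
                new_rhs.append(sym)
            rhs = tuple(new_rhs)
        lifted.append((lhs, rhs))
    # One lexical production per lifted terminal, e.g. T0 -> 'a'.
    lifted.extend((nt, (term,)) for term, nt in fresh.items())
    return lifted


rules = [("S", ("a", "S", "b")), ("S", ())]
result = lift_terminals(rules, {"a", "b"})
```

After this step the mixed rule S -> 'a' S 'b' becomes S -> T0 S T1, with T0 -> 'a' and T1 -> 'b' added. The rule is still ternary and the empty production is untouched, so binarisation and empty-rule elimination would remain as separate steps.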

tomaarsen · 2021-11-16T17:00:59Z

Hey @stefkauf! You might've noticed that the tests are failing for this. The reason for this is explained in this comment: #2822 (comment). I hope this helps.

stefkauf · 2021-11-16T18:05:04Z

Thanks, @tomaarsen, the problem was a bit mysterious to me. I've fixed the files with pre-commit and updated the pull request.

tomaarsen · 2021-11-16T18:15:16Z

I understand! I'm still unsure how to help new contributors with it in an intuitive way.

tomaarsen · 2021-11-17T08:12:18Z

Upon reading your commit comment:

Conversion to Chomsky Normal Form (CNF), implemented as the
'.chomsky_normal_form()' method in grammar.CFG instances, was
incomplete. In particular, it could not deal with empty productions
or mixed productions (with terminals and non-terminals co-occurring
on the right-hand side). The new implementation removes those
shortcomings. Like in the earlier version, the individual steps of
conversion are carried out by calling class methods.

I presume these are related to the following Errors:

nltk/nltk/grammar.py

Lines 746 to 749 in 68e4e58

    
    if self.productions(empty=True):
        raise ValueError(
            "Grammar has Empty rules. " "Cannot deal with them at the moment"
        )

nltk/nltk/grammar.py

Lines 751 to 756 in 68e4e58

    
    # check for mixed rules
    for rule in self.productions():
        if rule.is_lexical() and len(rule.rhs()) > 1:
            raise ValueError(
                f"Cannot handled mixed rule {rule.lhs()} => {rule.rhs()}"
            )

Would it be possible to provide some tests that cover these new cases?
For example in:

nltk/test/grammar.doctest (which can be tested with pytest nltk/test/grammar.doctest)
nltk/test/grammartestsuites.doctest
nltk/test/parse.doctest
or if you prefer pytest, then in a new file under nltk/test/unit/test_grammar.py.

That way, even people like myself who know about Chomsky Normal Form, but don't know the algorithm, can get some confidence in the correctness of the implementation.
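One kind of test that needs no knowledge of the conversion algorithm is a purely structural check on the output. The following is a minimal sketch under stated assumptions: it does not use NLTK, productions are (lhs, rhs) tuples, and the helper name is_cnf and the explicit terminals argument are hypothetical. It does not check the additional textbook condition that the start symbol must not occur on a right-hand side when the start symbol has an empty production.

```python
def is_cnf(productions, start, terminals):
    """Return True if every production has one of the shapes
    A -> B C, A -> 'a', or start -> (empty)."""
    for lhs, rhs in productions:
        if len(rhs) == 0:
            # Only the start symbol may derive the empty string.
            if lhs != start:
                return False
        elif len(rhs) == 1:
            # Unary rules must be lexical.
            if rhs[0] not in terminals:
                return False
        elif len(rhs) == 2:
            # Binary rules must consist of nonterminals only.
            if rhs[0] in terminals or rhs[1] in terminals:
                return False
        else:
            return False
    return True


mixed = [("S", ("a", "S", "b")), ("S", ())]
normal = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
```

Here is_cnf(mixed, "S", {"a", "b"}) is False and is_cnf(normal, "S", {"a", "b"}) is True. An empty list of productions also passes, which matches the empty-language case.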

Context-free grammars can have an empty set of productions. Conversion to Chomsky Normal Form yields an empty list of productions if the language of the CFG is empty, i.e., does not even contain the empty string (for example, the grammar [S -> S]). The grammar.CFG class did not allow for this case because some of the methods that determine properties of the grammar (e.g., self.is_emtpy, self.is_binary) referred to the minimum or maximum length of the productions. If the list of productions is empty, Python's built-in 'min' and 'max' functions raise a ValueError. This is fixed now: the methods determining properties of the grammar are rewritten using 'all', which returns True on an empty list. The attributes self._min_len and self._max_len were left in place and are set to 'None' if there are no productions.
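The difference between the two approaches can be seen in a small standalone sketch. The function names below are hypothetical and operate on a plain list of right-hand-side lengths rather than on a grammar.CFG instance.

```python
def is_binarised_with_max(rhs_lengths):
    # Fails on an empty list: max() of an empty sequence raises ValueError.
    return max(rhs_lengths) <= 2


def is_binarised_with_all(rhs_lengths):
    # all() over an empty iterable is True, so an empty grammar is handled.
    return all(n <= 2 for n in rhs_lengths)


def min_len_or_none(rhs_lengths):
    # Mirrors the idea of setting the cached length to None
    # when there are no productions.
    return min(rhs_lengths) if rhs_lengths else None


no_productions = []
try:
    is_binarised_with_max(no_productions)
    max_raised = False
except ValueError:
    max_raised = True
```

With no productions, the max-based check raises, the all-based check returns True, and min_len_or_none returns None.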

stefkauf · 2021-11-19T05:30:38Z

I've been working on some doctest examples. But it is tricky because the requirements for matching outputs are so stringent. For instance, the following grammar for the language a^nb^n:

    S -> 'a' S 'b'
    S ->

was on one run converted to this one, which I put in the doctest file:

    S0 -> 
    S0 -> T0 B1
    S0 -> T0 T1
    B0 -> S T1
    B1 -> S T1
    S -> T0 B0
    S -> T0 T1
    T0 -> 'a'
    T1 -> 'b'

But sometimes I get a different (but equivalent) result, for instance this one, which differs only in the order in which the new nonterminals were created:

    S0 -> 
    S0 -> T1 B0
    S0 -> T1 T0
    B0 -> S T0
    B1 -> S T0
    S -> T1 B1
    S -> T1 T0
    T0 -> 'b'
    T1 -> 'a'

There is some arbitrariness in the order in which new nonterminals are created because I use sets. I'm a bit reluctant to change this just to make doctest stop complaining. Would it be acceptable to use some other way to illustrate how it works? For instance, I could put a 'demo' function at the end of grammar.py, for users to invoke.
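One way to compare two such outputs without depending on the names of the new nonterminals is to compare the strings they accept rather than the printed grammars. The sketch below is an assumption-laden illustration, not part of this PR: it uses a small CYK recogniser over (lhs, rhs) tuples instead of NLTK objects, the function names are hypothetical, and agreement up to a bounded length is evidence of equivalence, not a proof.

```python
from itertools import product


def cyk_accepts(productions, start, tokens):
    """CYK recognition for a grammar in CNF given as (lhs, rhs) tuples."""
    n = len(tokens)
    if n == 0:
        return (start, ()) in productions
    # spans[(i, length)] = nonterminals deriving tokens[i:i + length]
    spans = {}
    for i, tok in enumerate(tokens):
        spans[(i, 1)] = {lhs for lhs, rhs in productions if rhs == (tok,)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            cell = set()
            for split in range(1, length):
                left = spans[(i, split)]
                right = spans[(i + split, length - split)]
                for lhs, rhs in productions:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        cell.add(lhs)
            spans[(i, length)] = cell
    return start in spans[(0, n)]


def same_language_up_to(g1, g2, start, alphabet, max_len):
    """Check that both grammars accept the same strings of length <= max_len."""
    for n in range(max_len + 1):
        for tokens in product(alphabet, repeat=n):
            if cyk_accepts(g1, start, tokens) != cyk_accepts(g2, start, tokens):
                return False
    return True


# The two conversion results quoted above, transcribed as tuples.
first = [
    ("S0", ()), ("S0", ("T0", "B1")), ("S0", ("T0", "T1")),
    ("B0", ("S", "T1")), ("B1", ("S", "T1")),
    ("S", ("T0", "B0")), ("S", ("T0", "T1")),
    ("T0", ("a",)), ("T1", ("b",)),
]
second = [
    ("S0", ()), ("S0", ("T1", "B0")), ("S0", ("T1", "T0")),
    ("B0", ("S", "T0")), ("B1", ("S", "T0")),
    ("S", ("T1", "B1")), ("S", ("T1", "T0")),
    ("T0", ("b",)), ("T1", ("a",)),
]

agree = same_language_up_to(first, second, "S0", "ab", 6)
```

A test written this way passes for either ordering of T0 and T1, at the cost of being slower and less readable than a direct comparison of the printed productions.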

The earlier version created new non-terminals in an unpredictable order because iteration over sets was used. This led to spurious failures of doctest. The new version uses order-preserving containers (dicts or, where possible, lists) to eliminate the arbitrariness.

iliakur · 2021-11-19T19:11:23Z

You may also want to consider unit tests for this. Stick with the ones that give you the biggest bang for the buck: simple to write and can cover more ground. I would argue pytest may be a better choice here than doctest.

stefkauf · 2021-11-19T20:42:37Z

@iliakur Actually, in my latest update I've solved the problem by using dictionaries instead of sets where the order mattered. Since dictionaries preserve order of insertion, that eliminated the arbitrary variation.
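The effect can be shown with a short standalone sketch; the variable names are illustrative and not taken from grammar.py. Iterating over a set of strings can give a different order from one interpreter run to the next because of hash randomisation, whereas a dict yields its keys in insertion order.

```python
symbols = ["b", "a", "b", "c", "a"]

# dict.fromkeys removes duplicates while keeping first-seen order.
unique_in_order = list(dict.fromkeys(symbols))

# Names assigned from that order are the same on every run.
names = {sym: f"T{i}" for i, sym in enumerate(unique_in_order)}

# By contrast, the iteration order of set(symbols) is not guaranteed,
# so names assigned while looping over it could differ between runs.
unordered = set(symbols)
```

Here unique_in_order is ['b', 'a', 'c'] and names maps 'b' to 'T0', 'a' to 'T1' and 'c' to 'T2', regardless of the hash seed.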

tomaarsen · 2021-11-20T00:25:53Z

Well done on adding those doctests! I've handled a merge conflict between this PR and changes introduced from #2888. I'll have a look if I can inspect the implementation itself more thoroughly soon.

tomaarsen

As mentioned, I'll make time to look at the rest of the PR too, but this was something I noticed quickly.

nltk/grammar.py

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

tomaarsen · 2021-12-01T15:25:52Z

Apologies, I have not yet had time. I've been quite busy.

stevenbird · 2022-07-05T12:45:17Z

nltk/grammar.py

@@ -734,114 +735,456 @@ def is_chomsky_normal_form(self):
        """
        return self.is_flexible_chomsky_normal_form() and self._all_unary_are_lexical

-    def chomsky_normal_form(self, new_token_padding="@$@", flexible=False):
+    ##################################################
+    # Stefan Kaufmann's proposed changes


@stefkauf can we drop such comments please, and just identify you as a co-author at the top of the module?

stevenbird · 2022-07-05T12:45:37Z

nltk/grammar.py

+        return cls( grammar.start(), list(new_productions))
+
+
+    # End of Stefan Kaufmann's proposed changes


... and this

stevenbird · 2022-07-05T12:45:46Z

nltk/grammar.py

+    ##################################################
+
+    ##################################################
+    # Code to be replaced by Stefan Kaufmann's version


... and all this

stevenbird · 2022-07-05T12:50:48Z

@stefkauf thanks for this great contribution... I've reviewed the code, and just think some of your comments can go, as noted, then we can merge this.

stefkauf · 2022-07-06T11:26:32Z

@stevenbird Thanks, Steven, for your messages. I've removed those comments and committed the new version of grammar.py.

rmalouf · 2024-03-06T00:12:38Z

What is the status of this @stevenbird?

stefkauf · 2024-04-10T15:50:50Z

Hi Rob, you didn't get a reply about this, did you? I can't tell you what became of it. I have the impression that NLTK is not managed all that actively anymore, but I might be wrong. In any case, I'm not sure. Hope things are well, Stefan

--- Department of Linguistics, University of Connecticut http://stefan-kaufmann.uconn.edu/

stevenbird · 2024-04-11T00:30:01Z

Hi Rob and Stefan, I'd welcome an updated PR and will try to turn it around quickly.

stefkauf · 2024-04-12T13:48:45Z

I've resolved a merge conflict; it should be good now.

stefkauf added 2 commits November 14, 2021 00:05

Added Stefan Kaufmann to AUTHORS.md

6529dc6

Formatting fixes by pre-commit.

e8164f1

tomaarsen added enhancement parsing labels Nov 17, 2021

stefkauf added 2 commits November 17, 2021 23:12

Miscellaneous changes to improve readability and documentation.

6b3ec5b

stefkauf added 2 commits November 19, 2021 10:06

Minor formatting changes

bf61269

Merge branch 'develop' into grammar-fixes

906ef2c

tomaarsen reviewed Nov 20, 2021

stefkauf and others added 2 commits November 22, 2021 15:50

Update nltk/grammar.py

a7dd2b4

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

Update nltk/grammar.py

e42c0ef

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>

stevenbird reviewed Jul 5, 2022

stefkauf added 2 commits July 6, 2022 13:04

Remove extraneous comments

3169975

Merge branch 'grammar-fixes' of https://github.com/stefkauf/nltk into…

9cb2e27

… grammar-fixes Remove extraneous comments

Empty Commit; Force the CI to refresh

cfe33b8

Merge branch 'develop' into grammar-fixes

a38da33

github-actions bot removed the parsing label Apr 12, 2024
