New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grammar fixes #2884
base: develop
Are you sure you want to change the base?
Grammar fixes #2884
Conversation
Conversion to Chomsky Normal Form (CNF), implemented as the '.chomsky_normal_form()' method in grammar.CFG instances, was incomplete. In particular, it could not deal with empty productions or mixed productions (with terminals and non-terminals co-occuring on the right-hand side). The new implementation removes those shortcomings. Like in the earlier version, the individual steps of conversion are carried out by calling class methods.
Hey @stefkauf! You might've noticed that the tests are failing for this. The reason for this is explained in this comment: #2822 (comment). I hope this helps. |
Thanks, @tomaarsen, the problem was a bit mysterious to me. I've fixed the files with pre-commit and updated the pull request. |
I understand! I'm still unsure how to help new contributors with it in an intuitive way. |
Upon reading your commit comment:
I presume these are related to the following Errors: Lines 746 to 749 in 68e4e58
Lines 751 to 756 in 68e4e58
Would it be possible to provide some tests that cover these new cases?
That way, even people like myself who know about Chomsky Normal Form, but don't know the algorithm, can get some confidence in the correctness of the implementation. |
Context-free grammars can have an empty set of productions. Conversion to Chomsky Normal Form leads to an empty list of productions if the language of the CFG is empty (i.e., does not even contain the empty string - i.e., [S -> S]). The grammar.CFG class did not allow for this case because some of the methods to determine properties of the grammar (e.g., self.is_emtpy, self.is_binary) referred to the minimum or maximum length of the productions. If the list of productions is emtpy, Python's 'min' and 'max' operators throw an exception. This is fixed now: the methods determining properties of the grammar are rewritten using 'all', which succeeds on an empty list. The attributes self._min_len and self._max_len were left in place and are set to 'None' if there are no productions.
I've been working on some doctest examples. But it is tricky because the requirements for matching outputs are so stringent. For instance, the following grammar for the language a^nb^n:
was on one run converted to this one, which I put in the doctest file:
But sometimes I get a different (but equivalent) result, for instance this one, which differs only in the order in which the new nonterminals were created:
There is some arbitrariness in the order in which new nonterminals are created because I use sets. I'm a bit reluctant to change this just to make doctest stop complaining. Would it be acceptable to use some other way to illustrate how it works? For instance, I could put a 'demo' function at the end of grammar.py, for users to invoke. |
The earlier version created new non-terminals in an unpredictable order because iteration over sets was used. This led to spurious failures of doctest. The new version uses order-preserving containers (dicts or, where possible, lists) to eliminate the arbitrariness.
You may also want to consider unit tests for this. Stick with the ones that give you the biggest bang for the buck: simple to write and can cover more ground. I would argue |
@iliakur Actually, in my latest update I've solved the problem by using dictionaries instead of sets where the order mattered. Since dictionaries preserve order of insertion, that eliminated the arbitrary variation. |
Well done on adding those doctests! I've handled a merge conflict between this PR and changes introduced from #2888. I'll have a look if I can inspect the implementation itself more thoroughly soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I'll make time to look at the rest of the PR too, but this was something I noticed quickly.
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Apologies, I have not yet had time. I've been quite busy. |
nltk/grammar.py
Outdated
@@ -734,114 +735,456 @@ def is_chomsky_normal_form(self): | |||
""" | |||
return self.is_flexible_chomsky_normal_form() and self._all_unary_are_lexical | |||
|
|||
def chomsky_normal_form(self, new_token_padding="@$@", flexible=False): | |||
################################################## | |||
# Stefan Kaufmann's proposed changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stefkauf can we drop such comments please, and just identify you as a co-author at the top of the module?
nltk/grammar.py
Outdated
return cls( grammar.start(), list(new_productions)) | ||
|
||
|
||
# End of Stefan Kaufmann's proposed changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and this
nltk/grammar.py
Outdated
################################################## | ||
|
||
################################################## | ||
# Code to be replaced by Stefan Kaufmann's version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and all this
@stefkauf thanks for this great contribution... I've reviewed the code, and just think some of your comments can go, as noted, then we can merge this. |
… grammar-fixes Remove extraneous comments
@stevenbird Thanks, Steven, for your messages. I've removed those comments and committed the new version of grammar.py. |
What is the status of this @stevenbird? |
Hi Rob,
you didn't get a reply about this, did you? I can't tell you what became of it. I have the impression that NLTK is not managed all that actively anymore, but I might be wrong. In any case, I'm not sure.
Hope things are well,
Stefan
…---
Department of Linguistics, University of Connecticut
http://stefan-kaufmann.uconn.edu/
________________________________
From: Rob Malouf ***@***.***>
Sent: Tuesday, March 5, 2024 7:12 PM
To: nltk/nltk ***@***.***>
Cc: Kaufmann, Stefan ***@***.***>; Mention ***@***.***>
Subject: Re: [nltk/nltk] Grammar fixes (PR #2884)
*Message sent from a system outside of UConn.*
What is the status of this?
—
Reply to this email directly, view it on GitHub<#2884 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AWNJMKBO3GBHCAMYLHMVEX3YWZNQHAVCNFSM5H7O7B5KU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJXHE4DIOBWGE3Q>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Hi Rob and Stefan I'd welcome an updated PR and will try to turn it around quickly |
I've resolved a merge conflict, it should be good now. |
No description provided.