Grammar fixes #2884

Open
wants to merge 14 commits into
base: develop

stefkauf

No description provided.

Conversion to Chomsky Normal Form (CNF), implemented as the
'.chomsky_normal_form()' method in grammar.CFG instances, was
incomplete. In particular, it could not deal with empty productions
or mixed productions (with terminals and non-terminals co-occurring
on the right-hand side). The new implementation removes those
shortcomings. As in the earlier version, the individual steps of
conversion are carried out by calling class methods.
@tomaarsen
Member

Hey @stefkauf! You might've noticed that the tests are failing for this. The reason for this is explained in this comment: #2822 (comment). I hope this helps.

@stefkauf
Author

Thanks, @tomaarsen, the problem was a bit mysterious to me. I've fixed the files with pre-commit and updated the pull request.

@tomaarsen
Member

I understand! I'm still unsure how to help new contributors with it in an intuitive way.

@tomaarsen
Member

tomaarsen commented Nov 17, 2021

Upon reading your commit comment:

Conversion to Chomsky Normal Form (CNF), implemented as the
'.chomsky_normal_form()' method in grammar.CFG instances, was
incomplete. In particular, it could not deal with empty productions
or mixed productions (with terminals and non-terminals co-occurring
on the right-hand side). The new implementation removes those
shortcomings. As in the earlier version, the individual steps of
conversion are carried out by calling class methods.

I presume these are related to the following errors:

nltk/nltk/grammar.py

Lines 746 to 749 in 68e4e58

    if self.productions(empty=True):
        raise ValueError(
            "Grammar has Empty rules. " "Cannot deal with them at the moment"
        )

nltk/nltk/grammar.py

Lines 751 to 756 in 68e4e58

    # check for mixed rules
    for rule in self.productions():
        if rule.is_lexical() and len(rule.rhs()) > 1:
            raise ValueError(
                f"Cannot handled mixed rule {rule.lhs()} => {rule.rhs()}"
            )

Would it be possible to provide some tests that cover these new cases?
For example in:

That way, even people like myself who know about Chomsky Normal Form, but don't know the algorithm, can get some confidence in the correctness of the implementation.

Context-free grammars can have an empty set of productions.
Conversion to Chomsky Normal Form leads to an empty list of
productions if the language of the CFG is empty (i.e., does not
even contain the empty string, as with [S -> S]). The
grammar.CFG class did not allow for this case because some of the
methods to determine properties of the grammar (e.g., self.is_empty,
self.is_binary) referred to the minimum or maximum length of the
productions. If the list of productions is empty, Python's
'min' and 'max' functions raise a ValueError. This is fixed now:
the methods determining properties of the grammar are rewritten
using 'all', which succeeds on an empty list. The attributes
self._min_len and self._max_len were left in place and are set to
'None' if there are no productions.
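
The behaviour described in this commit message can be reproduced outside NLTK. The following is a minimal sketch, not the code from the PR: the helper names `min_max_len` and `all_at_most_binary` are hypothetical, and a grammar is reduced to a plain list of right-hand-side lengths.

```python
def min_max_len(lengths):
    """Return (min, max) of the right-hand-side lengths, or (None, None) if there are none."""
    if not lengths:
        return None, None
    return min(lengths), max(lengths)


def all_at_most_binary(lengths):
    """True if every production has at most two symbols on its right-hand side."""
    # all() over an empty iterable is vacuously True, so no special case is needed.
    return all(n <= 2 for n in lengths)


no_productions = []  # right-hand-side lengths of a grammar without productions

try:
    max(no_productions)
except ValueError as exc:
    print("max() on an empty list raises ValueError:", exc)

print(min_max_len(no_productions))         # (None, None)
print(all_at_most_binary(no_productions))  # True
```

Checks written with `all` need no guard for the empty case, whereas anything based on `min` or `max` does.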
@stefkauf
Author

I've been working on some doctest examples. But it is tricky because the requirements for matching outputs are so stringent. For instance, the following grammar for the language a^nb^n:

    S -> 'a' S 'b'
    S -> 

was on one run converted to this one, which I put in the doctest file:

    S0 -> 
    S0 -> T0 B1
    S0 -> T0 T1
    B0 -> S T1
    B1 -> S T1
    S -> T0 B0
    S -> T0 T1
    T0 -> 'a'
    T1 -> 'b'

But sometimes I get a different (but equivalent) result, for instance this one, which differs only in the order in which the new nonterminals were created:

    S0 -> 
    S0 -> T1 B0
    S0 -> T1 T0
    B0 -> S T0
    B1 -> S T0
    S -> T1 B1
    S -> T1 T0
    T0 -> 'b'
    T1 -> 'a'

There is some arbitrariness in the order in which new nonterminals are created because I use sets. I'm a bit reluctant to change this just to make doctest stop complaining. Would it be acceptable to use some other way to illustrate how it works? For instance, I could put a 'demo' function at the end of grammar.py, for users to invoke.
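
One way to gain confidence that the two outputs above really are equivalent, without comparing them textually, is to check that they accept the same strings. The sketch below is an assumption-laden illustration and not part of the PR: `parse_rules` and `accepts` are hypothetical helpers, the grammars are copied from the two listings above, and recognition uses a plain CYK table that also allows the start symbol to derive the empty string.

```python
from itertools import product


def parse_rules(text):
    """Parse lines such as "S0 -> T0 B1" into (lhs, rhs_tuple) pairs."""
    rules = []
    for line in text.strip().splitlines():
        lhs, _, rhs = line.partition("->")
        rules.append((lhs.strip(), tuple(rhs.split())))
    return rules


def accepts(rules, start, word):
    """CYK recognition for a CNF grammar whose start symbol may derive the empty string."""
    if not word:
        return (start, ()) in rules
    n = len(word)
    # table[i][k] holds the nonterminals that derive word[i : i + k + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        for lhs, rhs in rules:
            if rhs == (f"'{ch}'",):
                table[i][0].add(lhs)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for lhs, rhs in rules:
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        table[i][span - 1].add(lhs)
    return start in table[0][n - 1]


g1 = parse_rules("""
S0 ->
S0 -> T0 B1
S0 -> T0 T1
B0 -> S T1
B1 -> S T1
S -> T0 B0
S -> T0 T1
T0 -> 'a'
T1 -> 'b'
""")

g2 = parse_rules("""
S0 ->
S0 -> T1 B0
S0 -> T1 T0
B0 -> S T0
B1 -> S T0
S -> T1 B1
S -> T1 T0
T0 -> 'b'
T1 -> 'a'
""")

# Compare the two grammars on every string over {a, b} of length 0 to 6.
words = ["".join(w) for n in range(7) for w in product("ab", repeat=n)]
same = all(accepts(g1, "S0", w) == accepts(g2, "S0", w) for w in words)
print(same)
```

A bounded comparison like this is evidence rather than proof of equivalence, but it is insensitive to how the new nonterminals happen to be numbered.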

The earlier version created new non-terminals in an unpredictable
order because iteration over sets was used. This led to spurious
failures of doctest. The new version uses order-preserving
containers (dicts or, where possible, lists) to eliminate the
arbitrariness.
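
The effect of switching containers can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `name_terminals` is a hypothetical helper, productions are plain `(lhs, rhs)` tuples, and terminals are recognised by their surrounding quotes.

```python
def name_terminals(productions):
    """Assign T0, T1, ... to terminals in order of first appearance."""
    # A dict preserves insertion order (guaranteed since Python 3.7), so the
    # numbering depends only on the order of the input productions. Iterating
    # over a set of strings can give a different order from run to run
    # because string hashing is randomised per process.
    names = {}
    for _lhs, rhs in productions:
        for symbol in rhs:
            if symbol.startswith("'") and symbol not in names:
                names[symbol] = f"T{len(names)}"
    return names


productions = [("S", ("'a'", "S", "'b'")), ("S", ())]
print(name_terminals(productions))  # {"'a'": 'T0', "'b'": 'T1'}
```

With a deterministic numbering, the printed grammar is stable across runs, which is what doctest's exact output matching requires.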
@iliakur
Contributor

iliakur commented Nov 19, 2021

You may also want to consider unit tests for this. Stick with the ones that give you the biggest bang for the buck: simple to write and can cover more ground. I would argue pytest may be a better choice here than doctest.

@stefkauf
Author

stefkauf commented Nov 19, 2021

@iliakur Actually, in my latest update I've solved the problem by using dictionaries instead of sets where the order mattered. Since dictionaries preserve order of insertion, that eliminated the arbitrary variation.

@tomaarsen
Member

Well done on adding those doctests! I've handled a merge conflict between this PR and changes introduced from #2888. I'll have a look if I can inspect the implementation itself more thoroughly soon.

Member

@tomaarsen tomaarsen left a comment

As mentioned, I'll make time to look at the rest of the PR too, but this was something I noticed quickly.

stefkauf and others added 2 commits November 22, 2021 15:50
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
@tomaarsen
Member

Apologies, I have not yet had time. I've been quite busy.

nltk/grammar.py Outdated
@@ -734,114 +735,456 @@ def is_chomsky_normal_form(self):
"""
return self.is_flexible_chomsky_normal_form() and self._all_unary_are_lexical

def chomsky_normal_form(self, new_token_padding="@$@", flexible=False):
##################################################
# Stefan Kaufmann's proposed changes
Member

@stefkauf can we drop such comments please, and just identify you as a co-author at the top of the module?

nltk/grammar.py Outdated
return cls( grammar.start(), list(new_productions))


# End of Stefan Kaufmann's proposed changes
Member

... and this

nltk/grammar.py Outdated
##################################################

##################################################
# Code to be replaced by Stefan Kaufmann's version
Member

... and all this

@stevenbird
Member

@stefkauf thanks for this great contribution... I've reviewed the code, and just think some of your comments can go, as noted, then we can merge this.

@stefkauf
Author

stefkauf commented Jul 6, 2022

@stevenbird Thanks, Steven, for your messages. I've removed those comments and committed the new version of grammar.py.

@rmalouf
Contributor

rmalouf commented Mar 6, 2024

What is the status of this, @stevenbird?

@stefkauf
Author

stefkauf commented Apr 10, 2024 via email

@stevenbird
Member

stevenbird commented Apr 11, 2024

Hi Rob and Stefan, I'd welcome an updated PR and will try to turn it around quickly.

@github-actions github-actions bot removed the parsing label Apr 12, 2024
@stefkauf
Author

I've resolved a merge conflict; it should be good now.

5 participants