Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix decode error for bllip parser #2897

Merged
merged 4 commits into from Dec 16, 2021
Merged

Conversation

dannysepler
Copy link
Contributor

Closes #1920

This was fairly trivial to fix, but this seems like it was broken since py3 was rolled out. Do we know why more people didn't complain about this? Is it possible that this parser isn't being used at the moment?

I also added a very simple unit test, to capture simple breakages in the BllipParser for next time

@dannysepler
Copy link
Contributor Author

It looks like the bllip-parser has been broken on Windows for quite some time now BLLIP/bllip-parser#48

I'm not sure what to do here. We could just not have these new tests run on CI, or perhaps we should only install the pip package on non-Windows environments?

@dannysepler
Copy link
Contributor Author

okay looking for a bit of advice here!

nltk's BllipParser has likely been broken for years. as a result, we can reasonably assume it is not being used

The parser also isn't unit tested. adding tests to the CI build is tricky, since the bllipparser pip package itself has been broken on windows since 2016. i see a few paths forward, and am unsure which to take:

  • fix the BllipParser, don't add unit tests to the CI
  • fix the BllipParser, add unit tests to the CI, but have our CI build only download the bllipparser pip package on mac and linux
    • this will add some complexity to the CI build for a modest gain in test coverage
  • deprecate / eventually delete the BllipParser since we don't think it's being used

@tomaarsen
Copy link
Member

This is a bit of a tricky one. I'm not familiar with BllipParser, which means I don't know whether it has functionality that isn't already possible through some other parser. That would affect whether we should care to fix the BllipParser.
However, if the fix is simple, then I don't see why we shouldn't at least push the fix.

Regarding CI, we already have some third party tools which are downloaded for Mac and Linux only, but adding a package which must be installed via pip to that might be a hastle. I already believe the CI is a bit vague and complex, and I don't really want to add to that even more.

Perhaps all of this means that I'm leaning towards option 1.

An interesting mix of everything would be to add the tests as quick unit tests to method/class bodies of nltk/parse/bllip.py itself. These doctests are not ran automatically with the CI, but interested users can pytest --doctest-modules nltk/parse/bllip.py. That way, your work of coming up with the tests isn't wasted either. (With # doctest: +NORMALIZE_WHITESPACE to get the tests passing etc.)

That's my view, others might disagree.

@tomaarsen tomaarsen added the bug label Dec 5, 2021
@stevenbird
Copy link
Member

@dmcc do you have any advice for us please? Is there any point in us maintaining NLTK's BLLIP parser?

@tomaarsen
Copy link
Member

@dannysepler
#2903 has fixed a broken test case which snuck into this PR. I've merged develop into this PR to resolve this broken test.
That said, this PR still has broken tests from the PR itself.

@dmcc
Copy link
Contributor

dmcc commented Dec 6, 2021

Yikes, sorry this was broken for so long. BLLIP Parser does provide some functionality not available in the other NLTK parsers, so there's an argument for keeping it running. It's a bit surprising that nobody noticed/reported it was broken for so long, that's a concern. That said, I have recently received bug reports for the lower-level BLLIP Python interface, so it appears some people are still using it.

Option 1 sounds good to me -- I don't have easy access to Windows or Mac OS X machines to debug the issues, I'm afraid.

@stevenbird
Copy link
Member

Thanks @dmcc

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dannysepler So, I believe the consensus is to apply the fix, but don't add the test to CI.

I believe all that needs to happen at this point is to remove bliipparser from requirements-ci.txt. Doing so will cause nltk/test/unit/test_bllip.py to be skipped in the CI, but users who have bllipparser installed will still be able to run these tests as a part of tox or pytest nltk/test.

Beyond that, I have confidence that your changes indeed work, as the tests for macOS and Ubuntu passed.

@dannysepler
Copy link
Contributor Author

dannysepler commented Dec 16, 2021

sorry for the delay @tomaarsen, and thank you for the guidance!

thank you as well to @dmcc for confirming the parser's usefulness!

Copy link
Member

@tomaarsen tomaarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I just have the one small nitpick, which isn't important at all.

nltk/parse/bllip.py Outdated Show resolved Hide resolved
@tomaarsen tomaarsen merged commit 59aa3fb into nltk:develop Dec 16, 2021
@tomaarsen
Copy link
Member

Thanks Danny! Well done.

@dannysepler dannysepler deleted the fix-decode-bllip branch January 5, 2022 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BllipParser: AttributeError: 'str' object has no attribute 'decode' on Python 3.5
4 participants