Added warnings if nltk_data .zip files exist without any corresponding .xml files. #2908

Merged 2 commits on Dec 15, 2021

@tomaarsen commented Dec 10, 2021

Resolves nltk/nltk_data#178

Hello!

Pull request overview

  • Added warnings when build_index in nltk.downloader is run and the root folder contains .zip files without corresponding .xml files.

The issue

This PR is inspired mainly by nltk/nltk_data#178, as well as by nltk/nltk_data#176 and nltk/nltk_data#177. @ekaf noticed that several resources in nltk_data have a .zip file with data but no .xml file to specify what kind of data it is. When the .xml file is missing, the .zip file is silently ignored.

We ran into this recently with omw-1.4.zip (nltk/nltk_data#176) and wordnet2021.zip (nltk/nltk_data#177), and it seems that more resources have the same problem.

This PR adds a warning in these situations.
Running make pkg_index on nltk_data now also outputs the following:

[sic]\nltk\downloader.py:2294: UserWarning: listing.csv.zip exists, but listing.csv.xml cannot be found! This could mean that listing.csv can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):
[sic]\nltk\downloader.py:2294: UserWarning: omw-1.4.zip exists, but omw-1.4.xml cannot be found! This could mean that omw-1.4 can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):
[sic]\nltk\downloader.py:2294: UserWarning: ptb3.zip exists, but ptb3.xml cannot be found! This could mean that ptb3 can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):
[sic]\nltk\downloader.py:2294: UserWarning: wordnet2021.zip exists, but wordnet2021.xml cannot be found! This could mean that wordnet2021 can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):

This should prevent us from accidentally forgetting to add an .xml to a new nltk_data resource.
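
For illustration, here is a minimal sketch of the kind of check described above. It is not the actual nltk.downloader code: the helper name `find_orphan_zips` and the directory walk are assumptions made for this example, and only the warning text is modelled on the output shown above.

```python
import os
import warnings


def find_orphan_zips(packages_dir):
    """Return names of .zip files that have no sibling .xml file.

    Hypothetical helper for illustration; nltk.downloader performs its
    check inside its own package-discovery code.
    """
    orphans = []
    for _dirpath, _dirnames, filenames in os.walk(packages_dir):
        names = set(filenames)
        for name in sorted(names):
            if not name.endswith(".zip"):
                continue
            # "listing.csv.zip" -> "listing.csv"
            stem = name[: -len(".zip")]
            if stem + ".xml" not in names:
                orphans.append(name)
                warnings.warn(
                    f"{name} exists, but {stem}.xml cannot be found! "
                    f"This could mean that {stem} can not be downloaded.",
                    UserWarning,
                )
    return orphans
```

Calling `find_orphan_zips("packages")` from the root of an nltk_data checkout would then warn once per .zip file that lacks a matching .xml file in the same folder.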

Notes

Please ignore the error in my commit message: it was meant to say .xml instead of .csv. I was distracted by the listing.csv and listing.csv.zip files. Are these files even still being used? GitHub is complaining that listing.csv isn't even valid, as the number of columns doesn't match across all rows.
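
A quick way to confirm the column mismatch locally could look like the sketch below. It assumes listing.csv is a plain comma-separated, UTF-8 encoded file; the helper name `column_counts` is hypothetical.

```python
import csv
from collections import Counter


def column_counts(path):
    """Count how many rows have each number of columns in a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(len(row) for row in csv.reader(f))
```

If the returned counter has more than one key, the rows do not all have the same number of columns.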

Furthermore, this PR shows that ptb3 (i.e. a stub of Penn Treebank 3) cannot be downloaded. It seems that ptb is now being used, and that ptb3 is just an old copy of ptb.
Perhaps this means we can just delete ptb3 from nltk_data.

Thank you @ekaf for the suggestion.

- Tom Aarsen

@tomaarsen

It seems there were issues with #2909, so I'll revert to not using that in this PR.

@tomaarsen

The failing tests seem to be unrelated to this PR; they concern SENNA. I'm not sure yet why they are failing.

@tomaarsen

I re-ran the tests: they re-downloaded the third-party tools, and now the tests pass correctly 🎉

@stevenbird stevenbird merged commit 72d9885 into nltk:develop Dec 15, 2021
@stevenbird

Thanks @tomaarsen, @purificant – great idea!

Linked issue: Packages missing a corresponding .xml file