Added warnings if nltk_data .zip files exist without any corresponding .xml files.
#2908
Resolves nltk/nltk_data#178
Hello!
Pull request overview
Adds a warning when `build_index` is run in `nltk.downloader` and the root folder contains `.zip` files without any corresponding `.xml` files.
The issue
Prompted mainly by nltk/nltk_data#178, as well as by nltk/nltk_data#176 and nltk/nltk_data#177, @ekaf noticed that several resources in `nltk_data` have `.zip` files with data, but no `.xml` file to specify what kind of data it is. When there is no `.xml`, these `.zip` files are silently ignored.
We recently experienced this issue with `omw-1.4.zip` (nltk/nltk_data#176) and `wordnet2021.zip` (nltk/nltk_data#177), and it seems more resources are affected.
This PR adds a warning in these situations.
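For illustration only, the kind of check described above could look roughly like the sketch below. It is not the code from this PR: the helper names `find_zips_without_xml` and `warn_about_orphans` are hypothetical, and it assumes each package's metadata sits next to its archive under the same base name (`foo.zip` next to `foo.xml`).

```python
import warnings
from pathlib import Path


def find_zips_without_xml(root):
    """Return the .zip files under `root` that have no sibling .xml file.

    Hypothetical helper: assumes the metadata file shares the archive's
    base name and directory (foo.zip <-> foo.xml).
    """
    orphans = []
    for zip_path in sorted(Path(root).rglob("*.zip")):
        # with_suffix swaps only the final ".zip", so "omw-1.4.zip"
        # is matched against "omw-1.4.xml".
        if not zip_path.with_suffix(".xml").exists():
            orphans.append(zip_path)
    return orphans


def warn_about_orphans(root):
    """Emit one warning per .zip file that has no matching .xml file."""
    orphans = find_zips_without_xml(root)
    for zip_path in orphans:
        xml_name = zip_path.with_suffix(".xml").name
        warnings.warn(f"{zip_path} exists, but {xml_name} does not.")
    return orphans
```

Calling `warn_about_orphans("packages")` on a local checkout (the `packages` path is an assumption here) would then list every archive that lacks a matching `.xml` file.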
Running `make pkg_index` on `nltk_data` now also outputs these warnings.
This should prevent us from accidentally forgetting to add an `.xml` to a new `nltk_data` resource.
Notes
Ignore my error in the commit message; it's meant to say `.xml` instead of `.csv`. I was distracted by the `listing.csv` and `listing.csv.zip` files. Are these files even still being used? GitHub is complaining that `listing.csv` isn't even correct, as the number of columns doesn't match between all rows.
Furthermore, this PR shows that `ptb3` (i.e. a stub of Penn Treebank 3) cannot be downloaded. It seems that `ptb` is now being used, and that `ptb3` is just an old copy of `ptb`. Perhaps this means we can just delete `ptb3` from `nltk_data`.
Thank you @ekaf for the suggestion.