Merge branch 'develop' of https://github.com/ExplorerFreda/nltk into develop

* 'develop' of https://github.com/ExplorerFreda/nltk:
  Temporarily pause Python 3.10 CI tests due to scikit-learn issues with Windows
  Resolve IndexError in `sent_tokenize` (nltk#2922)
  Drop support for Python 3.6, support Python 3.10 (nltk#2920)
  updates for 3.6.6
  minor clean ups
  updates for 3.6.6
ExplorerFreda committed Dec 24, 2021
2 parents 1057d66 + ad78dac commit 6c99522
Showing 17 changed files with 95 additions and 38 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -76,7 +76,7 @@ jobs:
needs: [cache_nltk_data, cache_third_party]
strategy:
matrix:
python-version: [3.6, 3.7, 3.8, 3.9]
python-version: ['3.7', '3.8', '3.9']
os: [ubuntu-latest, macos-latest, windows-latest]
fail-fast: false
runs-on: ${{ matrix.os }}
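
The commit does not say why the matrix entries changed from bare numbers to quoted strings. One plausible reason, stated here as an assumption, is that a YAML parser reads an unquoted `3.10` as the float 3.1, so adding Python 3.10 to the list later would select the wrong version. The sketch below imitates that float round trip in plain Python; `unquoted_yaml_scalar` is a hypothetical helper, not part of any YAML library.

```python
def unquoted_yaml_scalar(token: str) -> str:
    """Mimic an unquoted numeric scalar being read as a float and printed back.

    Hypothetical helper for illustration only; a real YAML parser performs
    the float conversion internally.
    """
    return str(float(token))


for version in ["3.7", "3.9", "3.10"]:
    # Quoting the value in YAML keeps it a string, so no conversion happens.
    print(version, "->", unquoted_yaml_scalar(version))
```

Only the `3.10` entry changes under the round trip, which is why quoting every entry is a common precaution.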
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -10,7 +10,7 @@ repos:
rev: v2.23.3
hooks:
- id: pyupgrade
args: ["--py36-plus"]
args: ["--py37-plus"]
- repo: https://github.com/ambv/black
rev: 21.7b0
hooks:
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -77,7 +77,7 @@ Summary of our git branching model:
- Do many small commits on that branch locally (`git add files-changed`,
`git commit -m "Add some change"`);
- Run the tests to make sure nothing breaks
(`tox -e py36` if you are on Python 3.6);
(`tox -e py37` if you are on Python 3.7);
- Add your name to the `AUTHORS.md` file as a contributor;
- Push to your fork on GitHub (with the name as your local branch:
`git push origin branch-name`);
@@ -169,7 +169,7 @@ The [`.github/workflows/ci.yaml`](https://github.com/nltk/nltk/blob/develop/.git
- Otherwise, download all the data packages through `nltk.download('all')`.

- The `test` job
- tests against supported Python versions (`3.6`, `3.7`, `3.8`, `3.9`).
- tests against supported Python versions (`3.7`, `3.8`, `3.9`).
- tests on `ubuntu-latest` and `macos-latest`.
- relies on the `cache_nltk_data` job to ensure that `nltk_data` is available.
- performs these steps:
41 changes: 41 additions & 0 deletions ChangeLog
@@ -1,3 +1,44 @@
Version 3.6.6 2021-12-21

* Refactor `gensim.doctest` to work for gensim 4.0.0 and up (#2914)
* Add Precision, Recall, F-measure, Confusion Matrix to Taggers (#2862)
* Added warnings if .zip files exist without any corresponding .csv files. (#2908)
* Fix `FileNotFoundError` when the `download_dir` is a non-existing nested folder (#2910)
* Rename omw to omw-1.4 (#2907)
* Resolve ReDoS opportunity by fixing incorrectly specified regex (#2906)
* Support OMW 1.4 (#2899)
* Deprecate Tree get and set node methods (#2900)
* Fix broken inaugural test case (#2903)
* Use Multilingual Wordnet Data from OMW with newer Wordnet versions (#2889)
* Keep NLTKs "tokenize" module working with pathlib (#2896)
* Make prettyprinter to be more readable (#2893)
* Update links to the nltk book (#2895)
* Add `CITATION.cff` to nltk (#2880)
* Resolve serious ReDoS in PunktSentenceTokenizer (#2869)
* Delete old CI config files (#2881)
* Improve Tokenize documentation + add TokenizerI as superclass for TweetTokenizer (#2878)
* Fix expected value for BLEU score doctest after changes from #2572
* Add multi Bleu functionality and tests (#2793)
* Deprecate 'return_str' parameter in NLTKWordTokenizer and TreebankWordTokenizer (#2883)
* Allow empty string in CFG's + more (#2888)
* Partition `tree.py` module into `tree` package + pickle fix (#2863)
* Fix several TreebankWordTokenizer and NLTKWordTokenizer bugs (#2877)
* Rewind Wordnet data file after each lookup (#2868)
* Correct __init__ call for SyntaxCorpusReader subclasses (#2872)
* Documentation fixes (#2873)
* Fix levenstein distance for duplicated letters (#2849)
* Support alternative Wordnet versions (#2860)
* Remove hundreds of formatting warnings for nltk.org (#2859)
* Modernize `nltk.org/howto` pages (#2856)
* Fix Bleu Score smoothing function from taking log(0) (#2839)
* Update third party tools to newer versions and removing MaltParser fixed version (#2832)
* Fix TypeError: _pretty() takes 1 positional argument but 2 were given in sem/drt.py (#2854)
* Replace `http` with `https` in most URLs (#2852)

Thanks to the following contributors to 3.6.6
Adam Hawley, BatMrE, Danny Sepler, Eric Kafe, Gavish Poddar, Panagiotis Simakis,
RnDevelover, Robby Horvath, Tom Aarsen, Yuta Nakamura, Mohaned Mashaly

Version 3.6.5 2021-10-11

* modernised nltk.org website
3 changes: 2 additions & 1 deletion Makefile
@@ -51,10 +51,11 @@ windist: clean_code
########################################################################

clean: clean_code
rm -rf build iso dist api MANIFEST nltk-$(VERSION) nltk.egg-info
rm -rf build web/_build iso dist api MANIFEST nltk-$(VERSION) nltk.egg-info

clean_code:
rm -f `find nltk -name '*.pyc'`
rm -f `find nltk -name '*.pyo'`
rm -f `find . -name '*~'`
rm -rf `find . -name '__pycache__'`
rm -f MANIFEST # regenerate manifest from MANIFEST.in
2 changes: 1 addition & 1 deletion README.md
@@ -4,7 +4,7 @@

NLTK -- the Natural Language Toolkit -- is a suite of open source Python
modules, data sets, and tutorials supporting research and development in Natural
Language Processing. NLTK requires Python version 3.6, 3.7, 3.8, or 3.9.
Language Processing. NLTK requires Python version 3.7, 3.8, 3.9 or 3.10.

For documentation, please visit [nltk.org](https://www.nltk.org/).

3 changes: 2 additions & 1 deletion RELEASE-HOWTO.txt
@@ -33,14 +33,15 @@ Building an NLTK distribution
- Rebuild the API docs
sphinx-build -E ./web ./build
- Publish them
cd nltk.github.com; git pull (begin with current docs repo)
cd ../nltk.github.com; git pull (begin with current docs repo)
cp -r ../nltk/build/* .
git add .
git commit -m "updates for version 3.X.Y"
git push origin master

4. Create a new version
- Tag this version:
cd ../nltk
git tag -a 3.X.Y -m "version 3.X.Y"
git push --tags
verify that it shows up here: https://github.com/nltk/nltk/releases
2 changes: 1 addition & 1 deletion nltk/VERSION
@@ -1 +1 @@
3.6.5
3.6.6
4 changes: 2 additions & 2 deletions nltk/__init__.py
@@ -52,7 +52,7 @@
# Description of the toolkit, keywords, and the project's primary URL.
__longdescr__ = """\
The Natural Language Toolkit (NLTK) is a Python package for
natural language processing. NLTK requires Python 3.6, 3.7, 3.8, or 3.9."""
natural language processing. NLTK requires Python 3.7, 3.8, 3.9 or 3.10."""
__keywords__ = [
"NLP",
"CL",
@@ -84,10 +84,10 @@
"Intended Audience :: Science/Research",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Topic :: Scientific/Engineering :: Human Machine Interfaces",
5 changes: 5 additions & 0 deletions nltk/test/tokenize.doctest
@@ -310,6 +310,11 @@ Testing mutable default arguments for https://github.com/nltk/nltk/pull/2067
>>> type(pst._lang_vars)
<class 'nltk.tokenize.punkt.PunktLanguageVars'>

Testing that inputs can start with dots.

>>> pst = PunktSentenceTokenizer(lang_vars=None)
>>> pst.tokenize(". This input starts with a dot. This used to cause issues.")
['.', 'This input starts with a dot.', 'This used to cause issues.']

Regression Tests: align_tokens
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion nltk/tokenize/punkt.py
@@ -1379,7 +1379,7 @@ def _match_potential_end_contexts(self, text):
# Find the word before the current match
split = text[: match.start()].rsplit(maxsplit=1)
before_start = len(split[0]) if len(split) == 2 else 0
before_words[match] = split[-1]
before_words[match] = split[-1] if split else ""
matches.append(match)

return [
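
The one-line guard in this hunk can be reproduced without NLTK. When a potential sentence end is the very first character, the text before it is empty, `rsplit` returns an empty list, and indexing `split[-1]` raises `IndexError`. The sketch below is a minimal stand-alone illustration of that behaviour; `word_before` is a hypothetical helper that copies only the two relevant lines, not the tokenizer's actual method.

```python
def word_before(text: str, match_start: int) -> str:
    """Return the last whitespace-delimited word before match_start, or ""."""
    # An empty or whitespace-only prefix makes rsplit() return [].
    split = text[:match_start].rsplit(maxsplit=1)
    # The guard added in this commit: fall back to "" instead of indexing [].
    return split[-1] if split else ""


text = ". This input starts with a dot."
print(repr(word_before(text, 0)))                # nothing precedes the first dot
print(repr(word_before(text, text.rindex("."))))  # the word before the last dot
```

Without the `if split else ""` fallback, the first call would fail in the way described for nltk#2922.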
6 changes: 3 additions & 3 deletions setup.py
@@ -67,7 +67,7 @@
},
long_description="""\
The Natural Language Toolkit (NLTK) is a Python package for
natural language processing. NLTK requires Python 3.6, 3.7, 3.8, or 3.9.""",
natural language processing. NLTK requires Python 3.7, 3.8, 3.9 or 3.10.""",
license="Apache License, Version 2.0",
keywords=[
"NLP",
@@ -95,10 +95,10 @@
"Intended Audience :: Science/Research",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
"Topic :: Scientific/Engineering :: Human Machine Interfaces",
@@ -110,7 +110,7 @@
"Topic :: Text Processing :: Linguistic",
],
package_data={"nltk": ["test/*.doctest", "VERSION"]},
python_requires=">=3.6",
python_requires=">=3.7",
install_requires=[
"click",
"joblib",
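
With `python_requires=">=3.7"`, pip refuses to install this release on Python 3.6. A rough runtime equivalent of that constraint is sketched below, assuming only the minimum version stated in this diff; `MIN_PYTHON` and `is_supported` are hypothetical names, not part of NLTK or setuptools.

```python
import sys

# Assumption: mirrors python_requires=">=3.7" from this commit.
MIN_PYTHON = (3, 7)


def is_supported(version_info=sys.version_info) -> bool:
    """Compare (major, minor) as integers, so 3.10 sorts after 3.9."""
    return tuple(version_info[:2]) >= MIN_PYTHON


print(is_supported())           # result for the running interpreter
print(is_supported((3, 6, 15)))  # a 3.6 interpreter falls below the minimum
```

Comparing integer tuples rather than strings or floats avoids treating 3.10 as smaller than 3.7.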
28 changes: 10 additions & 18 deletions tox.ini
@@ -1,9 +1,9 @@
[tox]
envlist =
py{36,37,38,39}
py{37,38,39,310}
pypy
py{36,37,38,39}-nodeps
py{36,37,38,39}-jenkins
py{37,38,39,310}-nodeps
py{37,38,39,310}-jenkins
py-travis

[testenv]
@@ -51,13 +51,6 @@ deps =
commands =
pytest

[testenv:py36-nodeps]
basepython = python3.6
deps =
pytest
pytest-mock
commands = pytest

[testenv:py37-nodeps]
basepython = python3.7
deps =
@@ -79,18 +72,17 @@ deps =
pytest-mock
commands = pytest

[testenv:py310-nodeps]
basepython = python3.10
deps =
pytest
pytest-mock
commands = pytest

# Use minor version agnostic basepython, but specify testenv
# control Python2/3 versions using jenkins' user-defined matrix instead.
# Available Python versions: http://repository-cloudbees.forge.cloudbees.com/distributions/ci-addons/python/fc25/

[testenv:py3.6.4-jenkins]
basepython = python3
commands = {toxinidir}/jenkins.sh
setenv =
STANFORD_MODELS = {homedir}/third/stanford-parser/
STANFORD_PARSER = {homedir}/third/stanford-parser/
STANFORD_POSTAGGER = {homedir}/third/stanford-postagger/

[testenv:py-travis]
extras = all
setenv =
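
The `envlist` entries in this file use tox's brace shorthand, so `py{37,38,39,310}-nodeps` stands for four environment names. The sketch below shows that expansion for the simple, non-nested case only; `expand_factors` is a hypothetical helper and does not reproduce tox's full parsing rules (whitespace handling, conditional factors and so on).

```python
import re
from itertools import product
from typing import List


def expand_factors(spec: str) -> List[str]:
    """Expand 'py{37,38}-nodeps' style specs into concrete names."""
    # Splitting with a capture group alternates literal text (even indices)
    # with the comma-separated contents of each brace group (odd indices).
    parts = re.split(r"\{([^{}]*)\}", spec)
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


for name in expand_factors("py{37,38,39,310}-nodeps"):
    print(name)
```

Under this reading, dropping `36` and adding `310` inside the braces is what removes the Python 3.6 environments and introduces the 3.10 ones.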
4 changes: 2 additions & 2 deletions web/conf.py
@@ -115,9 +115,9 @@ def generate_howto():
# built documents.
#
# The short X.Y version.
version = "3.6.5"
version = "3.6.6"
# The full version, including alpha/beta/rc tags.
release = "3.6.5"
release = "3.6.6"

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
6 changes: 3 additions & 3 deletions web/dev/local_testing.rst
@@ -25,10 +25,10 @@ Please consult https://tox.wiki for more info about the tox tool.
Examples
--------

Run tests for python 3.6 in verbose mode; executing only tests
Run tests for python 3.7 in verbose mode; executing only tests
that failed in the last test run::

tox -e py36 -- -v --failed
tox -e py37 -- -v --failed

Run tree doctests for all available interpreters::

@@ -42,7 +42,7 @@ By default, numpy, scipy and scikit-learn are installed in tox virtualenvs.
This is slow, requires working build toolchain and is not always feasible.
In order to skip numpy & friends, use ``..-nodeps`` environments::

tox -e py36-nodeps,py37,pypy
tox -e py37-nodeps,py37,pypy

It is also possible to run tests without tox. This way NLTK would be tested
only under single interpreter, but it may be easier to have numpy and other
2 changes: 1 addition & 1 deletion web/install.rst
@@ -1,7 +1,7 @@
Installing NLTK
===============

NLTK requires Python versions 3.6, 3.7, 3.8, or 3.9
NLTK requires Python versions 3.7, 3.8, 3.9 or 3.10

For Windows users, it is strongly recommended that you go through this guide to install Python 3 successfully https://docs.python-guide.org/starting/install3/win/#install3-windows

17 changes: 17 additions & 0 deletions web/news.rst
@@ -4,6 +4,23 @@ Release Notes
2021
----

NLTK 3.6.6 release: December 2021:
add precision, recall, F-measure, confusion matrix to Taggers
support alternative Wordnet versions (#2860)
support OMW 1.4, use Multilingual Wordnet Data from OMW with newer Wordnet versions
add multi Bleu functionality
allow empty string in CFG's + more
fix several TreebankWordTokenizer and NLTKWordTokenizer bugs
fix levenstein distance for duplicated letters
modernize `nltk.org/howto` pages
update third party tools to newer versions

NLTK 3.6.5 release: October 2021:
support emoji ZJW sequences and skin tone modifiers in TweetTokenizer
METEOR evaluation now requires pre-tokenized input
code linting and type hinting
avoid re.Pattern and regex.Pattern which fail for Python 3.6, 3.7

NLTK 3.6.4 release: October 2021
improved phone number recognition in tweet tokenizer
resolved ReDoS vulnerability in Corpus Reader