Merge branch 'nltk:develop' into feature/multi-BLEU

BatMrE committed Oct 6, 2021
2 parents c730dc3 + 3ffed20 commit 1c2050c
Showing 36 changed files with 960 additions and 383 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -94,7 +94,7 @@ Summary of our git branching model:
- Never use `git add .`: it can add unwanted files;
- Avoid using `git commit -a` unless you know what you're doing;
- Check every change with `git diff` before adding them to the index (stage
area) and with `git diff --cached` before commiting;
area) and with `git diff --cached` before committing;
- Make sure you add your name to our [list of contributors](https://github.com/nltk/nltk/blob/develop/AUTHORS.md);
- If you have push access to the main repository, please do not commit directly
to `develop`: your access should be used only to accept pull requests; if you
29 changes: 23 additions & 6 deletions ChangeLog
@@ -1,4 +1,18 @@
Version 3.6.3 2021-08-??
Version 3.6.4 2021-10-01

* deprecate `nltk.usage(obj)` in favor of `help(obj)`
* resolve ReDoS vulnerability in Corpus Reader
* solidify performance tests
* improve phone number recognition in tweet tokenizer
* refactored CISTEM stemmer for German
* identify NLTK Team as the author
* replace travis badge with github actions badge
* add SECURITY.md

Thanks to the following contributors to 3.6.4
Tom Aarsen, Mohaned Mashaly, Dimitri Papadopoulos Orfanos, purificant, Danny Sepler

Version 3.6.3 2021-09-19
* Dropped support for Python 3.5
* Run CI tests on Windows, too
* Moved from Travis CI to GitHub Actions
@@ -12,11 +26,14 @@ Version 3.6.3 2021-08-??
* Fixed AttributeError for Arabic ARLSTem2 stemmer
* Many fixes and improvements to lm language model package
* Fix bug in nltk.metrics.aline, C_skip = -10
* Improvements to TweetTokenizer
* Optional show arg for FreqDist.plot, ConditionalFreqDist.plot
* edit_distance now computes Damerau-Levenshtein edit-distance

Thanks to the following contributors to 3.6.3
Tom Aarsen, Michael Wayne Goodman, Michał Górny, Maarten ter Huurne, Manu Joseph,
Eric Kafe, Ilia Kurenkov, Daniel Loney, Rob Malouf, purificant, Danny Sepler,
Anthony Sottile
Tom Aarsen, Abhijnan Bajpai, Michael Wayne Goodman, Michał Górny, Maarten ter Huurne,
Manu Joseph, Eric Kafe, Ilia Kurenkov, Daniel Loney, Rob Malouf, Mohaned Mashaly,
purificant, Danny Sepler, Anthony Sottile

Version 3.6.2 2021-04-20
* move test code to nltk/test
@@ -752,7 +769,7 @@ NLTK:
Data:
* Corrected identifiers in Dependency Treebank corpus
* Basque and Catalan Dependency Treebanks (CoNLL 2007)
* PE08 Parser Evalution data
* PE08 Parser Evaluation data
* New models for POS tagger and named-entity tagger

Book:
@@ -1065,7 +1082,7 @@ Code:
- changed corpus.util to use the 'rb' flag for opening files, to fix problems
reading corpora under MSWindows
- updated stale examples in engineering.txt
- extended feature stucture interface to permit chained features, e.g. fs['F','G']
- extended feature structure interface to permit chained features, e.g. fs['F','G']
- further misc improvements to test code plus some bugfixes
Tutorials:
- rewritten opening section of tagging chapter
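One entry above, "extended feature structure interface to permit chained features, e.g. fs['F','G']", describes tuple-indexing into nested feature structures. A minimal sketch of the idea, assuming the standard FeatStruct bracket notation:

    from nltk.featstruct import FeatStruct

    # A nested feature structure: feature F holds an inner structure with G=b.
    fs = FeatStruct("[F=[G=b]]")

    # Chained access with a tuple of feature names...
    print(fs["F", "G"])    # b

    # ...is shorthand for indexing one level at a time.
    print(fs["F"]["G"])    # b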
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# Natural Language Toolkit (NLTK)
[![PyPI](https://img.shields.io/pypi/v/nltk.svg)](https://pypi.python.org/pypi/nltk)
[![Travis](https://travis-ci.org/nltk/nltk.svg?branch=develop)](https://travis-ci.org/nltk/nltk)
![CI](https://github.com/nltk/nltk/actions/workflows/ci.yaml/badge.svg?branch=develop)

NLTK -- the Natural Language Toolkit -- is a suite of open source Python
modules, data sets, and tutorials supporting research and development in Natural
28 changes: 10 additions & 18 deletions RELEASE-HOWTO.txt
@@ -2,8 +2,8 @@ Building an NLTK distribution
----------------------------------

1. Testing
- Ensure CI server isn't reporting any test failures
https://www.travis-ci.org/nltk/nltk
- Check no errors are reported in our continuous integration service:
https://github.com/nltk/nltk/actions
- Optionally test demonstration code locally
make demotest
- Optionally test individual modules:
@@ -29,17 +29,13 @@ Building an NLTK distribution
(including the range of Python versions that are supported)
edit web/install.rst setup.py
- Rebuild the API docs
- make sure you have the current revision of the web pages
cd nltk.github.com; git pull
- build
cd ../nltk/web
make (slow; lots of warning messages about cross references)
- publish
cd ../../nltk.github.com
git add _modules _sources _static api *.html objects.inv searchindex.js
git status (missing any important looking files?)
git commit -m "updates for version 3.X.Y"
git push origin master
python setup.py build_sphinx -b man --build-dir build/sphinx
- Publish them
cd nltk.github.com; git pull (begin with current docs repo)
<copy them over from build/sphinx to ../nltk.github.com>
git add .
git commit -m "updates for version 3.X.Y"
git push origin master

4. Create a new version
- (Optionally do this in a release branch, branching from develop branch
@@ -65,12 +61,8 @@ Building an NLTK distribution
nltk-dev (for beta releases)
nltk-users (for final releases)
nltk twitter account
- announce to external mailing lists, for major N.N releases only
CORPORA@uib.no, linguist@linguistlist.org,
PythonSIL@lists.sil.org, edu-sig@python.org
mailing lists for any local courses using NLTK

7. Optionally update to new version
7. Optionally update repo version
- we don't want builds from the repository to have the same release number
e.g. after release X.Y.4, update repository version to X.Y.5a (alpha)

5 changes: 5 additions & 0 deletions SECURITY.md
@@ -0,0 +1,5 @@
# Security Policy

## Reporting a Vulnerability

Please report security issues to `nltk.team@gmail.com`
2 changes: 1 addition & 1 deletion jenkins.sh
@@ -24,7 +24,7 @@ if [[ ! -d $senna_folder_name ]]; then
rm ${senna_file_name}
fi

# Setup the Enviroment variable
# Setup the Environment variable
export SENNA=$(pwd)'/senna'

popd
2 changes: 1 addition & 1 deletion nltk/VERSION
@@ -1 +1 @@
3.6.2
3.6.4
4 changes: 2 additions & 2 deletions nltk/__init__.py
@@ -70,8 +70,8 @@
__url__ = "http://nltk.org/"

# Maintainer, contributors, etc.
__maintainer__ = "Steven Bird"
__maintainer_email__ = "stevenbird1@gmail.com"
__maintainer__ = "NLTK Team"
__maintainer_email__ = "nltk.team@gmail.com"
__author__ = __maintainer__
__author_email__ = __maintainer_email__

2 changes: 1 addition & 1 deletion nltk/corpus/reader/comparative_sents.py
@@ -45,7 +45,7 @@
GRAD_COMPARISON = re.compile(r"<cs-[123]>")
NON_GRAD_COMPARISON = re.compile(r"<cs-4>")
ENTITIES_FEATS = re.compile(r"(\d)_((?:[\.\w\s/-](?!\d_))+)")
KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")
KEYWORD = re.compile(r"\(([^\(]*)\)$")


class Comparison:
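The KEYWORD change above swaps a lookahead-heavy pattern for a simple negated character class. Both are meant to capture the contents of the final parenthesized keyword at the end of an annotation line, but the old pattern's combination of lookahead and unbounded quantifiers could backtrack catastrophically, which is the ReDoS vulnerability noted in the ChangeLog. A quick sketch of the new pattern on a made-up line in the style of the corpus annotations:

    import re

    # New pattern: capture everything inside the last "(...)" pair at end of line.
    KEYWORD = re.compile(r"\(([^\(]*)\)$")

    match = KEYWORD.search("this camera has better features (features)")
    print(match.group(1))  # features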
2 changes: 1 addition & 1 deletion nltk/corpus/reader/wordnet.py
@@ -1136,7 +1136,7 @@ def __init__(self, root, omw_reader):
# Map from lemma -> pos -> synset_index -> offset
self._lemma_pos_offset_map = defaultdict(dict)

# A cache so we don't have to reconstuct synsets
# A cache so we don't have to reconstruct synsets
# Map from pos -> offset -> synset
self._synset_offset_cache = defaultdict(dict)

2 changes: 1 addition & 1 deletion nltk/featstruct.py
@@ -1858,7 +1858,7 @@ def _default_fs_class(obj):

class SubstituteBindingsSequence(SubstituteBindingsI):
"""
A mixin class for sequence clases that distributes variables() and
A mixin class for sequence classes that distributes variables() and
substitute_bindings() over the object's elements.
"""

30 changes: 26 additions & 4 deletions nltk/metrics/distance.py
@@ -34,7 +34,13 @@ def _edit_dist_init(len1, len2):
return lev


def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False):
def _last_left_t_init(sigma):
return {c: 0 for c in sigma}


def _edit_dist_step(
lev, i, j, s1, s2, last_left, last_right, substitution_cost=1, transpositions=False
):
c1 = s1[i - 1]
c2 = s2[j - 1]

@@ -47,9 +53,8 @@ def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False

# transposition
d = c + 1 # never picked by default
if transpositions and i > 1 and j > 1:
if s1[i - 2] == c2 and s2[j - 2] == c1:
d = lev[i - 2][j - 2] + 1
if transpositions and last_left > 0 and last_right > 0:
d = lev[last_left - 1][last_right - 1] + i - last_left + j - last_right - 1

# pick the cheapest
lev[i][j] = min(a, b, c, d)
@@ -85,18 +90,33 @@ def edit_distance(s1, s2, substitution_cost=1, transpositions=False):
len2 = len(s2)
lev = _edit_dist_init(len1 + 1, len2 + 1)

# retrieve alphabet
sigma = set()
sigma.update(s1)
sigma.update(s2)

# set up table to remember positions of last seen occurrence in s1
last_left_t = _last_left_t_init(sigma)

# iterate over the array
for i in range(len1):
last_right = 0
for j in range(len2):
last_left = last_left_t[s2[j]]
_edit_dist_step(
lev,
i + 1,
j + 1,
s1,
s2,
last_left,
last_right,
substitution_cost=substitution_cost,
transpositions=transpositions,
)
if s1[i] == s2[j]:
last_right = j + 1
last_left_t[s1[i]] = i + 1
return lev[len1][len2]


@@ -162,6 +182,8 @@ def edit_distance_align(s1, s2, substitution_cost=1):
j + 1,
s1,
s2,
0,
0,
substitution_cost=substitution_cost,
transpositions=False,
)
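The rewrite above replaces the old adjacent-swap check (s1[i-2] == c2 and s2[j-2] == c1) with the unrestricted Damerau-Levenshtein algorithm: last_left_t remembers, for each alphabet character, the last row where it occurred in s1, and last_right tracks the last matching column in the current row, so a transposition can now span intervening insertions and deletions. A minimal sketch of the user-visible effect, assuming the unrestricted behavior this change introduces:

    from nltk.metrics.distance import edit_distance

    # Plain Levenshtein: swapping two adjacent characters costs two substitutions.
    print(edit_distance("ab", "ba"))                        # 2

    # With transpositions enabled, the swap counts as a single edit.
    print(edit_distance("ab", "ba", transpositions=True))   # 1

    # The classic case separating unrestricted Damerau-Levenshtein from the
    # old adjacent-only rule: "ca" -> "ac" (transpose) -> "abc" (insert "b")
    # costs 2, where the previous implementation reported 3.
    print(edit_distance("ca", "abc", transpositions=True))  # 2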
2 changes: 1 addition & 1 deletion nltk/parse/util.py
@@ -162,7 +162,7 @@ def run(self, show_trees=False):
Sentences in the test suite are divided into two classes:
- grammatical (``accept``) and
- ungrammatical (``reject``).
If a sentence should parse accordng to the grammar, the value of
If a sentence should parse according to the grammar, the value of
``trees`` will be a non-empty list. If a sentence should be rejected
according to the grammar, then the value of ``trees`` will be None.
"""
16 changes: 8 additions & 8 deletions nltk/sentiment/sentiment_analyzer.py
@@ -47,10 +47,10 @@ def all_words(self, documents, labeled=None):
all_words = []
if labeled is None:
labeled = documents and isinstance(documents[0], tuple)
if labeled == True:
for words, sentiment in documents:
if labeled:
for words, _sentiment in documents:
all_words.extend(words)
elif labeled == False:
elif not labeled:
for words in documents:
all_words.extend(words)
return all_words
@@ -218,7 +218,7 @@ def evaluate(
classifier = self.classifier
print(f"Evaluating {type(classifier).__name__} results...")
metrics_results = {}
if accuracy == True:
if accuracy:
accuracy_score = eval_accuracy(classifier, test_set)
metrics_results["Accuracy"] = accuracy_score

@@ -232,22 +232,22 @@ test_results[observed].add(i)
test_results[observed].add(i)

for label in labels:
if precision == True:
if precision:
precision_score = eval_precision(
gold_results[label], test_results[label]
)
metrics_results[f"Precision [{label}]"] = precision_score
if recall == True:
if recall:
recall_score = eval_recall(gold_results[label], test_results[label])
metrics_results[f"Recall [{label}]"] = recall_score
if f_measure == True:
if f_measure:
f_measure_score = eval_f_measure(
gold_results[label], test_results[label]
)
metrics_results[f"F-measure [{label}]"] = f_measure_score

# Print evaluation results (in alphabetical order)
if verbose == True:
if verbose:
for result in sorted(metrics_results):
print(f"{result}: {metrics_results[result]}")

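The sentiment changes above replace == True and == False comparisons with plain truthiness tests, the idiom PEP 8 recommends. The two are not always interchangeable, which a tiny illustration with a hypothetical flag makes clear:

    flag = "yes"            # truthy, but not the bool True

    print(flag == True)     # False: equality compares values, and "yes" != True
    print(bool(flag))       # True: any non-empty string is truthy

    if flag:                # runs; this is the form the new code uses
        print("truthy")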
