Merge branch 'nltk:develop' into feature/multi-BLEU

BatMrE committed Oct 6, 2021
2 parents c730dc3 + 3ffed20 commit 1c2050c
Showing 36 changed files with 960 additions and 383 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -94,7 +94,7 @@ Summary of our git branching model:
- Never use `git add .`: it can add unwanted files;
- Avoid using `git commit -a` unless you know what you're doing;
- Check every change with `git diff` before adding them to the index (stage
area) and with `git diff --cached` before commiting;
area) and with `git diff --cached` before committing;
- Make sure you add your name to our [list of contributors](https://github.com/nltk/nltk/blob/develop/AUTHORS.md);
- If you have push access to the main repository, please do not commit directly
to `develop`: your access should be used only to accept pull requests; if you
29 changes: 23 additions & 6 deletions ChangeLog
@@ -1,4 +1,18 @@
Version 3.6.3 2021-08-??
Version 3.6.4 2021-10-01

* deprecate `nltk.usage(obj)` in favor of `help(obj)`
* resolve ReDoS vulnerability in Corpus Reader
* solidify performance tests
* improve phone number recognition in tweet tokenizer
* refactored CISTEM stemmer for German
* identify NLTK Team as the author
* replace travis badge with github actions badge
* add SECURITY.md

Thanks to the following contributors to 3.6.4
Tom Aarsen, Mohaned Mashaly, Dimitri Papadopoulos Orfanos, purificant, Danny Sepler

Version 3.6.3 2021-09-19
* Dropped support for Python 3.5
* Run CI tests on Windows, too
* Moved from Travis CI to GitHub Actions
@@ -12,11 +26,14 @@ Version 3.6.3 2021-08-??
* Fixed AttributeError for Arabic ARLSTem2 stemmer
* Many fixes and improvements to lm language model package
* Fix bug in nltk.metrics.aline, C_skip = -10
* Improvements to TweetTokenizer
* Optional show arg for FreqDist.plot, ConditionalFreqDist.plot
* edit_distance now computes Damerau-Levenshtein edit-distance

Thanks to the following contributors to 3.6.3
Tom Aarsen, Michael Wayne Goodman, Michał Górny, Maarten ter Huurne, Manu Joseph,
Eric Kafe, Ilia Kurenkov, Daniel Loney, Rob Malouf, purificant, Danny Sepler,
Anthony Sottile
Tom Aarsen, Abhijnan Bajpai, Michael Wayne Goodman, Michał Górny, Maarten ter Huurne,
Manu Joseph, Eric Kafe, Ilia Kurenkov, Daniel Loney, Rob Malouf, Mohaned Mashaly,
purificant, Danny Sepler, Anthony Sottile

Version 3.6.2 2021-04-20
* move test code to nltk/test
@@ -752,7 +769,7 @@ NLTK:
Data:
* Corrected identifiers in Dependency Treebank corpus
* Basque and Catalan Dependency Treebanks (CoNLL 2007)
* PE08 Parser Evalution data
* PE08 Parser Evaluation data
* New models for POS tagger and named-entity tagger

Book:
@@ -1065,7 +1082,7 @@ Code:
- changed corpus.util to use the 'rb' flag for opening files, to fix problems
reading corpora under MSWindows
- updated stale examples in engineering.txt
- extended feature stucture interface to permit chained features, e.g. fs['F','G']
- extended feature structure interface to permit chained features, e.g. fs['F','G']
- further misc improvements to test code plus some bugfixes
Tutorials:
- rewritten opening section of tagging chapter
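One entry above, "extended feature structure interface to permit chained features, e.g. fs['F','G']", describes tuple-indexing into nested feature structures. A minimal sketch of the idea, assuming the standard FeatStruct bracket notation:

    from nltk.featstruct import FeatStruct

    # A nested feature structure: feature F holds an inner structure with G=b.
    fs = FeatStruct("[F=[G=b]]")

    # Chained access with a tuple of feature names...
    print(fs["F", "G"])    # b

    # ...is shorthand for indexing one level at a time.
    print(fs["F"]["G"])    # b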
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# Natural Language Toolkit (NLTK)
[![PyPI](https://img.shields.io/pypi/v/nltk.svg)](https://pypi.python.org/pypi/nltk)
[![Travis](https://travis-ci.org/nltk/nltk.svg?branch=develop)](https://travis-ci.org/nltk/nltk)
![CI](https://github.com/nltk/nltk/actions/workflows/ci.yaml/badge.svg?branch=develop)

NLTK -- the Natural Language Toolkit -- is a suite of open source Python
modules, data sets, and tutorials supporting research and development in Natural
28 changes: 10 additions & 18 deletions RELEASE-HOWTO.txt
@@ -2,8 +2,8 @@ Building an NLTK distribution
----------------------------------

1. Testing
- Ensure CI server isn't reporting any test failures
https://www.travis-ci.org/nltk/nltk
- Check no errors are reported in our continuous integration service:
https://github.com/nltk/nltk/actions
- Optionally test demonstration code locally
make demotest
- Optionally test individual modules:
@@ -29,17 +29,13 @@ Building an NLTK distribution
(including the range of Python versions that are supported)
edit web/install.rst setup.py
- Rebuild the API docs
- make sure you have the current revision of the web pages
cd nltk.github.com; git pull
- build
cd ../nltk/web
make (slow; lots of warning messages about cross references)
- publish
cd ../../nltk.github.com
git add _modules _sources _static api *.html objects.inv searchindex.js
git status (missing any important looking files?)
git commit -m "updates for version 3.X.Y"
git push origin master
python setup.py build_sphinx -b man --build-dir build/sphinx
- Publish them
cd nltk.github.com; git pull (begin with current docs repo)
<copy them over from build/sphinx to ../nltk.github.com>
git add .
git commit -m "updates for version 3.X.Y"
git push origin master

4. Create a new version
- (Optionally do this in a release branch, branching from develop branch
@@ -65,12 +61,8 @@ Building an NLTK distribution
nltk-dev (for beta releases)
nltk-users (for final releases)
nltk twitter account
- announce to external mailing lists, for major N.N releases only
CORPORA@uib.no, linguist@linguistlist.org,
PythonSIL@lists.sil.org, edu-sig@python.org
mailing lists for any local courses using NLTK

7. Optionally update to new version
7. Optionally update repo version
- we don't want builds from the repository to have the same release number
e.g. after release X.Y.4, update repository version to X.Y.5a (alpha)

5 changes: 5 additions & 0 deletions SECURITY.md
@@ -0,0 +1,5 @@
# Security Policy

## Reporting a Vulnerability

Please report security issues to `nltk.team@gmail.com`
2 changes: 1 addition & 1 deletion jenkins.sh
@@ -24,7 +24,7 @@ if [[ ! -d $senna_folder_name ]]; then
rm ${senna_file_name}
fi

# Setup the Enviroment variable
# Setup the Environment variable
export SENNA=$(pwd)'/senna'

popd
2 changes: 1 addition & 1 deletion nltk/VERSION
@@ -1 +1 @@
3.6.2
3.6.4
4 changes: 2 additions & 2 deletions nltk/__init__.py
@@ -70,8 +70,8 @@
__url__ = "http://nltk.org/"

# Maintainer, contributors, etc.
__maintainer__ = "Steven Bird"
__maintainer_email__ = "stevenbird1@gmail.com"
__maintainer__ = "NLTK Team"
__maintainer_email__ = "nltk.team@gmail.com"
__author__ = __maintainer__
__author_email__ = __maintainer_email__

2 changes: 1 addition & 1 deletion nltk/corpus/reader/comparative_sents.py
@@ -45,7 +45,7 @@
GRAD_COMPARISON = re.compile(r"<cs-[123]>")
NON_GRAD_COMPARISON = re.compile(r"<cs-4>")
ENTITIES_FEATS = re.compile(r"(\d)_((?:[\.\w\s/-](?!\d_))+)")
KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")
KEYWORD = re.compile(r"\(([^\(]*)\)$")


class Comparison:
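The KEYWORD change above swaps a lookahead-heavy pattern for a simple negated character class. Both are meant to capture the contents of the final parenthesized keyword at the end of an annotation line, but the old pattern's combination of lookahead and unbounded quantifiers could backtrack catastrophically, which is the ReDoS vulnerability noted in the ChangeLog. A quick sketch of the new pattern on a made-up line in the style of the corpus annotations:

    import re

    # New pattern: capture everything inside the last "(...)" pair at end of line.
    KEYWORD = re.compile(r"\(([^\(]*)\)$")

    match = KEYWORD.search("this camera has better features (features)")
    print(match.group(1))  # features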
2 changes: 1 addition & 1 deletion nltk/corpus/reader/wordnet.py
@@ -1136,7 +1136,7 @@ def __init__(self, root, omw_reader):
# Map from lemma -> pos -> synset_index -> offset
self._lemma_pos_offset_map = defaultdict(dict)

# A cache so we don't have to reconstuct synsets
# A cache so we don't have to reconstruct synsets
# Map from pos -> offset -> synset
self._synset_offset_cache = defaultdict(dict)

2 changes: 1 addition & 1 deletion nltk/featstruct.py
@@ -1858,7 +1858,7 @@ def _default_fs_class(obj):

class SubstituteBindingsSequence(SubstituteBindingsI):
"""
A mixin class for sequence clases that distributes variables() and
A mixin class for sequence classes that distributes variables() and
substitute_bindings() over the object's elements.
"""

30 changes: 26 additions & 4 deletions nltk/metrics/distance.py
@@ -34,7 +34,13 @@ def _edit_dist_init(len1, len2):
return lev


def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False):
def _last_left_t_init(sigma):
return {c: 0 for c in sigma}


def _edit_dist_step(
lev, i, j, s1, s2, last_left, last_right, substitution_cost=1, transpositions=False
):
c1 = s1[i - 1]
c2 = s2[j - 1]

@@ -47,9 +53,8 @@ def _edit_dist_step(lev, i, j, s1, s2, substitution_cost=1, transpositions=False

# transposition
d = c + 1 # never picked by default
if transpositions and i > 1 and j > 1:
if s1[i - 2] == c2 and s2[j - 2] == c1:
d = lev[i - 2][j - 2] + 1
if transpositions and last_left > 0 and last_right > 0:
d = lev[last_left - 1][last_right - 1] + i - last_left + j - last_right - 1

# pick the cheapest
lev[i][j] = min(a, b, c, d)
@@ -85,18 +90,33 @@ def edit_distance(s1, s2, substitution_cost=1, transpositions=False):
len2 = len(s2)
lev = _edit_dist_init(len1 + 1, len2 + 1)

# retrieve alphabet
sigma = set()
sigma.update(s1)
sigma.update(s2)

# set up table to remember positions of last seen occurrence in s1
last_left_t = _last_left_t_init(sigma)

# iterate over the array
for i in range(len1):
last_right = 0
for j in range(len2):
last_left = last_left_t[s2[j]]
_edit_dist_step(
lev,
i + 1,
j + 1,
s1,
s2,
last_left,
last_right,
substitution_cost=substitution_cost,
transpositions=transpositions,
)
if s1[i] == s2[j]:
last_right = j + 1
last_left_t[s1[i]] = i + 1
return lev[len1][len2]


@@ -162,6 +182,8 @@ def edit_distance_align(s1, s2, substitution_cost=1):
j + 1,
s1,
s2,
0,
0,
substitution_cost=substitution_cost,
transpositions=False,
)
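The rewrite above replaces the old adjacent-swap check (s1[i-2] == c2 and s2[j-2] == c1) with the unrestricted Damerau-Levenshtein algorithm: last_left_t remembers, for each alphabet character, the last row where it occurred in s1, and last_right tracks the last matching column in the current row, so a transposition can now span intervening insertions and deletions. A minimal sketch of the user-visible effect, assuming the unrestricted behavior this change introduces:

    from nltk.metrics.distance import edit_distance

    # Plain Levenshtein: swapping two adjacent characters costs two substitutions.
    print(edit_distance("ab", "ba"))                        # 2

    # With transpositions enabled, the swap counts as a single edit.
    print(edit_distance("ab", "ba", transpositions=True))   # 1

    # The classic case separating unrestricted Damerau-Levenshtein from the
    # old adjacent-only rule: "ca" -> "ac" (transpose) -> "abc" (insert "b")
    # costs 2, where the previous implementation reported 3.
    print(edit_distance("ca", "abc", transpositions=True))  # 2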
2 changes: 1 addition & 1 deletion nltk/parse/util.py
@@ -162,7 +162,7 @@ def run(self, show_trees=False):
Sentences in the test suite are divided into two classes:
- grammatical (``accept``) and
- ungrammatical (``reject``).
If a sentence should parse accordng to the grammar, the value of
If a sentence should parse according to the grammar, the value of
``trees`` will be a non-empty list. If a sentence should be rejected
according to the grammar, then the value of ``trees`` will be None.
"""
16 changes: 8 additions & 8 deletions nltk/sentiment/sentiment_analyzer.py
@@ -47,10 +47,10 @@ def all_words(self, documents, labeled=None):
all_words = []
if labeled is None:
labeled = documents and isinstance(documents[0], tuple)
if labeled == True:
for words, sentiment in documents:
if labeled:
for words, _sentiment in documents:
all_words.extend(words)
elif labeled == False:
elif not labeled:
for words in documents:
all_words.extend(words)
return all_words
@@ -218,7 +218,7 @@ def evaluate(
classifier = self.classifier
print(f"Evaluating {type(classifier).__name__} results...")
metrics_results = {}
if accuracy == True:
if accuracy:
accuracy_score = eval_accuracy(classifier, test_set)
metrics_results["Accuracy"] = accuracy_score

@@ -232,22 +232,22 @@ test_results[observed].add(i)
test_results[observed].add(i)

for label in labels:
if precision == True:
if precision:
precision_score = eval_precision(
gold_results[label], test_results[label]
)
metrics_results[f"Precision [{label}]"] = precision_score
if recall == True:
if recall:
recall_score = eval_recall(gold_results[label], test_results[label])
metrics_results[f"Recall [{label}]"] = recall_score
if f_measure == True:
if f_measure:
f_measure_score = eval_f_measure(
gold_results[label], test_results[label]
)
metrics_results[f"F-measure [{label}]"] = f_measure_score

# Print evaluation results (in alphabetical order)
if verbose == True:
if verbose:
for result in sorted(metrics_results):
print(f"{result}: {metrics_results[result]}")

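The sentiment changes above replace == True and == False comparisons with plain truthiness tests, the idiom PEP 8 recommends. The two are not always interchangeable, which a tiny illustration with a hypothetical flag makes clear:

    flag = "yes"            # truthy, but not the bool True

    print(flag == True)     # False: equality compares values, and "yes" != True
    print(bool(flag))       # True: any non-empty string is truthy

    if flag:                # runs; this is the form the new code uses
        print("truthy")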
