`TreebankWordDetokenizer().detokenize()` introduces unexpected spaces before periods. #3210

Alnusjaponica · 2023-12-05T05:36:28Z

Description

The TreebankWordDetokenizer().detokenize() method introduces extra spaces before periods when periods are treated as separate tokens in the input. The issue arises from the spaces added here:

nltk/nltk/tokenize/treebank.py, line 362 in d7b428d:

text = " " + text + " "

which are not properly removed when there are words following the period.
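The effect can be illustrated without NLTK. The sketch below is a minimal illustration, assuming the detokenizer only attaches a period when it sits at the very end of the padded string. `END_PERIOD` and `attach_final_period` are hypothetical names, and the pattern is a simplified stand-in, not NLTK's actual regex.

```python
import re

# Hypothetical, simplified stand-in for an end-anchored period rule.
# This is NOT NLTK's actual pattern; it only mimics the reported behavior.
END_PERIOD = re.compile(r"\s(\.)\s*$")


def attach_final_period(tokens):
    """Join tokens with spaces and attach only a trailing period."""
    # Pad the joined text, as the line quoted above does.
    text = " " + " ".join(tokens) + " "
    # The rule is anchored at the end of the string, so a period that is
    # followed by more words never matches and keeps its leading space.
    text = END_PERIOD.sub(r"\1", text)
    return text.strip()


tokens = ["Lorem", "ipsum", "dolor", "sit", "amet", ".",
          "consectetur", "adipiscing", "elit", "."]
print(attach_final_period(tokens))
```

Under this assumption, only the last period is attached to its word, while the period after "amet" keeps its leading space.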

Reproducible code

import nltk
from nltk import pos_tag, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')


text = "Lorem ipsum dolor sit amet. consectetur adipiscing elit."
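d = TreebankWordDetokenizer()
# Tokenize and tag the text, then keep only the word tokens.
tagged_words = pos_tag(word_tokenize(text))
words = [word for word, tag in tagged_words]
# Reuse the detokenizer created above instead of instantiating a second one.
print(d.detokenize(words))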

This code snippet produces the following output:

Lorem ipsum dolor sit amet . consectetur adipiscing elit.

which contains an unexpected space before the first period.

Expected behavior

The expected output from TreebankWordDetokenizer().detokenize() should be:

Lorem ipsum dolor sit amet. consectetur adipiscing elit.
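
As a possible stopgap on the caller's side, a post-processing step can strip the whitespace before standalone periods. This is only a sketch: `strip_space_before_periods` is a hypothetical helper, not part of NLTK, and it assumes every standalone `.` token should attach to the preceding word.

```python
import re

# Matches whitespace before a period that is itself followed by whitespace
# or the end of the string. Ellipses ("...") and decimals ("3 .5") are left
# alone because their first period is followed by another character.
_SPACE_BEFORE_PERIOD = re.compile(r"\s+\.(?=\s|$)")


def strip_space_before_periods(text):
    """Hypothetical helper: remove spaces in front of standalone periods."""
    return _SPACE_BEFORE_PERIOD.sub(".", text)


detokenized = "Lorem ipsum dolor sit amet . consectetur adipiscing elit."
print(strip_space_before_periods(detokenized))
```

Applying it to the output of `detokenize()` leaves correctly attached periods untouched, so it should be safe to run even after the underlying behavior is fixed.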

Environment

OS: macOS 14.1.1
Python: 3.11.6
nltk: 3.8.1

Alnusjaponica mentioned this issue Dec 5, 2023

Eliminate unexpected spaces introduced before periods by TreebankWordDetokenizer().detokenize(). citadel-ai/langcheck#62

Open
