span_tokenize yields ValueError for quotation marks #2890

janscholich · 2021-11-24T16:32:26Z

The following code

from nltk.tokenize import TreebankWordTokenizer

sentence = str('\'\'The Economist,\'\' a Chicago real estate journal, conceded in 1892 that:  } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the')
print(list(TreebankWordTokenizer().span_tokenize(sentence)))

yields

ValueError: substring "''" not found in "''The Economist,'' a Chicago real estate journal, conceded in 1892 that:  } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the"

It's raised in line 293 of the function align_tokens. My guess is that this snippet in span_tokenize:

if ('"' in text) or ("''" in text):
    # Find double quotes and converted quotes
    matched = [m.group() for m in re.finditer(r"``|'{2}|\"", text)]

    # Replace converted quotes back to double quotes
    tokens = [
        matched.pop(0) if tok in ['"', "``", "''"] else tok
        for tok in raw_tokens
    ]
else:
    tokens = raw_tokens

modifies the quotation marks so that a token no longer matches the original raw text, which causes a ValueError when align_tokens calls sentence.index(token, point). I'm not entirely sure, though.
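
To make the suspected failure mode concrete, here is a minimal sketch that does not use NLTK. `align_tokens_sketch` is a simplified stand-in for align_tokens, `restore_quotes` copies the replacement logic from the snippet above, and the `raw_tokens` list is hypothetical: it assumes the leading '' stays attached to the following word, which is a guess and not verified tokenizer output. Under that assumption the list of matched quotes falls out of step with the quote tokens, and the lookup fails.

```python
import re


def align_tokens_sketch(tokens, text):
    """Simplified stand-in for align_tokens: locate each token left to right."""
    point = 0
    spans = []
    for token in tokens:
        # str.index raises ValueError when the token is not found after `point`.
        start = text.index(token, point)
        point = start + len(token)
        spans.append((start, point))
    return spans


def restore_quotes(raw_tokens, text):
    """Replacement logic copied from the span_tokenize snippet."""
    matched = [m.group() for m in re.finditer(r"``|'{2}|\"", text)]
    return [
        matched.pop(0) if tok in ['"', "``", "''"] else tok
        for tok in raw_tokens
    ]


text = "''A'' said \"B\""

# Hypothetical raw tokens (assumption): the leading '' is not split from "A",
# so there are four quote matches in the text but only three quote tokens.
raw_tokens = ["''A", "''", "said", "``", "B", "''"]
tokens = restore_quotes(raw_tokens, text)

error = None
try:
    align_tokens_sketch(tokens, text)
except ValueError as exc:
    error = exc

# When every quote is its own token, the replacement lines up and spans resolve.
good_tokens = restore_quotes(["''", "A", "''", "said", "``", "B", "''"], text)
good_spans = align_tokens_sketch(good_tokens, text)
```

In the first case the opening " is replaced by '' (the second match instead of the third), and no '' exists after "said", so the index lookup raises. Whether this is exactly what happens inside the tokenizer is an assumption here.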


tomaarsen · 2021-11-24T16:38:36Z

Your analysis is right! Feel free to have a look at #2877, which fixes this. It will be included in the next release.
Until then, I would recommend installing the develop branch, which already has the fix in place, for example with:

pip install -U git+https://github.com/nltk/nltk

Running your test on that branch gives:

from nltk.tokenize import TreebankWordTokenizer

sentence = str('\'\'The Economist,\'\' a Chicago real estate journal, conceded in 1892 that:  } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the')
print(list(TreebankWordTokenizer().span_tokenize(sentence)))

[(0, 2), (2, 5), (6, 15), (15, 16), (16, 18), (19, 20), (21, 28), (29, 33), (34, 40), (41, 48), (48, 49), (50, 58), (59, 61), (62, 66), (67, 71), (71, 72), (74, 75), (76, 81), (82, 89), (89, 90), (91, 100), (101, 103), (104, 108), (109, 116), (116, 117), (118, 126), (127, 128), (128, 133), (134, 146), (146, 147), (148, 153), (153, 154), (155, 158), (159, 171), (172, 175), (176, 179), (180, 185), (186, 190), (191, 194)]
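
Each pair in that output is a (start, end) character offset into the original string, so slicing the sentence recovers the token text. A small sketch, using only the first five spans from the output above and a shortened copy of the sentence (the variable names are illustrative):

```python
# Prefix of the sentence from the issue; offsets in this prefix are unchanged.
sentence_prefix = "''The Economist,'' a Chicago real estate journal"

# First five spans copied from the output above.
first_spans = [(0, 2), (2, 5), (6, 15), (15, 16), (16, 18)]

# Slice the original text with each span to recover the token strings.
recovered = [sentence_prefix[start:end] for start, end in first_spans]
print(recovered)
```

Because the spans index into the raw text, the recovered quote tokens are the original '' characters and not the converted Treebank-style quotes.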

janscholich · 2021-11-24T16:41:46Z

Thanks a lot for the quick response! Saves me a lot of time :)

tomaarsen · 2021-11-24T16:42:38Z

Gladly. I'll close this then. Feel free to let us know if you encounter more issues!

tomaarsen added the resolved and tokenizer labels on Nov 24, 2021

tomaarsen closed this as completed Nov 24, 2021