Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

span_tokenize yields ValueError for quotation marks #2890

Closed
janscholich opened this issue Nov 24, 2021 · 3 comments
Closed

span_tokenize yields ValueError for quotation marks #2890

janscholich opened this issue Nov 24, 2021 · 3 comments

Comments

@janscholich
Copy link

The following code

sentence = str('\'\'The Economist,\'\' a Chicago real estate journal, conceded in 1892 that:  } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the')
print(list(TreebankWordTokenizer().span_tokenize(sentence)))

yields

ValueError: substring "''" not found in "''The Economist,'' a Chicago real estate journal, conceded in 1892 that:  } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the"

It's raised in line 293 of the function align_tokens. My guess is that this snippet in span_tokenize:

if ('"' in text) or ("''" in text):
            # Find double quotes and converted quotes
            matched = [m.group() for m in re.finditer(r"``|'{2}|\"", text)]

            # Replace converted quotes back to double quotes
            tokens = [
                matched.pop(0) if tok in ['"', "``", "''"] else tok
                for tok in raw_tokens
            ]
        else:
            tokens = raw_tokens

modifies the quotation marks in such a way that it differs from the token in the original raw text which causes a ValueError when calling sentence.index(token, point) (in align_tokens). Not entirely sure though.

@tomaarsen
Copy link
Member

tomaarsen commented Nov 24, 2021

You're very right on your analysis there! Feel free to have a look at #2877. It will be included in the next release.
Until then, I would recommend installing the develop branch, which has this fix in place. E.g. with:

pip install -U git+https://github.com/nltk/nltk

Running your test on that branch gives:

from nltk.tokenize import TreebankWordTokenizer

sentence = str('\'\'The Economist,\'\' a Chicago real estate journal, conceded in 1892 that:  } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the')
print(list(TreebankWordTokenizer().span_tokenize(sentence)))
[(0, 2), (2, 5), (6, 15), (15, 16), (16, 18), (19, 20), (21, 28), (29, 33), (34, 40), (41, 48), (48, 49), (50, 58), (59, 61), (62, 66), (67, 71), (71, 72), (74, 75), (76, 81), (82, 89), (89, 90), (91, 100), (101, 103), (104, 108), (109, 116), (116, 117), (118, 126), (127, 128), (128, 133), (134, 146), (146, 147), (148, 153), (153, 154), (155, 158), (159, 171), (172, 175), (176, 179), (180, 185), (186, 190), (191, 194)]

@janscholich
Copy link
Author

Thanks a lot for the quick response! Saves me a lot of time :)

@tomaarsen
Copy link
Member

Gladly. I'll close this then. Feel free to let us know if you encounter more issues!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants