The following code

from nltk.tokenize import TreebankWordTokenizer

sentence = str('\'\'The Economist,\'\' a Chicago real estate journal, conceded in 1892 that: } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the')
print(list(TreebankWordTokenizer().span_tokenize(sentence)))
yields
ValueError: substring "''" not found in "''The Economist,'' a Chicago real estate journal, conceded in 1892 that: } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the"
It's raised on line 293, in the function align_tokens. My guess is that this snippet in span_tokenize:
if ('"' in text) or ("''" in text):
    # Find double quotes and converted quotes
    matched = [m.group() for m in re.finditer(r"``|'{2}|\"", text)]
    # Replace converted quotes back to double quotes
    tokens = [
        matched.pop(0) if tok in ['"', "``", "''"] else tok
        for tok in raw_tokens
    ]
else:
    tokens = raw_tokens
modifies the quotation marks so that the resulting tokens no longer match the original raw text, which causes a ValueError when align_tokens calls sentence.index(token, point). Not entirely sure, though.
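To illustrate the failure mode, here is a simplified sketch of the alignment step (a stand-in for nltk.tokenize.util.align_tokens, not the actual NLTK implementation): it walks the raw text with str.index, so any token the tokenizer has normalized away from its raw form triggers exactly this ValueError:

```python
def align_tokens_sketch(tokens, text):
    # Simplified stand-in for nltk.tokenize.util.align_tokens:
    # locate each token in the original text, left to right.
    point = 0
    spans = []
    for token in tokens:
        try:
            start = text.index(token, point)
        except ValueError as e:
            raise ValueError(
                f"substring {token!r} not found in {text!r}"
            ) from e
        point = start + len(token)
        spans.append((start, point))
    return spans


text = "''hello'' world"

# Tokens that match the raw text align fine:
print(align_tokens_sketch(["''", "hello", "''", "world"], text))
# → [(0, 2), (2, 7), (7, 9), (10, 15)]

# But if the tokenizer has converted '' to " before alignment,
# the token no longer occurs in the raw text:
try:
    align_tokens_sketch(['"', "hello", '"', "world"], text)
except ValueError as err:
    print(err)
```

This reproduces the same "substring not found" shape of error as in the traceback above, which is why mapping the converted quotes back to their raw form (as the quoted snippet attempts) is necessary before alignment.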
You're very right on your analysis there! Feel free to have a look at #2877. It will be included in the next release.
Until then, I would recommend installing the develop branch, which has this fix in place. E.g. with:
pip install -U git+https://github.com/nltk/nltk
Running your test on that branch gives:
from nltk.tokenize import TreebankWordTokenizer

sentence = str('\'\'The Economist,\'\' a Chicago real estate journal, conceded in 1892 that: } Early tenants, according to Rand McNally, included "great corporations, banks, and professional men ... among them the')
print(list(TreebankWordTokenizer().span_tokenize(sentence)))