We encountered a performance regression in `sent_tokenize` introduced in NLTK 3.6.6. The slowdown is severe for texts of roughly 1M characters or more. The snippet from #2869 can be modified for demonstration. With NLTK v3.6.6 I get this output of the snippet:
A length of 300 takes 0.0067 s
A length of 3000 takes 0.0130 s
A length of 30000 takes 0.1367 s
A length of 300000 takes 2.4184 s
A length of 3000000 takes 373.1239 s
Going from 300k to 3M characters, the running time increases roughly 150-fold, even though the input is only 10× larger.
With NLTK v3.6.5, by contrast, the running time scales roughly linearly with the number of characters:
A length of 300 takes 0.0070 s
A length of 3000 takes 0.0121 s
A length of 30000 takes 0.1266 s
A length of 300000 takes 1.2523 s
A length of 3000000 takes 12.5471 s
If the text consists of only one sentence (i.e. it contains no periods), for example with `text = "ab " * length` in the snippet above, processing is much faster: only about 0.12 s even for 3M characters.
I used Python 3.8.10 on Ubuntu 20.04.
`word_tokenize` is slowed down similarly, since it calls `sent_tokenize` by default. This issue could be related to #2934.
Thanks again for the very quick fix for this issue!
I wonder if you know when you are going to make the next NLTK release? It's been some time since the latest, v3.7. (Last year 2021 there were 8 releases, but this year there's been only one.)
@juhoinkinen Hopefully early to mid November. Both @stevenbird and I have been swamped recently, hence the lack of releases. Apologies for the inconvenience.