Performance regression in sent_tokenize #3013

Closed
juhoinkinen opened this issue Jun 20, 2022 · 2 comments · Fixed by #3014

@juhoinkinen

We encountered a performance regression in sent_tokenize that was introduced in NLTK 3.6.6. The slowdown is severe for texts of around 1M characters.

The snippet from #2869 can be modified for demonstration:

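import time

from nltk.tokenize import sent_tokenize

n = 7
# Time sent_tokenize on inputs of 300 to 3,000,000 characters ("a. " repeated).
for length in [10**i for i in range(2, n)]:
    text = "a. " * length
    start_t = time.time()
    sent_tokenize(text)
    print(f"A length of {length*3:<{n}} takes {time.time() - start_t:.4f} s")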

With NLTK v3.6.6, the snippet produces this output:

A length of 300     takes 0.0067 s
A length of 3000    takes 0.0130 s
A length of 30000   takes 0.1367 s
A length of 300000  takes 2.4184 s
A length of 3000000 takes 373.1239 s

Going from 300k to 3M characters, the running time increases by a factor of about 150 (2.4 s to 373 s), far more than the tenfold increase in input size.

With NLTK v3.6.5, by contrast, the running time appears roughly proportional to the number of characters:

A length of 300     takes 0.0070 s
A length of 3000    takes 0.0121 s
A length of 30000   takes 0.1266 s
A length of 300000  takes 1.2523 s
A length of 3000000 takes 12.5471 s
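
To put a number on the difference, the two outputs above can be converted into a rough scaling exponent for each tenfold step in input size. This is only a minimal sketch that reuses the timings reported in this issue; the helper name `growth_exponents` is made up for illustration and is not part of NLTK.

```python
import math

# Timings in seconds, copied from the two outputs above,
# keyed by text length in characters.
timings_366 = {300: 0.0067, 3000: 0.0130, 30000: 0.1367,
               300000: 2.4184, 3000000: 373.1239}
timings_365 = {300: 0.0070, 3000: 0.0121, 30000: 0.1266,
               300000: 1.2523, 3000000: 12.5471}


def growth_exponents(timings):
    """Hypothetical helper: estimate k in t ~ n**k between consecutive sizes.

    Returns a list of (larger_length, k) pairs, where
    k = log(t2 / t1) / log(n2 / n1).
    """
    lengths = sorted(timings)
    result = []
    for n1, n2 in zip(lengths, lengths[1:]):
        k = math.log(timings[n2] / timings[n1]) / math.log(n2 / n1)
        result.append((n2, k))
    return result


for name, timings in [("v3.6.6", timings_366), ("v3.6.5", timings_365)]:
    for n, k in growth_exponents(timings):
        print(f"{name}: up to {n:>7} chars, exponent ~ {k:.2f}")
```

An exponent near 1 indicates linear scaling and an exponent near 2 indicates quadratic scaling. For the last step (300k to 3M characters), the v3.6.5 timings give an exponent close to 1, while the v3.6.6 timings give one above 2. The smallest inputs are dominated by fixed overhead, so their exponents say little.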

If the text consists of only one sentence (that is, it contains no periods), for example if text = "ab " * length is used in the snippet above, processing times are much shorter: only 0.12 s even for 3M characters.

I used Python 3.8.10 on Ubuntu 20.04.

word_tokenize is slowed down similarly, because it uses sent_tokenize by default. This issue could be related to #2934.

@juhoinkinen
Author

Hi @tomaarsen et al.!

Thanks again for the very quick fix for this issue!

Do you know when the next NLTK release will be? It has been some time since the latest one, v3.7. (In 2021 there were 8 releases, but this year there has been only one.)

@tomaarsen
Member

@juhoinkinen Hopefully early to mid November. Both @stevenbird and I have been swamped recently, hence the lack of releases. Apologies for the inconvenience.
