Performance regression in sent_tokenize #3013

juhoinkinen · 2022-06-20T20:21:03Z

We encountered a performance regression in sent_tokenize that was introduced in NLTK 3.6.6. The slowdown becomes severe for texts on the order of 1 M characters.

The snippet from #2869 can be modified for demonstration:

import time

from nltk.tokenize import sent_tokenize

n = 7
for length in [10**i for i in range(2, n)]:
    text = "a. " * length  # many short sentences, 3 characters each
    start_t = time.time()
    sent_tokenize(text)
    print(f"A length of {length*3:<{n}} takes {time.time() - start_t:.4f} s")

With NLTK v3.6.6 I get this output of the snippet:

A length of 300     takes 0.0067 s
A length of 3000    takes 0.0130 s
A length of 30000   takes 0.1367 s
A length of 300000  takes 2.4184 s
A length of 3000000 takes 373.1239 s

Going from 300k to 3M characters, the running time increases roughly 150-fold (from 2.4 s to 373 s), far more than the 10-fold increase in length.

With NLTK v3.6.5, by contrast, the running time is roughly proportional to the number of characters:

A length of 300     takes 0.0070 s
A length of 3000    takes 0.0121 s
A length of 30000   takes 0.1266 s
A length of 300000  takes 1.2523 s
A length of 3000000 takes 12.5471 s
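To put a number on the difference between the two outputs above, here is a minimal sketch that estimates the empirical scaling exponent from the reported timings. It takes the log-log slope between consecutive measurements; the helper name `local_exponents` is made up for this illustration, and the timings are simply copied from the outputs above.

```python
import math

# Timings copied from the outputs reported above: (characters, seconds).
timings = {
    "3.6.6": [(300, 0.0067), (3000, 0.0130), (30000, 0.1367),
              (300000, 2.4184), (3000000, 373.1239)],
    "3.6.5": [(300, 0.0070), (3000, 0.0121), (30000, 0.1266),
              (300000, 1.2523), (3000000, 12.5471)],
}


def local_exponents(points):
    """Return the log-log slope between each pair of consecutive measurements.

    A slope near 1 means time grows linearly with length,
    a slope near 2 means it grows quadratically.
    """
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(points, points[1:])
    ]


for version, points in timings.items():
    slopes = ", ".join(f"{k:.2f}" for k in local_exponents(points))
    print(f"NLTK v{version}: {slopes}")
```

For the last step (300k to 3M characters) the slope comes out at about 2.2 for v3.6.6 and about 1.0 for v3.6.5, which suggests at least quadratic behaviour in the newer version, at least for this input.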

If the text consists of only one sentence (that is, it contains no periods), e.g. if text = "ab " * length is used in the above code snippet, processing is much faster: only 0.12 s even for 3M characters.

I used Python 3.8.10 on Ubuntu 20.04.

word_tokenize is slowed down similarly, because it calls sent_tokenize by default. This issue could be related to #2934.
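As an illustration of the word_tokenize point, the sketch below uses the `preserve_line` parameter of `word_tokenize`, which makes it treat the input as a single line instead of splitting it with sent_tokenize first. This is only a possible way to sidestep the sentence tokenizer, not a fix for the regression, and the resulting tokens may differ from the default because the word tokenizer expects input that is already split into sentences.

```python
from nltk.tokenize import word_tokenize

text = "a. " * 1000

# preserve_line=True skips the internal sent_tokenize call, so the cost of
# sentence splitting is not paid. Assumption: it is acceptable for the
# application that sentence boundaries are not detected first.
tokens = word_tokenize(text, preserve_line=True)
print(len(tokens))
```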

juhoinkinen · 2022-10-24T12:21:47Z

Hi @tomaarsen et al.!

Thanks again for the very quick fix for this issue!

Do you know when the next NLTK release is planned? It's been some time since the latest, v3.7. (Last year, in 2021, there were 8 releases, but this year there has been only one.)

tomaarsen · 2022-10-24T12:27:05Z

@juhoinkinen Hopefully early to mid November. Both @stevenbird and I have been swamped recently, hence the lack of releases. Apologies for the inconvenience.

tomaarsen self-assigned this Jun 20, 2022

tomaarsen mentioned this issue Jun 21, 2022

Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 #3014

Merged

stevenbird closed this as completed in #3014 Jul 4, 2022
