Greatly improve time efficiency of SyllableTokenizer
when tokenizing numbers
#3042
Resolves #3041
Pull request overview
- Greatly improved the time efficiency of `SyllableTokenizer` when tokenizing numbers.
- Slightly improved the time efficiency of `SyllableTokenizer` when tokenizing most inputs.
- Changed the output of `SyllableTokenizer` when used on numbers: `"2014"` -> `['20', '1', '4']` becomes `"2014"` -> `['2014']`.
Issue
As can be seen in #3041, tokenizing numbers with `SyllableTokenizer` is quite slow. My experiments showed that doubling the length of a number-only string increases the time to tokenize that string by significantly more than a factor of two. The primary issue lies in `assign_values`:

nltk/nltk/tokenize/sonority_sequencing.py, lines 85 to 110 in 13cea29
The first case of the `except` branch emits a warning that the character is unknown to the tokenizer. The value is then assigned to be equivalent to a vowel, and crucially, the character is added to `self.vowels`. If the input text is simply `"9" * 1000`, then this `assign_values` method fills `self.vowels` to be `aeouiy999...`, with 1000 `"9"`'s. The remaining methods of this class then e.g. do membership checks against this `self.vowels`, use `self.vowels` to build a regex pattern, or loop over `self.vowels` directly. Long story short, this all gets much, much slower.

Changes
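The blow-up described in the issue can be illustrated with a toy version of the value-assignment loop (a minimal sketch with hypothetical names and sonority values, not NLTK's actual implementation):

```python
import warnings

def assign_values_toy(text, phoneme_map, vowels):
    """Toy model of the pre-fix behavior: a character missing from
    `phoneme_map` is treated as a vowel and appended to `vowels`
    unconditionally, once per occurrence."""
    values = []
    for char in text:
        try:
            values.append((char, phoneme_map[char]))
        except KeyError:
            warnings.warn(f"Character not defined in sonority hierarchy: {char}")
            vowels += char             # grows on EVERY unknown character
            values.append((char, 3))   # hypothetical vowel sonority value
    return values, vowels

phonemes = {c: 3 for c in "aeiouy"}    # hypothetical map, vowels only
_, grown = assign_values_toy("9" * 1000, phonemes, "aeiouy")
# `grown` is now "aeiouy" followed by 1000 "9"s; every later membership
# check, regex build, or loop over it pays for those extra characters.
```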
We want `self.vowels` to behave like a set: we don't want any duplication in it. So, I've modified the method to only add a character to `self.vowels` if it isn't already in `self.vowels`. Furthermore, I'm now treating numbers like punctuation.

I've also modified `validate_syllables` slightly: I now create the regex pattern once, before the start of the loop, rather than re-building it on every iteration.

Performance
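Both changes can be sketched roughly as follows (hypothetical names and sonority values; this is an illustration, not the actual diff in `sonority_sequencing.py`):

```python
import re

def assign_values_fixed(text, phoneme_map, vowels):
    """Sketch of the fix: digits are treated like punctuation rather
    than vowels, and genuinely unknown characters are only appended to
    `vowels` when they are not already present."""
    values = []
    for char in text:
        try:
            values.append((char, phoneme_map[char]))
        except KeyError:
            if char.isdigit():
                values.append((char, -1))  # punctuation-like value (hypothetical)
            else:
                if char not in vowels:     # keep `vowels` duplicate-free
                    vowels += char
                values.append((char, 3))   # hypothetical vowel value
    return values, vowels

phonemes = {c: 3 for c in "aeiouy"}
_, vowels = assign_values_fixed("9" * 1000, phonemes, "aeiouy")
# `vowels` stays "aeiouy": digits no longer pollute it.

# The validate_syllables change, in spirit: compile the vowel pattern
# once, up front, instead of re-building it for every syllable.
vowel_pattern = re.compile(f"[{re.escape(vowels)}]")
```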
Performance comparison, before and after
I've created a small script to track the time efficiency. The `token = "9"` line gets modified during these tests, i.e. "Test for "ab"" refers to using `"ab"` as the `token` in this script.

Before this PR
Test for "9"
Test for "a"
Test for "b"
Test for "ab"
After this PR
Test for "9"
Verdict: Enormously faster, but note that e.g. `2014` is now tokenized into `["2014"]` instead of `["20", "1", "4"]`. I would call this a big improvement.

Test for "a"
Verdict: Small improvement, likely due to the `re.compile` change.

Test for "b"
Verdict: Equivalent.
Test for "ab"
Verdict: Small improvement, likely due to the `re.compile` change.

Consequences
Beyond being much faster, this also affects the tokenized output: numbers are now kept together rather than split apart, e.g. `"2014"` now tokenizes to `['2014']` instead of `['20', '1', '4']`.
Thank you @BLKSerene for pointing out this issue.