Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers #3042

Merged 2 commits into nltk:develop on Sep 1, 2022

Conversation

tomaarsen (Member)

Resolves #3041

Pull request overview

  • Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers.
  • Slightly improve time efficiency of SyllableTokenizer when tokenizing most inputs.
  • Improve output of SyllableTokenizer when used on numbers.
    • Before: "2014" -> ['20', '1', '4']
    • After: "2014" -> ['2014']
  • Add a test to showcase the change in output (a sketch of its shape follows this list).
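
For reference, a minimal sketch of what such a test can look like (the merged test may be structured differently):

from nltk.tokenize import SyllableTokenizer

def test_syllable_tokenizer_numbers():
    # Numbers should come through as a single token instead of being
    # split up as if each digit were a vowel.
    tokenizer = SyllableTokenizer()
    assert tokenizer.tokenize("2014") == ["2014"]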

Issue

As can be seen in #3041, tokenizing numbers with SyllableTokenizer is quite slow. My experiments showed that doubling the length of a number-only string increases the tokenization time by significantly more than a factor of two, i.e. the runtime grows superlinearly. The primary issue lies in assign_values:

def assign_values(self, token):
    """
    Assigns each phoneme its value from the sonority hierarchy.
    Note: Sentence/text has to be tokenized first.
    :param token: Single word or token
    :type token: str
    :return: List of tuples, first element is character/phoneme and
        second is the sonority value.
    :rtype: list(tuple(str, int))
    """
    syllables_values = []
    for c in token:
        try:
            syllables_values.append((c, self.phoneme_map[c]))
        except KeyError:
            if c not in punctuation:
                warnings.warn(
                    "Character not defined in sonority_hierarchy,"
                    " assigning as vowel: '{}'".format(c)
                )
                syllables_values.append((c, max(self.phoneme_map.values())))
                self.vowels += c
            else:  # If it's a punctuation, assign -1.
                syllables_values.append((c, -1))
    return syllables_values

The first branch of the except clause emits a warning that the character is unknown to the tokenizer. The character is then assigned a value equivalent to a vowel and, crucially, appended to self.vowels. If the input text is simply "9" * 1000, assign_values fills self.vowels up to aeouiy999... with 1000 "9"s. The remaining methods of this class then do membership checks against this self.vowels, use it to build a regex pattern, or loop over it directly. Long story short, all of this gets much, much slower.
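
To see why this matters: membership tests on a Python string scan it linearly, so every duplicate appended to self.vowels slows down all later checks. A standalone micro-benchmark (my own illustration, not code from nltk):

import timeit

short_vowels = "aeouiy"
long_vowels = "aeouiy" + "9" * 1000  # roughly what tokenizing "9" * 1000 produces pre-PR

# "z" is in neither string, so both tests scan to the very end.
print(timeit.timeit(lambda: "z" in short_vowels, number=100_000))
print(timeit.timeit(lambda: "z" in long_vowels, number=100_000))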

Changes

We want self.vowels to behave like a set: it should contain no duplicates. So, I've modified the method to add a character to self.vowels only if it isn't already there. Furthermore, numbers are now treated like punctuation (see the sketch just below).
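
The modified loop then looks roughly like this (a sketch of the change, not necessarily the exact merged diff; punctuation and warnings are imported as in the original module):

def assign_values(self, token):
    syllables_values = []
    for c in token:
        try:
            syllables_values.append((c, self.phoneme_map[c]))
        except KeyError:
            if c not in punctuation and not c.isdigit():
                warnings.warn(
                    "Character not defined in sonority_hierarchy,"
                    " assigning as vowel: '{}'".format(c)
                )
                syllables_values.append((c, max(self.phoneme_map.values())))
                # Only grow self.vowels for genuinely new characters,
                # so it behaves like a set.
                if c not in self.vowels:
                    self.vowels += c
            else:  # Punctuation and digits are assigned -1.
                syllables_values.append((c, -1))
    return syllables_values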
I've also modified validate_syllables slightly: I now compile the regex pattern once, before the start of the loop, rather than re-building it on every iteration (sketched below as well).
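
The validate_syllables change, abridged to the relevant part (a standalone sketch; the real method also handles punctuation and a leading vowel-less fragment):

import re

def validate_syllables_sketch(syllable_list, vowels="aeouiy"):
    # Compile the vowel pattern once, before the loop, instead of
    # re-building a regex from the vowel string on every iteration.
    vowel_pattern = re.compile("|".join(vowels))
    valid_syllables = []
    for syllable in syllable_list:
        if vowel_pattern.search(syllable):
            valid_syllables.append(syllable)
        elif valid_syllables:
            # A vowel-less fragment is folded into the previous syllable.
            valid_syllables[-1] += syllable
        else:
            valid_syllables.append(syllable)
    return valid_syllables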

Performance

Performance comparison, before and after

I've created a small script to track the time-efficiency:

import time
from nltk import SyllableTokenizer

tokenizer = SyllableTokenizer()
for length in [10**i for i in range(2, 7)]:
    token = "9"
    text = token * (length // len(token))
    start_t = time.time()
    output = tokenizer.tokenize(text)
    print(f"Text: \"{token}...{token}\" Length: {length: <7} Time: {time.time() - start_t:.8f}s Output Length: {len(output)}")

The token = "9" line gets modified during these tests, i.e. Test for "ab" refers to using "ab" as the token in this script.

Before this PR

Test for "9"

c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '9'
  warnings.warn(
Text: "9...9" Length: 100     Time: 0.00096536s Output Length: 99
Text: "9...9" Length: 1000    Time: 0.03122854s Output Length: 999
Text: "9...9" Length: 10000   Time: 2.04415488s Output Length: 9999
[I terminated the script after a minute or so]

Test for "a"

Text: "a...a" Length: 100     Time: 0.00100040s Output Length: 99
Text: "a...a" Length: 1000    Time: 0.00199938s Output Length: 999
Text: "a...a" Length: 10000   Time: 0.01996183s Output Length: 9999
Text: "a...a" Length: 100000  Time: 0.26595378s Output Length: 99999
Text: "a...a" Length: 1000000 Time: 1.77999187s Output Length: 999999

Test for "b"

Text: "b...b" Length: 100     Time: 0.00000000s Output Length: 1
Text: "b...b" Length: 1000    Time: 0.00037980s Output Length: 1
Text: "b...b" Length: 10000   Time: 0.00175548s Output Length: 1
Text: "b...b" Length: 100000  Time: 0.03833652s Output Length: 1
Text: "b...b" Length: 1000000 Time: 0.19865584s Output Length: 1

Test for "ab"

Text: "ab...ab" Length: 100     Time: 0.00033975s Output Length: 50
Text: "ab...ab" Length: 1000    Time: 0.00164914s Output Length: 500
Text: "ab...ab" Length: 10000   Time: 0.01277876s Output Length: 5000
Text: "ab...ab" Length: 100000  Time: 0.16249561s Output Length: 50000
Text: "ab...ab" Length: 1000000 Time: 1.18614769s Output Length: 500000

After this PR

Test for "9"

Text: "9...9" Length: 100     Time: 0.00000000s Output Length: 1
Text: "9...9" Length: 1000    Time: 0.00000000s Output Length: 1
Text: "9...9" Length: 10000   Time: 0.00303316s Output Length: 1
Text: "9...9" Length: 100000  Time: 0.06793642s Output Length: 1
Text: "9...9" Length: 1000000 Time: 0.35292196s Output Length: 1

Verdict: Enormously faster. Note also that e.g. "2014" is now tokenized into ["2014"] instead of ["20", "1", "4"], which I would call a big improvement in its own right.

Test for "a"

Text: "a...a" Length: 100     Time: 0.00032091s Output Length: 99
Text: "a...a" Length: 1000    Time: 0.00131679s Output Length: 999
Text: "a...a" Length: 10000   Time: 0.01204681s Output Length: 9999
Text: "a...a" Length: 100000  Time: 0.18806863s Output Length: 99999
Text: "a...a" Length: 1000000 Time: 1.28347945s Output Length: 999999

Verdict: Small improvement, likely due to the re.compile change.

Test for "b"

Text: "b...b" Length: 100     Time: 0.00000000s Output Length: 1
Text: "b...b" Length: 1000    Time: 0.00000000s Output Length: 1
Text: "b...b" Length: 10000   Time: 0.00199986s Output Length: 1
Text: "b...b" Length: 100000  Time: 0.03462195s Output Length: 1
Text: "b...b" Length: 1000000 Time: 0.18890190s Output Length: 1

Verdict: Equivalent.

Test for "ab"

Text: "ab...ab" Length: 100     Time: 0.00000000s Output Length: 50
Text: "ab...ab" Length: 1000    Time: 0.00116086s Output Length: 500
Text: "ab...ab" Length: 10000   Time: 0.00857925s Output Length: 5000
Text: "ab...ab" Length: 100000  Time: 0.11815405s Output Length: 50000
Text: "ab...ab" Length: 1000000 Time: 0.89497209s Output Length: 500000

Verdict: Small improvement, likely due to the re.compile change.

Consequences

Beyond being much faster, this also affects the tokenized output:

from nltk.tokenize import SyllableTokenizer
from nltk import word_tokenize
tokenizer = SyllableTokenizer()
text = "This is a foobar-like sentence from 2014."
print([tokenizer.tokenize(token) for token in word_tokenize(text)])

Output before this PR:

c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '2'
  warnings.warn(
c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '0'
  warnings.warn(
c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '1'
  warnings.warn(
c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '4'
  warnings.warn(
[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['from'], ['20', '1', '4'], ['.']]

Output after this PR:

[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['from'], ['2014'], ['.']]

Thank you @BLKSerene for pointing out this issue.

  • Tom Aarsen

Ideally this would be a test that times out after e.g. 1 second, and then the length can be set to e.g. 100k
@stevenbird merged commit 003cc5d into nltk:develop on Sep 1, 2022
@stevenbird (Member)

Thanks for the careful work @tomaarsen

@tomaarsen changed the title from Perf/syllable tokenize to Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers on Sep 1, 2022
@tomaarsen deleted the perf/syllable_tokenize branch on September 1, 2022 at 14:42
@BLKSerene (Contributor) commented Sep 1, 2022

I must say that reading this PR was a real pleasure...

@arademaker commented Sep 1, 2022

Great work. A model for others (including myself!)

Linked issue (closed by this PR): Sonority sequencing syllable tokenizer performs significantly slower on numbers than on words (#3041)