Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers #3042

Merged 2 commits into nltk:develop on Sep 1, 2022

Conversation

tomaarsen (Member)

Resolves #3041

Pull request overview

  • Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers.
  • Slightly improve time efficiency of SyllableTokenizer when tokenizing most inputs.
  • Improve output of SyllableTokenizer when used on numbers.
    • Before: "2014" -> ['20', '1', '4']
    • After: "2014" -> ['2014']
  • Add a test to showcase the change in output (a sketch of its shape follows this list).
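
For reference, a minimal sketch of what such a test can look like (the merged test may be structured differently):

from nltk.tokenize import SyllableTokenizer

def test_syllable_tokenizer_numbers():
    # Numbers should come through as a single token instead of being
    # split up as if each digit were a vowel.
    tokenizer = SyllableTokenizer()
    assert tokenizer.tokenize("2014") == ["2014"]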

Issue

As can be seen in #3041, tokenizing numbers with SyllableTokenizer is quite slow. My experiments showed that doubling the length of a number-only string increases the tokenization time by significantly more than a factor of two, i.e. the runtime grows superlinearly. The primary issue lies in assign_values:

def assign_values(self, token):
    """
    Assigns each phoneme its value from the sonority hierarchy.
    Note: Sentence/text has to be tokenized first.
    :param token: Single word or token
    :type token: str
    :return: List of tuples, first element is character/phoneme and
        second is the sonority value.
    :rtype: list(tuple(str, int))
    """
    syllables_values = []
    for c in token:
        try:
            syllables_values.append((c, self.phoneme_map[c]))
        except KeyError:
            if c not in punctuation:
                warnings.warn(
                    "Character not defined in sonority_hierarchy,"
                    " assigning as vowel: '{}'".format(c)
                )
                syllables_values.append((c, max(self.phoneme_map.values())))
                self.vowels += c
            else:  # If it's a punctuation, assign -1.
                syllables_values.append((c, -1))
    return syllables_values

The first branch of the except clause emits a warning that the character is unknown to the tokenizer. The character is then assigned a value equivalent to a vowel and, crucially, appended to self.vowels. If the input text is simply "9" * 1000, assign_values fills self.vowels up to aeouiy999... with 1000 "9"s. The remaining methods of this class then do membership checks against this self.vowels, use it to build a regex pattern, or loop over it directly. Long story short, all of this gets much, much slower.
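
To see why this matters: membership tests on a Python string scan it linearly, so every duplicate appended to self.vowels slows down all later checks. A standalone micro-benchmark (my own illustration, not code from nltk):

import timeit

short_vowels = "aeouiy"
long_vowels = "aeouiy" + "9" * 1000  # roughly what tokenizing "9" * 1000 produces pre-PR

# "z" is in neither string, so both tests scan to the very end.
print(timeit.timeit(lambda: "z" in short_vowels, number=100_000))
print(timeit.timeit(lambda: "z" in long_vowels, number=100_000))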

Changes

We want self.vowels to behave like a set: it should contain no duplicates. So, I've modified the method to add a character to self.vowels only if it isn't already there. Furthermore, numbers are now treated like punctuation (see the sketch just below).
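
The modified loop then looks roughly like this (a sketch of the change, not necessarily the exact merged diff; punctuation and warnings are imported as in the original module):

def assign_values(self, token):
    syllables_values = []
    for c in token:
        try:
            syllables_values.append((c, self.phoneme_map[c]))
        except KeyError:
            if c not in punctuation and not c.isdigit():
                warnings.warn(
                    "Character not defined in sonority_hierarchy,"
                    " assigning as vowel: '{}'".format(c)
                )
                syllables_values.append((c, max(self.phoneme_map.values())))
                # Only grow self.vowels for genuinely new characters,
                # so it behaves like a set.
                if c not in self.vowels:
                    self.vowels += c
            else:  # Punctuation and digits are assigned -1.
                syllables_values.append((c, -1))
    return syllables_values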
I've also modified validate_syllables slightly: I now compile the regex pattern once, before the start of the loop, rather than re-building it on every iteration (sketched below as well).
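
The validate_syllables change, abridged to the relevant part (a standalone sketch; the real method also handles punctuation and a leading vowel-less fragment):

import re

def validate_syllables_sketch(syllable_list, vowels="aeouiy"):
    # Compile the vowel pattern once, before the loop, instead of
    # re-building a regex from the vowel string on every iteration.
    vowel_pattern = re.compile("|".join(vowels))
    valid_syllables = []
    for syllable in syllable_list:
        if vowel_pattern.search(syllable):
            valid_syllables.append(syllable)
        elif valid_syllables:
            # A vowel-less fragment is folded into the previous syllable.
            valid_syllables[-1] += syllable
        else:
            valid_syllables.append(syllable)
    return valid_syllables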

Performance

Performance comparison, before and after

I've created a small script to track the time-efficiency:

import time
from nltk import SyllableTokenizer

tokenizer = SyllableTokenizer()
for length in [10**i for i in range(2, 7)]:
    token = "9"
    text = token * (length // len(token))
    start_t = time.time()
    output = tokenizer.tokenize(text)
    print(f"Text: \"{token}...{token}\" Length: {length: <7} Time: {time.time() - start_t:.8f}s Output Length: {len(output)}")

The token = "9" line gets modified during these tests, i.e. Test for "ab" refers to using "ab" as the token in this script.

Before this PR

Test for "9"

c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '9'
  warnings.warn(
Text: "9...9" Length: 100     Time: 0.00096536s Output Length: 99
Text: "9...9" Length: 1000    Time: 0.03122854s Output Length: 999
Text: "9...9" Length: 10000   Time: 2.04415488s Output Length: 9999
[I terminated the script after a minute or so]

Test for "a"

Text: "a...a" Length: 100     Time: 0.00100040s Output Length: 99
Text: "a...a" Length: 1000    Time: 0.00199938s Output Length: 999
Text: "a...a" Length: 10000   Time: 0.01996183s Output Length: 9999
Text: "a...a" Length: 100000  Time: 0.26595378s Output Length: 99999
Text: "a...a" Length: 1000000 Time: 1.77999187s Output Length: 999999

Test for "b"

Text: "b...b" Length: 100     Time: 0.00000000s Output Length: 1
Text: "b...b" Length: 1000    Time: 0.00037980s Output Length: 1
Text: "b...b" Length: 10000   Time: 0.00175548s Output Length: 1
Text: "b...b" Length: 100000  Time: 0.03833652s Output Length: 1
Text: "b...b" Length: 1000000 Time: 0.19865584s Output Length: 1

Test for "ab"

Text: "ab...ab" Length: 100     Time: 0.00033975s Output Length: 50
Text: "ab...ab" Length: 1000    Time: 0.00164914s Output Length: 500
Text: "ab...ab" Length: 10000   Time: 0.01277876s Output Length: 5000
Text: "ab...ab" Length: 100000  Time: 0.16249561s Output Length: 50000
Text: "ab...ab" Length: 1000000 Time: 1.18614769s Output Length: 500000

After this PR

Test for "9"

Text: "9...9" Length: 100     Time: 0.00000000s Output Length: 1
Text: "9...9" Length: 1000    Time: 0.00000000s Output Length: 1
Text: "9...9" Length: 10000   Time: 0.00303316s Output Length: 1
Text: "9...9" Length: 100000  Time: 0.06793642s Output Length: 1
Text: "9...9" Length: 1000000 Time: 0.35292196s Output Length: 1

Verdict: Enormously faster. Note also that e.g. "2014" is now tokenized into ["2014"] instead of ["20", "1", "4"], which I would call a big improvement in its own right.

Test for "a"

Text: "a...a" Length: 100     Time: 0.00032091s Output Length: 99
Text: "a...a" Length: 1000    Time: 0.00131679s Output Length: 999
Text: "a...a" Length: 10000   Time: 0.01204681s Output Length: 9999
Text: "a...a" Length: 100000  Time: 0.18806863s Output Length: 99999
Text: "a...a" Length: 1000000 Time: 1.28347945s Output Length: 999999

Verdict: Small improvement, likely due to the re.compile change.

Test for "b"

Text: "b...b" Length: 100     Time: 0.00000000s Output Length: 1
Text: "b...b" Length: 1000    Time: 0.00000000s Output Length: 1
Text: "b...b" Length: 10000   Time: 0.00199986s Output Length: 1
Text: "b...b" Length: 100000  Time: 0.03462195s Output Length: 1
Text: "b...b" Length: 1000000 Time: 0.18890190s Output Length: 1

Verdict: Equivalent.

Test for "ab"

Text: "ab...ab" Length: 100     Time: 0.00000000s Output Length: 50
Text: "ab...ab" Length: 1000    Time: 0.00116086s Output Length: 500
Text: "ab...ab" Length: 10000   Time: 0.00857925s Output Length: 5000
Text: "ab...ab" Length: 100000  Time: 0.11815405s Output Length: 50000
Text: "ab...ab" Length: 1000000 Time: 0.89497209s Output Length: 500000

Verdict: Small improvement, likely due to the re.compile change.

Consequences

Beyond being much faster, this also affects the tokenized output:

from nltk.tokenize import SyllableTokenizer
from nltk import word_tokenize
tokenizer = SyllableTokenizer()
text = "This is a foobar-like sentence from 2014."
print([tokenizer.tokenize(token) for token in word_tokenize(text)])

Output before this PR:

c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '2'
  warnings.warn(
c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '0'
  warnings.warn(
c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '1'
  warnings.warn(
c:\GitHub\nltk\nltk\tokenize\sonority_sequencing.py:102: UserWarning: Character not defined in sonority_hierarchy, assigning as vowel: '4'
  warnings.warn(
[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['from'], ['20', '1', '4'], ['.']]

Output after this PR:

[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['from'], ['2014'], ['.']]

Thank you @BLKSerene for pointing out this issue.

  • Tom Aarsen

Ideally this would be a test that times out after e.g. 1 second, and then the length can be set to e.g. 100k
@stevenbird merged commit 003cc5d into nltk:develop on Sep 1, 2022
@stevenbird (Member)

Thanks for the careful work @tomaarsen

@tomaarsen changed the title from Perf/syllable tokenize to Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers on Sep 1, 2022
@tomaarsen deleted the perf/syllable_tokenize branch on September 1, 2022 at 14:42
@BLKSerene (Contributor) commented Sep 1, 2022

I must say that reading this PR was a real pleasure...

@arademaker commented Sep 1, 2022

Great work. A model for others (including myself!)

Linked issue (closed by this PR): Sonority sequencing syllable tokenizer performs significantly slower on numbers than on words (#3041)