No numbers in phonemes set and collapse of whitespaces #105

Open
anh opened this issue May 17, 2021 · 2 comments
anh commented May 17, 2021

When using phonemizer (espeak-ng), the output contains digits that reflect the vowel/sound variants, as in the following:

import phonemizer

text = 'Có lối ra, chúng ta qua đó xem sao.'
phonemizer.phonemize(
    text,
    language='vi',
    backend='espeak',
    strip=False,
    preserve_punctuation=True,
    punctuation_marks=';:,.!?¡¿—…"«»“”',
    with_stress=True,
    language_switch='keep-flags',
    njobs=1
)

output:

'ɡˈɔɜ lˈoɪɜ zˈaː7 , tɕˈuɜŋ t̪ˈaː1 wˈaː1 ɗˈɔɜ sˈɛ1m ʂˈaːʊ7 .'

With tokenizer._postprocess, which applies:

text = ''.join([c for c in text if c in all_phonemes]) # --> will remove numbers which are not in phonemes set 
text = _collapse_whitespace(text)

output:

ɡˈɔɜ lˈoɪɜ zˈaː,tɕˈuɜŋ tˈaː wˈaː ɗˈɔɜ sˈɛm ʂˈaːʊ.

Outputs placed together:

ɡˈɔɜ lˈoɪɜ zˈaː7 , tɕˈuɜŋ t̪ˈaː1 wˈaː1 ɗˈɔɜ sˈɛ1m ʂˈaːʊ7 .
ɡˈɔɜ lˈoɪɜ zˈaː,tɕˈuɜŋ tˈaː wˈaː ɗˈɔɜ sˈɛm ʂˈaːʊ.

My question: will the missing numbers (here 7 and 1) and the missing spaces around punctuation such as the comma (zˈaː,tɕˈuɜŋ tˈaː instead of zˈaː7 , tɕˈuɜŋ t̪ˈaː1) affect the alignment and the pauses between generated words?
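For reference, the effect described above can be reproduced with a minimal sketch. The symbol set `ALLOWED` and the helper `collapse_whitespace` below are hypothetical stand-ins, not the project's actual `all_phonemes` or `_collapse_whitespace`; the assumption is only that unknown characters are dropped and that spaces next to punctuation are removed, as the outputs above suggest.

```python
import re

# Hypothetical stand-in for the project's symbol set: just the characters
# needed for this example, with no digits and no combining diacritics.
ALLOWED = set("ɡˈɔɜloɪzaːtɕuŋwɗsɛmʂʊ") | set(" ,.")
PUNCTUATION = ",."


def collapse_whitespace(text: str) -> str:
    """Assumed behaviour: squeeze whitespace runs and drop spaces next to punctuation."""
    text = re.sub(r"\s+", " ", text)
    pattern = r"\s*([" + re.escape(PUNCTUATION) + r"])\s*"
    return re.sub(pattern, r"\1", text).strip()


def postprocess(text: str) -> str:
    # Characters outside the symbol set (here the digits) are silently dropped.
    text = "".join(c for c in text if c in ALLOWED)
    return collapse_whitespace(text)


raw = "ɡˈɔɜ lˈoɪɜ zˈaː7 , tɕˈuɜŋ t̪ˈaː1 wˈaː1 ɗˈɔɜ sˈɛ1m ʂˈaːʊ7 ."
result = postprocess(raw)
print(result)
```

Under these assumptions the digits disappear in the filtering step, while the spaces around the comma and the full stop disappear in the collapse step, so the two effects have separate causes.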

@cfrancesco
Contributor

Hi,
the whitespace collapse is intentional, mostly to make it possible to control where pauses are placed with the forward model. You can disable it by removing it from line 91 in data/text/tokenizer.py (return the line above instead). I would discourage that, though, unless you're running into problems.
For the numbers issue, you can add the missing phonemes (for instance 1,2,3,4,5,6,7,8,9,0) to all_phonemes in data/text/symbols.py, like so:
all_phonemes = sorted(list(_phonemes) + list(_punctuations) + list('1234567890'))
I was not aware that some languages had numbers as phonemes.
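A small sketch of why this change should keep the digits: the post-processing filter only retains characters found in the symbol list. The values of `_phonemes` and `_punctuations` below are shortened, hypothetical stand-ins for the real contents of data/text/symbols.py, and `keep_known` is an illustrative helper rather than a function from the project.

```python
# Hypothetical, shortened stand-ins for the contents of data/text/symbols.py.
_phonemes = "aɛɔːˈmsz"
_punctuations = ",. "


def keep_known(text, symbols):
    """Drop every character that is not in the given symbol list."""
    allowed = set(symbols)
    return "".join(c for c in text if c in allowed)


without_digits = sorted(list(_phonemes) + list(_punctuations))
with_digits = sorted(list(_phonemes) + list(_punctuations) + list("1234567890"))

sample = "zˈaː7 , sˈɛ1m"
print(keep_known(sample, without_digits))  # digits are filtered out
print(keep_known(sample, with_digits))     # digits survive the filter
```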

TODO: Add optional extra phonemes string to data_config.yaml
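One possible shape for this TODO, as a sketch only: the key name `extra_phonemes` and the function `build_symbols` are hypothetical, and the config is represented as a plain dict of the kind a YAML loader would return, so nothing here reflects how the project actually implements it.

```python
def build_symbols(base_phonemes, punctuations, config):
    """Merge the base symbols with an optional string of extra phonemes."""
    # A missing or empty `extra_phonemes` key leaves the symbol set unchanged.
    extra = config.get("extra_phonemes") or ""
    return sorted(set(base_phonemes) | set(punctuations) | set(extra))


# Hypothetical parsed data_config.yaml with the optional key set.
config = {"extra_phonemes": "1234567890"}
symbols = build_symbols("aɛɔ", ",. ", config)
print(symbols)
```

Defaulting to an empty string would keep existing configurations working without changes, while languages whose phonemizer output includes digits could opt in.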

cfrancesco self-assigned this May 17, 2021
Author

anh commented May 17, 2021

Thank you for the clarification. Making phonemes configurable would be super helpful. I'll try your suggestion.
