Filler words sometimes cause issues with wordBoundary results #680

Open
dwegscheidTSC opened this issue May 22, 2023 · 1 comment
@dwegscheidTSC

I'm using microsoft-cognitiveservices-speech-sdk@1.28.0 on Node.js.

With some voices (en-US-SaraNeural does this most often for me, but I've seen it with other voices too), filler words like "um" or "uh" cause the wordBoundary event arguments to become garbled for everything after the filler word. When this happens, neither the text nor the timing matches the audio.

Example code

Here's a small example that shows the text becoming inaccurate:

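import { AudioConfig, ResultReason, SpeechConfig, SpeechSynthesisOutputFormat, SpeechSynthesizer } from "microsoft-cognitiveservices-speech-sdk";

// key and region are the Speech resource credentials, defined elsewhere.
const synthesizeSpeech = async (): Promise<string[]> => {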
    const speechConfig = SpeechConfig.fromSubscription(key, region);
    speechConfig.speechSynthesisOutputFormat = SpeechSynthesisOutputFormat.Audio48Khz192KBitRateMonoMp3;

    const synthesizer = new SpeechSynthesizer(speechConfig, null as unknown as AudioConfig);
    const ssml = `
       <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
          <voice name="en-US-SaraNeural">
             Hi my name is uh Sara. I am a voice that gives good audio but incorrect word boundary results. 
          </voice>
       </speak>
    `;

    return new Promise<string[]>((resolve, reject) => {
       const onError = (err: string | Error): void => {
          synthesizer.close();
          reject(err);
       };

       const words: string[] = [];
       synthesizer.wordBoundary = (_, event) => {
          words.push(event.text);
       };
       synthesizer.speakSsmlAsync(ssml, result => {
          if (result.errorDetails) {
             onError(result.errorDetails);
          } else if (result.reason === ResultReason.SynthesizingAudioCompleted) {
             synthesizer.close();
             resolve(words);
          }
       }, onError);
    });
 }

With this code, await synthesizeSpeech() returns:

[
   "Hi",
   "my",
   "name",
   "is",
   "rrec", // everything after where "uh" should be is garbled
   "t",
   "w",
   "rd",
   "b",
   "undar",
   " res",
   "lts. ",
   "    ",
   "     ",
   " </",
   "oice>\n   ",
   "    ",
   "</speak>",
   "",
   ""
]
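
A minimal sketch of how this mismatch could be caught automatically, for example in a regression test. The helper firstGarbledIndex is hypothetical and not part of the Speech SDK. It assumes English input, assumes the service reports each whitespace-separated word of the input in order, and treats punctuation-only or whitespace-only entries as ignorable:

```typescript
// Hypothetical helper (not part of the Speech SDK).
// Compares the texts collected from wordBoundary events against the words of
// the plain input text. Returns the index of the first reported entry that
// does not match the next expected word, or -1 if every entry matches.
// Assumption: words are reported in input order, one event per word.
// Limitation: missing trailing words are not flagged.
function firstGarbledIndex(plainText: string, reported: string[]): number {
  // Lowercase and keep only ASCII letters, digits, and apostrophes.
  const normalize = (s: string): string =>
    s.toLowerCase().replace(/[^a-z0-9']/g, "");

  const expected = plainText
    .split(/\s+/)
    .map(normalize)
    .filter(token => token !== "");

  let next = 0; // position of the next expected word
  for (let i = 0; i < reported.length; i++) {
    const token = normalize(reported[i]);
    if (token === "") {
      continue; // punctuation-only or whitespace-only entry
    }
    if (next >= expected.length || token !== expected[next]) {
      return i; // first entry that disagrees with the input text
    }
    next++;
  }
  return -1;
}

const inputText =
  "Hi my name is uh Sara. I am a voice that gives good audio but incorrect word boundary results.";
const garbledIndex = firstGarbledIndex(inputText, ["Hi", "my", "name", "is", "rrec", "t", "w"]);
console.log(garbledIndex);
```

For the output shown above, the first mismatch is at index 4, where "uh" was expected and "rrec" was reported.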
@yulin-li
Contributor

Thanks for reporting this issue. I will forward it to the service experts for further investigation.
