bugfix/fix-newline-webvtt-sentence-segmentation #193
Hah wow. The only thing I am going to request is to add a test to the WebVTT parser tests with a Boston-generated VTT file.
Codecov Report
@@ Coverage Diff @@
## main #193 +/- ##
=======================================
Coverage 94.60% 94.60%
=======================================
Files 50 50
Lines 2630 2632 +2
=======================================
+ Hits 2488 2490 +2
Misses 142 142
Tested this on my local setup and it seems that the tests fail intermittently, and I'm not sure why. Will look into this in a bit.
For context on why the test case was previously failing: I reduced the file size and only included words that didn't trigger this issue.
Looks good to me! Thanks!
Link to Relevant Issue
This pull request resolves #190
Description of Changes
Realized that the reason Boston's parsing is so weird is that the `.vtt` file contains several `\n` characters. It turns out that for caption segments that have `\n` in them, the regex pattern for finding the end of the sentence fails. I believe it's for a reason related to this.

Fixed the issue by replacing all `\n` characters with spaces. Testing this on the same `.vtt` file bumped the number of sentences from 37 to 195.
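A minimal sketch of the idea, not the project's actual code: the pattern `SENTENCE_END_PATTERN` and the helper `normalize_caption_text` are hypothetical names used only for illustration. One plausible explanation (an assumption here) is that in Python's `re` module `.` does not match `\n` unless `re.DOTALL` is set, so a pattern that relies on `.` cannot span a caption segment containing a line break.

```python
import re

# Hypothetical pattern: "anything, then a period".
# The parser's real sentence-end regex is not shown in this PR.
SENTENCE_END_PATTERN = re.compile(r".*\.")


def normalize_caption_text(text: str) -> str:
    """Replace newline characters with spaces, as described in this PR."""
    return text.replace("\n", " ")


# A caption segment whose text wraps onto a second line.
raw_caption = "the council will now\ncome to order."

# Without re.DOTALL, "." stops at the newline, so the whole segment
# cannot be matched as one sentence.
match_before = SENTENCE_END_PATTERN.fullmatch(raw_caption)

# After normalization the same pattern matches the full segment.
match_after = SENTENCE_END_PATTERN.fullmatch(normalize_caption_text(raw_caption))
```

Replacing the newline with a space, rather than deleting it, keeps the words on either side of the line break separated.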
Here's an example of the first few sentences which are split with more granularity after the fix. (Original):