bugfix/fix-newline-webvtt-sentence-segmentation #193

isaacna
Collaborator

@isaacna isaacna commented Jun 13, 2022

Link to Relevant Issue

This pull request resolves #190

Description of Changes

Realized that Boston's parsing is so weird because the .vtt file contains several `\n` characters. It turns out that for caption segments that have `\n` in them, the regex pattern for finding the end of a sentence fails. I believe it's for a reason related to this.

Fixed the issue by replacing all `\n` characters with spaces. Testing this on the same .vtt file bumped the number of sentences up from 37 to 195.
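
A minimal sketch of why a newline inside a caption can hide the end of a sentence, and how replacing `\n` with a space helps. The pattern `END_OF_SENTENCE` and the helpers `ends_sentence` and `group_into_sentences` are assumptions made for illustration, not the project's actual code. The sketch relies only on the fact that `.` in Python's `re` does not match `\n` unless `re.DOTALL` is set.

```python
import re
from typing import List

# Assumed end-of-sentence pattern (illustrative only): the whole caption must
# be non-newline characters followed by ., ? or ! and optional whitespace.
END_OF_SENTENCE = re.compile(r"^.+[.?!]\s*$")


def ends_sentence(caption_text: str) -> bool:
    """Return True if the caption text looks like it closes a sentence."""
    return END_OF_SENTENCE.match(caption_text) is not None


def group_into_sentences(captions: List[str], normalize: bool) -> List[str]:
    """Join consecutive captions until one of them ends a sentence."""
    sentences: List[str] = []
    current: List[str] = []
    for text in captions:
        if normalize:
            # The fix described above: replace newlines with spaces.
            text = text.replace("\n", " ")
        current.append(text)
        if ends_sentence(text):
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing captions that never matched the pattern.
        sentences.append(" ".join(current))
    return sentences


captions = ["thank\nyou.", "please be respectful.", "a quorum\nis present."]
raw = group_into_sentences(captions, normalize=False)
fixed = group_into_sentences(captions, normalize=True)
print(len(raw), len(fixed))
```

With these sample captions the raw text yields 2 sentences and the normalized text yields 3, which is the same direction as the 37 to 195 change reported above.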

Here's an example of the first few sentences, which are split with more granularity after the fix. Original:

uh uh, i'm the city councilor for district five and i'm the vice president of the boston city council.

Viewers can watch the council meeting live on youtube by visiting boston city council tv. i would like to ask my colleagues and those in the audience to please silence their phones and electronic devices.

Thank you. please also be respectful and do not disrupt the meeting while you are here.

If you disruptive you will be asked to leave and if you fail to comply you will be escorted out. please also note the fact that according city council rules there are no signs allowed in the chamber.

Mr. clark, would you please call the roll to ascertain the presence of a quorum? councilor arroyo present councilor baker, councilor book councilor breadon councilor colletta councilor for ananda sanderson, councilor flaherty councilor flynn councilor of council who is general councilor.

Let me here councilor murphy and councilor arcore. thank you mr clark.

I have been informed by the clerk that a quorum is present.

This week's clergy is father bob carr of st. anthony parish in austin invited by councilor breadon. councilor breadon, would you like to come up to the podium, introduce our clergy for today ?

@isaacna isaacna requested a review from evamaxfield June 13, 2022 23:06
Member

@evamaxfield evamaxfield left a comment

Hah wow. The only thing I'm going to request is adding a test to the WebVTT parser tests with a Boston-generated VTT file.

@codecov
codecov bot commented Jun 13, 2022

Codecov Report

Merging #193 (5232d40) into main (00f9876) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #193   +/-   ##
=======================================
  Coverage   94.60%   94.60%           
=======================================
  Files          50       50           
  Lines        2630     2632    +2     
=======================================
+ Hits         2488     2490    +2     
  Misses        142      142           
| Impacted Files | Coverage Δ |
|---|---|
| ...dp_backend/tests/sr_models/test_webvtt_sr_model.py | 100.00% <ø> (ø) |
| cdp_backend/sr_models/webvtt_sr_model.py | 100.00% <100.00%> (ø) |
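
The "increase coverage by 0.00%" line can be checked by hand from the figures in the report. The sketch below assumes coverage is simply hits divided by lines, which is consistent with the numbers shown; the helper name `coverage_pct` is hypothetical.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


before = coverage_pct(2488, 2630)  # main
after = coverage_pct(2490, 2632)   # this pull request
delta = after - before

# Both values round to 94.60, and the small positive delta rounds to 0.00.
print(f"{before:.2f} {after:.2f} {delta:.2f}")
```

The two added lines are both covered, so coverage rises slightly, but by less than the two decimal places the report displays.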

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@isaacna
Collaborator Author

isaacna commented Jun 14, 2022

Tested this on my local setup, and it seems that the tests fail intermittently. I'm not sure why yet. Will look into this in a bit.

@isaacna
Collaborator Author

isaacna commented Jun 14, 2022

For context on why the test case was previously failing: webvtt uses truecasing for certain words that may be capitalized differently ("rideshare" -> "RIDESHARE"). This is non-deterministic, however, so it was causing inconsistent test results.

Reduced the test file size and only included words that didn't trigger this issue.
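
For comparison, a different way to keep such a test stable would be to compare sentences without regard to case. This is only a sketch of an alternative, not what this pull request does, and the helper name `sentences_match_ignoring_case` is hypothetical.

```python
from typing import List


def sentences_match_ignoring_case(actual: List[str], expected: List[str]) -> bool:
    """Compare two sentence lists while ignoring capitalization differences."""
    return [s.casefold() for s in actual] == [s.casefold() for s in expected]


# Truecasing may or may not upper-case a word; the comparison tolerates both.
print(sentences_match_ignoring_case(["RIDESHARE is popular."], ["rideshare is popular."]))
```

The trade-off is that a case-insensitive check no longer verifies the truecasing output itself, which is one reason trimming the fixture, as done here, can be the simpler choice.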

Member

@evamaxfield evamaxfield left a comment

Looks good to me! Thanks!

@evamaxfield evamaxfield merged commit c8b6b57 into CouncilDataProject:main Jun 14, 2022
Successfully merging this pull request may close these issues.

WebVTT parsing fails for Boston