bugfix/fix-newline-webvtt-sentence-segmentation #193

Merged
8 changes: 8 additions & 0 deletions cdp_backend/sr_models/webvtt_sr_model.py
@@ -164,7 +164,15 @@ def _get_sentences(
        # List of text, representing a sentence
        lines: List[str] = []
        start_time: Optional[float] = None

        for caption in speaker_turn_captions:

            # Clean text of line breaks
            caption.text = caption.text.replace("\n", " ")

            # Remove any double spaces as result of line break removal
            caption.text = caption.text.replace("  ", " ")

            if start_time is None:
                start_time = caption.start_in_seconds
            lines.append(caption.text)
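
The cleanup in this hunk can be tried in isolation. The sketch below is not part of the PR: `clean_caption_text` and `collapse_whitespace` are hypothetical helper names, and the sample string is taken from the `boston_captions.vtt` fixture added in this change. It mirrors the two `str.replace` calls and also shows that one pass of the double-space replacement leaves two spaces behind when three spaces are adjacent, which a `split`/`join` approach would avoid if that case ever matters.

```python
def clean_caption_text(text: str) -> str:
    """Mirror the two replacements from the diff (hypothetical helper)."""
    # Line breaks inside a caption become spaces
    text = text.replace("\n", " ")
    # A single pass that turns each pair of spaces into one space
    text = text.replace("  ", " ")
    return text


def collapse_whitespace(text: str) -> str:
    """Assumed alternative, not the PR's method: collapse any whitespace run."""
    return " ".join(text.split())


# Caption text as it appears in the test fixture, with embedded line breaks
sample = "district five and I'm the vice\nPRESIDENT Of the boston city\ncouncil."
cleaned = clean_caption_text(sample)
print(cleaned)

# Three adjacent spaces are only reduced to two by a single replace pass
print(repr(clean_caption_text("a   b")))
print(repr(collapse_whitespace("a   b")))
```

For the fixture text both helpers give the same result, since each line break is replaced by exactly one space.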
36 changes: 36 additions & 0 deletions cdp_backend/tests/resources/boston_captions.vtt
@@ -0,0 +1,36 @@
WEBVTT

00:00:00.000 --> 00:00:00.000


00:00:40.337 --> 00:00:42.272
uh uh,
I'm the city councilor for

00:00:44.975 --> 00:00:46.909
district five and I'm the vice
PRESIDENT Of the boston city
council.

00:00:47.578 --> 00:00:49.445
Viewers can watch the
council meeting live on youtube

00:00:51.415 --> 00:00:53.783
by visiting boston city
council tv.
I would like to ask my

00:00:54.985 --> 00:00:57.653
colleagues and those in the
audience to please silence
their phones and electronic
devices.

00:00:58.522 --> 00:01:00.590
Thank you.
Please also be respectful and
do not disrupt the meeting

00:01:00.590 --> 00:01:02.590
while you are here.