bugfix/fix-newline-webvtt-sentence-segmentation (#193)
* Clean newline characters which break sentence segmentation

* Fix lint errors

* Add boston transcript test case

* Reduce boston test file to limit impact of webvtt Truecasing
isaacna committed Jun 14, 2022
1 parent 00f9876 commit c8b6b57
Showing 4 changed files with 580 additions and 0 deletions.
8 changes: 8 additions & 0 deletions cdp_backend/sr_models/webvtt_sr_model.py
@@ -164,7 +164,15 @@ def _get_sentences(
        # List of text, representing a sentence
        lines: List[str] = []
        start_time: Optional[float] = None

        for caption in speaker_turn_captions:

            # Clean text of line breaks
            caption.text = caption.text.replace("\n", " ")

            # Remove any double spaces as result of line break removal
            caption.text = caption.text.replace("  ", " ")

            if start_time is None:
                start_time = caption.start_in_seconds
            lines.append(caption.text)
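
The cleaning step in the hunk above can be sketched as a standalone helper. This is a minimal illustration, not code from the repository: `clean_caption_text` and `normalize_whitespace` are hypothetical names, and the sample string is one cue payload copied from the boston test file. The second helper is an assumed alternative that collapses any run of whitespace, since a single pass of `replace("  ", " ")` can leave a double space behind when three or more spaces occur in a row.

```python
def clean_caption_text(text: str) -> str:
    """Mirror the two steps in the diff (hypothetical helper name)."""
    # Step 1: turn line breaks inside a cue into spaces.
    cleaned = text.replace("\n", " ")
    # Step 2: collapse double spaces left behind by step 1.
    cleaned = cleaned.replace("  ", " ")
    return cleaned


def normalize_whitespace(text: str) -> str:
    """Assumed alternative: collapse any whitespace run to one space."""
    # str.split() with no arguments splits on runs of whitespace,
    # including newlines, and drops leading/trailing whitespace.
    return " ".join(text.split())


# One cue payload from boston_captions.vtt, with its embedded line breaks.
sample = "district five and I'm the vice\nPRESIDENT Of the boston city\ncouncil."

print(clean_caption_text(sample))
print(normalize_whitespace(sample))
```

Both helpers give the same result for this cue. They differ only on inputs with longer whitespace runs or leading/trailing whitespace, which `normalize_whitespace` also strips.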
36 changes: 36 additions & 0 deletions cdp_backend/tests/resources/boston_captions.vtt
@@ -0,0 +1,36 @@
WEBVTT
00:00:00.000 --> 00:00:00.000


00:00:40.337 --> 00:00:42.272
uh uh,
I'm the city councilor for

00:00:44.975 --> 00:00:46.909
district five and I'm the vice
PRESIDENT Of the boston city
council.

00:00:47.578 --> 00:00:49.445
Viewers can watch the
council meeting live on youtube

00:00:51.415 --> 00:00:53.783
by visiting boston city
council tv.
I would like to ask my

00:00:54.985 --> 00:00:57.653
colleagues and those in the
audience to please silence
their phones and electronic
devices.

00:00:58.522 --> 00:01:00.590
Thank you.
Please also be respectful and
do not disrupt the meeting

00:01:00.590 --> 00:01:02.590
while you are here.
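
To illustrate why this fixture exercises the fix, here is a rough sketch that joins three of the cue payloads above with and without the cleaning step. It is not the test added in this commit and does not use the project's WebVTT parsing: `CUES` and `join_cues` are hypothetical names, and the cue text is hard-coded from the file for the sake of a self-contained example.

```python
from typing import List

# Payload text of three consecutive cues copied from boston_captions.vtt.
CUES: List[str] = [
    "by visiting boston city\ncouncil tv.\nI would like to ask my",
    "colleagues and those in the\naudience to please silence\n"
    "their phones and electronic\ndevices.",
    "Thank you.\nPlease also be respectful and\ndo not disrupt the meeting",
]


def join_cues(cue_texts: List[str], clean: bool) -> str:
    """Join cue payloads into one string, optionally cleaning each first."""
    parts: List[str] = []
    for text in cue_texts:
        if clean:
            # Same two replacements as the change to _get_sentences.
            text = text.replace("\n", " ").replace("  ", " ")
        parts.append(text)
    return " ".join(parts)


raw = join_cues(CUES, clean=False)
cleaned = join_cues(CUES, clean=True)

# The raw text still carries line breaks in the middle of sentences;
# the cleaned text is a single line that a sentence splitter can consume.
print("\n" in raw, "\n" in cleaned)
print(cleaned)
```

Note that one sentence ("I would like to ask my colleagues ... devices.") spans two cues and several line breaks, which is the situation the commit message describes.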
