feature/speaker-classifier-apply-function #197
Conversation
```python
transcript: Union[str, Path, Transcript],
audio: Union[str, Path, AudioSegment],
model: str = DEFAULT_MODEL,
min_intra_sentence_chunk_duration: float = 0.5,
```
Do you think 0.5 is enough for a minimum? For certain things like roundtable one-word answers like 'yes/no', I could see them taking slightly less than half a second.
I think this comes down to one of those questions of... "is it valuable to the downstream analysis?"
Anything less than 0.5 seconds seems non-valuable to tag, to me.
Though I am trying to find a better justification for this. Part of the reason is that I only trained on data that ranged from 0.5 to 2 seconds, and that was because anything less than 0.5 seconds worried me in terms of "how much data to predict with": the smaller the clip, the less information there is to use.
```python
model: str = DEFAULT_MODEL,
min_intra_sentence_chunk_duration: float = 0.5,
max_intra_sentence_chunk_duration: float = 2.0,
min_sentence_mean_confidence: float = 0.985,
```
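For context, a minimal sketch of what the call site for these defaults might look like. The function name `apply` and the import path are inferred from the PR title ("speaker-classifier-apply-function"), not confirmed here; only the parameters and their defaults come from the diff above.

```python
from pathlib import Path

# Hypothetical call site: the function name `apply` is inferred from the
# PR title; the parameters and defaults are taken from the diff above.
from speakerbox import apply

annotated = apply(
    transcript=Path("transcript.json"),        # str, Path, or Transcript
    audio=Path("meeting-audio.wav"),           # str, Path, or AudioSegment
    model="path/to/trained-speakerbox-model",  # hypothetical model path
    min_intra_sentence_chunk_duration=0.5,     # skip chunks under 0.5 s
    max_intra_sentence_chunk_duration=2.0,     # match the 2 s training chunks
    min_sentence_mean_confidence=0.985,        # drop low-confidence sentences
)
```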
Is this based on the confidence you've been seeing with the existing model you trained?
Hah. Great question, and one that I am struggling with. Right now, the 0.985 is based on "I tried a bunch of different thresholds and this one seemed good", but I would love to know if there is a formula from confidence -> p-value??
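For what it's worth, a rough sketch (not the PR's actual implementation) of what a mean-confidence gate like `min_sentence_mean_confidence` does: average the classifier's per-chunk softmax confidences over a sentence and keep the predicted speaker only if the mean clears the threshold.

```python
from statistics import fmean

def mean_confidence_gate(
    chunk_confidences: list[float],
    threshold: float = 0.985,
) -> bool:
    """Keep a sentence's predicted speaker only if the mean of its
    per-chunk softmax confidences clears the threshold."""
    return fmean(chunk_confidences) >= threshold

# e.g. three ~2-second chunks from one sentence
print(mean_confidence_gate([0.99, 0.98, 0.991]))  # True  (mean ~0.987)
print(mean_confidence_gate([0.99, 0.90, 0.95]))   # False (mean ~0.947)
```

On the confidence -> p-value question: raw softmax confidences are generally not calibrated probabilities, so there is no direct formula without a calibration step (e.g. temperature scaling, or an empirical reliability curve on held-out data).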
Looks good to me! Just had a few questions about some of the default values we picked for sentence length and confidence.
Looks good to me. Just have a couple of questions.
```
The maximum duration for a sentence's audio to split to. This should match
whatever was used during model training
(i.e. trained on 2 second audio chunks, apply on 2 second audio chunks).
Default: 2 seconds
```
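As a rough illustration (again, not the PR's actual code) of the windowing that `min_intra_sentence_chunk_duration` and `max_intra_sentence_chunk_duration` together describe:

```python
def split_sentence_audio(
    start: float,
    end: float,
    min_chunk: float = 0.5,
    max_chunk: float = 2.0,
) -> list[tuple[float, float]]:
    """Split a sentence's [start, end] audio span into chunks of at most
    `max_chunk` seconds, dropping any trailing chunk shorter than `min_chunk`."""
    chunks = []
    t = start
    while t < end:
        chunk_end = min(t + max_chunk, end)
        if chunk_end - t >= min_chunk:
            chunks.append((t, chunk_end))
        t = chunk_end
    return chunks

# A 4.7 s sentence -> two 2 s chunks plus a 0.7 s remainder (kept, >= 0.5 s)
print(split_sentence_audio(10.0, 14.7))
# [(10.0, 12.0), (12.0, 14.0), (14.0, 14.7)]
```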
What's the reasoning behind this?
Originally I couldn't fit more than 2 seconds of audio into GPU memory during training.
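For a sense of scale, a back-of-the-envelope on why chunk duration drives GPU memory. The 16 kHz sample rate and batch size here are assumptions for illustration, not values from the PR:

```python
# Raw input size for one training batch, assuming a 16 kHz waveform model
# (assumption -- the PR does not state the sample rate).
sample_rate = 16_000   # samples per second (assumed)
chunk_seconds = 2.0    # max_intra_sentence_chunk_duration
batch_size = 32        # hypothetical

samples_per_chunk = int(sample_rate * chunk_seconds)  # 32,000 samples
floats_per_batch = samples_per_chunk * batch_size     # 1,024,000 float32s
print(f"{floats_per_batch * 4 / 1e6:.1f} MB of raw input per batch")  # ~4.1 MB
```

The raw input grows linearly with chunk duration, but for transformer-based audio models the self-attention activations grow roughly quadratically with sequence length, so longer chunks cost considerably more than the raw-input numbers suggest.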
Codecov Report
```
@@            Coverage Diff             @@
##             main     #197      +/-   ##
==========================================
- Coverage   93.38%   91.24%   -2.15%
==========================================
  Files          51       53       +2
  Lines        2677     2740      +63
==========================================
  Hits         2500     2500
- Misses        177      240      +63
```
Continue to review full report at Codecov.
Link to Relevant Issue
WIP #131
Description of Changes
Adds the function to annotate a transcript with a trained speakerbox model!
I am already using this, applying the highest-accuracy model for Seattle to all data from 2021-01-01 to 2022-01-01: