Carrying on from #5730, we'll need to decide how to put transcript documents into Solr to meet our needs for query matching. In particular, the documents should be broken up into chunks that can be returned in IIIF Content Search API responses, which will then be handled by the transcript search component in Ramp for display and navigation.
Transcript types and chunking strategy
VTT and SRT: Time and text for each cue
Word and Plain Text: Paragraphs
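As a starting point for discussion, the chunking strategy above could be sketched as follows. This is a minimal illustration, not the implementation for this ticket: the function names (`chunk_cues`, `chunk_paragraphs`) and the chunk fields (`start`, `end`, `text`) are hypothetical, and Word documents are assumed to have already been converted to plain text by some other step.

```python
import re

# Matches a cue timing line in WebVTT ("00:00:01.000 --> 00:00:04.000")
# or SRT ("00:00:01,000 --> 00:00:04,000"). The hours part is optional.
TIMING = re.compile(
    r"^((?:\d+:)?\d{2}:\d{2}[.,]\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}[.,]\d{3})"
)


def to_seconds(stamp):
    """Convert 'HH:MM:SS.mmm', 'MM:SS.mmm' or the SRT comma form to seconds."""
    seconds = 0.0
    for part in stamp.replace(",", ".").split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def split_blocks(text):
    """Split text on blank lines after normalizing Windows line endings."""
    return re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip())


def chunk_cues(text):
    """Return one chunk per VTT/SRT cue: start time, end time, and cue text."""
    chunks = []
    for block in split_blocks(text):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if not match:
                continue  # header, cue identifier, or SRT sequence number
            body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
            if body:
                chunks.append({
                    "start": to_seconds(match.group(1)),
                    "end": to_seconds(match.group(2)),
                    "text": body,
                })
            break
    return chunks


def chunk_paragraphs(text):
    """Return one chunk per blank-line-separated paragraph."""
    return [{"text": " ".join(p.split())} for p in split_blocks(text) if p.strip()]
```

Each chunk could then become its own Solr document (or child document), so that a query match maps back to a single cue or paragraph.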
Done Looks Like
Example documents added to this ticket for the 4 main transcript types
Valid transcripts (types listed above) are indexed into Solr in a way that makes it possible to query them and return the matches as IIIF Content Search API responses
Figure out parsing on the documents for indexing
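To make the target output concrete, here is a rough sketch of how matched chunks might be wrapped in a response. It is loosely modeled on the AnnotationPage shape from the IIIF Content Search API 2.0, but the choice of API version, the `supplementing` motivation, the `#hit-N` annotation ids, and the chunk fields (`start`, `end`, `text`) are all assumptions that would need to be checked against the spec and against what Ramp expects. The function name `to_annotation_page` and the example URIs are hypothetical.

```python
def to_annotation_page(search_uri, canvas_uri, chunks):
    """Wrap matched transcript chunks in an AnnotationPage-style dict.

    Timed chunks (VTT/SRT cues) target a temporal media fragment of the
    canvas; untimed chunks (paragraphs) target the canvas as a whole.
    """
    items = []
    for index, chunk in enumerate(chunks):
        target = canvas_uri
        if "start" in chunk and "end" in chunk:
            # Media fragment syntax: #t=<start seconds>,<end seconds>
            target = f"{canvas_uri}#t={chunk['start']},{chunk['end']}"
        items.append({
            "id": f"{search_uri}#hit-{index}",  # hypothetical id scheme
            "type": "Annotation",
            "motivation": "supplementing",      # assumption, see lead-in
            "body": {
                "type": "TextualBody",
                "value": chunk["text"],
                "format": "text/plain",
            },
            "target": target,
        })
    return {
        "@context": "http://iiif.io/api/search/2/context.json",
        "id": search_uri,
        "type": "AnnotationPage",
        "items": items,
    }


# Example with made-up URIs and one timed chunk:
page = to_annotation_page(
    "https://example.org/search?q=hello",
    "https://example.org/canvas/1",
    [{"start": 1.0, "end": 4.5, "text": "Hello there."}],
)
```

Whatever shape is finally chosen, the point is that each indexed chunk needs to carry enough information (text, and timing where available) to populate one annotation.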
Here are the test files I created and have been using for the initial work on this issue (ZenHub did not like the VTT or SRT extensions, so I uploaded them as .txt files). VTT and SRT use approximately the first minute of the Lunchroom Manners captions. Plain Text and Docx use a chunk of the first chapter of the spec/fixtures/public-domain-book.txt file: