Chunking for Transcript Documents in Solr #5812

joncameron · 2024-05-01T18:32:48Z

Description

Carrying on from #5730, we'll need to decide how to put transcript documents into Solr to meet our needs for query matching. In particular, the documents should be broken up into chunks that can be returned in IIIF Content Search API responses, which will then be handled by the transcript search component in Ramp for display and navigation.

Transcript types and chunking strategy

VTT and SRT: Time and text for each cue
Word and Plain Text: Paragraphs
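
As a rough illustration of the two strategies above, here is a minimal Python sketch (not the project's implementation). The names Chunk, chunk_cues, and chunk_paragraphs are hypothetical. It assumes cues and paragraphs are separated by blank lines, and that Word documents have already been converted to plain text by some other step.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """One indexable unit of a transcript (hypothetical structure)."""
    text: str
    start: Optional[float] = None  # seconds; None for untimed text
    end: Optional[float] = None


# Matches WebVTT ("00:00:01.000") and SRT ("00:00:01,000") timing lines.
# The hours field is optional because WebVTT allows "mm:ss.ttt".
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})\s*-->\s*"
    r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})"
)


def _seconds(h, m, s, ms):
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def _blocks(source: str) -> List[str]:
    # Normalize line endings, then split on blank lines.
    return re.split(r"\n\s*\n", source.replace("\r\n", "\n").strip())


def chunk_cues(source: str) -> List[Chunk]:
    """VTT/SRT: one chunk per cue, carrying its start and end time."""
    chunks = []
    for block in _blocks(source):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the cue text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    g = match.groups()
                    chunks.append(Chunk(text, _seconds(*g[:4]), _seconds(*g[4:])))
                break  # blocks without a timing line (header, notes) are skipped
    return chunks


def chunk_paragraphs(source: str) -> List[Chunk]:
    """Plain text: one untimed chunk per blank-line-separated paragraph."""
    return [Chunk(" ".join(p.split())) for p in _blocks(source) if p.strip()]
```

Keeping the start and end times on each cue chunk is what would later allow a search hit to point at a specific moment in the media; paragraph chunks have no timing, so a hit can only point at the text itself.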

Done Looks Like

Example documents added to this ticket for the 4 main transcript types
Valid transcripts (types listed above) are indexed into Solr in a way that makes it possible to query them and return the matches in IIIF Content Search API responses
Figure out parsing on the documents for indexing
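
To make the target of the second item more concrete, here is a hedged Python sketch of how one matched chunk might be turned into an annotation. It assumes the version 2.0 response shape of the IIIF Content Search API (an AnnotationPage with items) and a media-fragment "#t=start,end" target for timed cues. The function names, the example URLs, and the "supplementing" motivation are all assumptions for illustration, not decisions made in this ticket.

```python
from typing import Iterable, Optional


def _fmt(seconds: float) -> str:
    # Millisecond precision without trailing zeros, e.g. 3.5 -> "3.5", 1.0 -> "1".
    return f"{seconds:.3f}".rstrip("0").rstrip(".")


def to_annotation(hit_id: str, canvas_id: str, text: str,
                  start: Optional[float] = None,
                  end: Optional[float] = None) -> dict:
    """Build one annotation for a matched chunk (hypothetical helper)."""
    target = canvas_id
    if start is not None:
        # Timed cue: point at the time range on the canvas.
        target = f"{canvas_id}#t={_fmt(start)}"
        if end is not None:
            target += f",{_fmt(end)}"
    return {
        "id": hit_id,
        "type": "Annotation",
        "motivation": "supplementing",  # assumption; not specified in this ticket
        "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
        "target": target,
    }


def to_annotation_page(search_url: str, annotations: Iterable[dict]) -> dict:
    """Wrap annotations in a Content Search 2.0 style AnnotationPage."""
    return {
        "@context": "http://iiif.io/api/search/2/context.json",
        "id": search_url,
        "type": "AnnotationPage",
        "items": list(annotations),
    }
```

For example, to_annotation("https://example.org/hits/1", "https://example.org/canvas/1", "Hello world", 1.0, 3.5) produces a target of "https://example.org/canvas/1#t=1,3.5", while an untimed paragraph chunk targets the canvas as a whole.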

masaball · 2024-05-08T19:07:06Z

Here are the test files I created and have been using for the initial work on this issue (ZenHub did not accept the .vtt or .srt extensions, so I uploaded them as .txt files). The VTT and SRT files use approximately the first minute of the Lunchroom Manners captions. The Plain Text and Docx files use a chunk of the first chapter of the spec/fixtures/public-domain-book.txt file: