Chunking for Transcript Documents in Solr #5812

Open
3 tasks done
joncameron opened this issue May 1, 2024 · 1 comment

joncameron commented May 1, 2024

Description

Carrying on from #5730, we'll need to decide how to put transcript documents into Solr to meet our needs for query matching. In particular, the documents should be broken up into chunks that can be returned in IIIF Content Search API responses, which will then be handled by the transcript search component in Ramp for display and navigation.

Transcript types and chunking strategy

VTT and SRT: one chunk per cue (time and text)
Word and Plain Text: one chunk per paragraph
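
To make the chunking strategy concrete, here is a minimal sketch, written in Python for illustration only and not taken from the project code. It assumes cues and paragraphs are separated by blank lines, and that Word documents have already been reduced to plain paragraph text by some other step. The names `Chunk`, `chunk_cues`, and `chunk_paragraphs` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches a cue timing line in either format:
#   VTT: 00:00:01.000 --> 00:00:04.000 (hours optional)
#   SRT: 00:00:01,000 --> 00:00:04,000
TIMING = re.compile(
    r"((?:\d+:)?\d{2}:\d{2}[.,]\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}[.,]\d{3})"
)


@dataclass
class Chunk:
    """Hypothetical container for one indexable piece of a transcript."""
    text: str
    start: Optional[float] = None  # seconds; None for untimed paragraphs
    end: Optional[float] = None


def to_seconds(stamp: str) -> float:
    """Convert 'HH:MM:SS.mmm', 'MM:SS.mmm', or the SRT comma form to seconds."""
    seconds = 0.0
    for part in stamp.replace(",", ".").split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def split_blocks(source: str) -> List[str]:
    """Split text on blank lines after normalizing Windows line endings."""
    return re.split(r"\n\s*\n", source.replace("\r\n", "\n").strip())


def chunk_cues(source: str) -> List[Chunk]:
    """One chunk per VTT/SRT cue: the timing plus the cue text."""
    chunks = []
    for block in split_blocks(source):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                # Everything after the timing line is the cue text.
                text = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                if text:
                    chunks.append(
                        Chunk(text, to_seconds(match.group(1)), to_seconds(match.group(2)))
                    )
                break
        # Blocks without a timing line (WEBVTT header, NOTE blocks) are skipped.
    return chunks


def chunk_paragraphs(source: str) -> List[Chunk]:
    """One chunk per paragraph for plain text (or text extracted from Word)."""
    return [Chunk(" ".join(p.split())) for p in split_blocks(source) if p.strip()]
```

Keeping timed and untimed chunks in the same structure means the indexing step does not need to know which transcript type a chunk came from.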

Done Looks Like

  • Example documents added to this ticket for the 4 main transcript types
  • Valid transcripts (types listed above) are indexed into Solr in a way that makes it possible to query and return in IIIF Content Search format responses
  • Figure out parsing on the documents for indexing
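
As a rough illustration of the second item, the sketch below (Python, for illustration only) turns chunks into one Solr document each and wraps matching documents in a response loosely modelled on the IIIF Content Search 2.0 `AnnotationPage` shape. The Solr field names, the annotation `id` pattern, and the `supplementing` motivation are all assumptions, not the project's schema, and the response structure should be checked against the specification and against what Ramp expects.

```python
from typing import Dict, List


def solr_docs(transcript_id: str, chunks: List[Dict]) -> List[Dict]:
    """Build one Solr document per chunk. All field names are hypothetical."""
    docs = []
    for position, chunk in enumerate(chunks):
        doc = {
            "id": f"{transcript_id}-{position}",
            "transcript_id": transcript_id,
            "position": position,
            "text": chunk["text"],
        }
        # Only timed chunks (VTT/SRT cues) carry start and end times.
        if chunk.get("start") is not None:
            doc["start"] = chunk["start"]
            doc["end"] = chunk["end"]
        docs.append(doc)
    return docs


def annotation_page(search_uri: str, canvas_uri: str, hits: List[Dict]) -> Dict:
    """Wrap matching Solr documents as annotations in a search response."""
    items = []
    for doc in hits:
        target = canvas_uri
        if "start" in doc:
            # Temporal media fragment: #t=start,end in seconds.
            target = f"{canvas_uri}#t={doc['start']},{doc['end']}"
        items.append(
            {
                "id": f"{canvas_uri}/search/{doc['id']}",  # hypothetical id pattern
                "type": "Annotation",
                "motivation": "supplementing",  # assumption
                "body": {
                    "type": "TextualBody",
                    "value": doc["text"],
                    "format": "text/plain",
                },
                "target": target,
            }
        )
    return {
        "@context": "http://iiif.io/api/search/2/context.json",
        "id": search_uri,
        "type": "AnnotationPage",
        "items": items,
    }
```

Paragraph chunks from Word and plain text transcripts have no timing, so in this sketch they target the whole canvas rather than a time range.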

masaball commented May 8, 2024

Here are the test files I created and have been using for the initial work on this issue. (Zenhub did not accept the VTT or SRT extensions, so I uploaded those two as .txt files.) The VTT and SRT files use approximately the first minute of the Lunchroom Manners captions. The plain text and Docx files use a chunk of the first chapter of the spec/fixtures/public-domain-book.txt file:

chunk_test.docx

chunk_test.txt

vtt_chunk_test.txt

srt_chunk_test.txt
