Is the index in Quickwit language independent? #1388

Answered by fmassot
HeenaBansal2009 asked this question in Q&A

Currently, you can specify 2 tokenizers:

  • the raw tokenizer, which does nothing;
  • the default tokenizer, which splits on whitespace and punctuation (everything that is not alphanumeric), removes long tokens (> 40 bytes), and lowercases each token (see the sketch after this list).
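
Here is a minimal sketch of that pipeline, built from tantivy's tokenizer building blocks (assuming a tantivy version that exposes TextAnalyzer::from, RemoveLongFilter, and LowerCaser; exact API names have changed across releases, and this is not Quickwit's configuration interface itself):

use tantivy::tokenizer::{LowerCaser, RemoveLongFilter, SimpleTokenizer, TextAnalyzer, TokenStream};

fn main() {
    // Roughly what the "default" tokenizer does: split on non-alphanumeric
    // characters, drop tokens longer than 40 bytes, lowercase the rest.
    let analyzer = TextAnalyzer::from(SimpleTokenizer)
        .filter(RemoveLongFilter::limit(40))
        .filter(LowerCaser);

    let mut stream = analyzer.token_stream("Hello, Quickwit!");
    while let Some(token) = stream.next() {
        println!("{}", token.text); // prints "hello" then "quickwit"
    }
}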

Here is the relevant code of the SimpleTokenizer that is used:

impl<'a> SimpleTokenStream<'a> {
    // Search for the end of the current token: advance the character iterator
    // until the first non-alphanumeric character and return its byte offset;
    // if none is found, the token runs to the end of the text.
    fn search_token_end(&mut self) -> usize {
        (&mut self.chars)
            .filter(|&(_, ref c)| !c.is_alphanumeric())
            .map(|(offset, _)| offset)
            .next()
            .unwrap_or(self.text.len())
    }
}
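
For intuition about language independence, here is a plain-std sketch of the same splitting rule (the tokens helper below is hypothetical, not part of tantivy or Quickwit). char::is_alphanumeric is Unicode-aware, so any script whose words are separated by whitespace or punctuation is split the same way:

// Standalone illustration of the rule above: a token ends at the first
// character that is not alphanumeric (Unicode-aware, not ASCII-only).
fn tokens(text: &str) -> Vec<&str> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    println!("{:?}", tokens("Hello, Quickwit!")); // ["Hello", "Quickwit"]
    println!("{:?}", tokens("café au lait"));     // ["café", "au", "lait"]
}

Scripts written without separators (Chinese or Japanese, for example) come out as one long token, so the split is script-agnostic but not linguistically aware.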

In tantivy, you have access to more tokenizers: ngram, stemming in Latin languages, thir…
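
As an illustration of the tantivy side, a minimal sketch of a stemming analyzer (again assuming a tantivy version that exposes Stemmer::new(Language::…); names vary across releases):

use tantivy::tokenizer::{Language, LowerCaser, SimpleTokenizer, Stemmer, TextAnalyzer, TokenStream};

fn main() {
    // Lowercase each token, then apply the French stemmer to it.
    let analyzer = TextAnalyzer::from(SimpleTokenizer)
        .filter(LowerCaser)
        .filter(Stemmer::new(Language::French));

    let mut stream = analyzer.token_stream("Les chanteuses chantaient");
    while let Some(token) = stream.next() {
        println!("{}", token.text);
    }
}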
