Enhance Japanese language detection #3569
Conversation
Hey @dureuill,
Hello @ManyTheFish, thanks for thinking of me to review your PR! Before I dive into the code, a question: did we consider leaving some documents as "language-ambiguous" and tokenizing them as many times as necessary, once in each of the candidate languages? Similarly, for search queries that could match several languages, tokenizing them in all of these languages and returning results for all the resulting queries? Such a change could require structural changes and so be a large undertaking. Still, if it has been discussed somewhere, I'd love to be pointed to that discussion.

Meanwhile, the change proposed here looks like a nice heuristic to prevent the misdetection of languages using frequency. I'm not 100% sure what the scope of this sentence is: can you maybe rephrase it a bit?
Hello @dureuill, I added a section in Technical Approach detailing the indexing process and giving an example at the end. Let me know if you need more! 😄
Thank you Many, it is indeed clearer with the added explanation!
I have a couple of suggestions and questions.
I'm OK with the approach :-)
milli/src/update/index_documents/extract/extract_docid_word_positions.rs (six resolved review threads)
Approving. There remain unresolved inline comments above that you can take or leave :-).
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
@dureuill, I added some small documentation for the buffers, don't hesitate to take a look.
perfect thank you ❤️
bors merge
Pull Request
This PR is a prototype and can be tested by downloading the dedicated docker image:
Context
Some languages are harder to detect than others, and this misdetection leads to bad tokenization, making some words or even whole documents unsearchable. Japanese is the main language affected: it can be detected as Chinese, which is tokenized in a completely different way.
A first iteration was implemented for v1.1.0, but it is not enough to make Japanese work well. That first implementation detected the language during indexing in order to avoid bad detections during search.
Unfortunately, some (shorter) documents can still be wrongly detected as Chinese, which produces bad tokenization for those documents and, because Chinese has then been detected during indexing, makes it possible for Chinese to be detected during search as well.
For instance, a Japanese document
{"id": 1, "name": "東京スカパラダイスオーケストラ"}
is detected as Japanese during indexing. During search, the query 東京 will then be detected as Japanese, because only Japanese documents were detected during indexing, despite the fact that v1.0.2 would have detected it as Chinese. However, if the dataset contains at least one document with a field made only of Kanji, like the two cases below:
A document with only one field, containing only Kanji:
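For example, presumably something like this (a hypothetical reconstruction, not the exact document from the original PR; the id is chosen to match the document 4 mentioned in the Limits section below):
{"id": 4, "name": "東京"}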
A document with one field containing only Kanji and one field containing several kinds of Japanese characters:
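For example, again a hypothetical reconstruction (the id is chosen to match the document 105 mentioned in the Limits section below):
{"id": 105, "name": "東京", "desc": "東京スカパラダイスオーケストラ"}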
Then, in both cases, the field name will be detected as Chinese during indexing, allowing search to detect Chinese in queries. As a result, the query 東京 will be detected as Chinese, and only the last two documents will be retrieved by Meilisearch.
Technical Approach
The current PR partially fixes these issues with a frequency-based heuristic: the languages detected in documents are recorded during indexing, and languages that are only rarely detected across the dataset are treated as misdetections and excluded when detecting the language of a search query. A sketch of the idea follows.
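Below is a minimal sketch of such a frequency heuristic, under stated assumptions: the `Lang` enum, the function names, and the 5% threshold wiring are illustrative only and do not reflect the actual milli implementation (which relies on charabia's language types).

```rust
use std::collections::HashMap;

// Hypothetical simplified language tag, for illustration only.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Lang {
    Cmn, // Chinese
    Jpn, // Japanese
}

/// Count how often each language was detected across the indexed documents.
fn detection_counts(detections: &[Lang]) -> HashMap<Lang, usize> {
    let mut counts = HashMap::new();
    for lang in detections {
        *counts.entry(*lang).or_insert(0usize) += 1;
    }
    counts
}

/// Keep only the languages detected in at least `threshold` of the documents
/// (e.g. 0.05 for 5%); rarer detections are treated as misdetections and
/// ignored when detecting the language of a search query.
fn frequent_languages(
    counts: &HashMap<Lang, usize>,
    total_docs: usize,
    threshold: f64,
) -> Vec<Lang> {
    counts
        .iter()
        .filter(|&(_, &count)| count as f64 / total_docs as f64 >= threshold)
        .map(|(&lang, _)| lang)
        .collect()
}

fn main() {
    // 99 documents detected as Japanese, 1 misdetected as Chinese.
    let mut detections = vec![Lang::Jpn; 99];
    detections.push(Lang::Cmn);

    let counts = detection_counts(&detections);
    let allowed = frequent_languages(&counts, detections.len(), 0.05);

    // Chinese falls below the 5% threshold, so only Japanese remains
    // as a candidate language at search time.
    assert_eq!(allowed, vec![Lang::Jpn]);
    println!("allowed languages: {:?}", allowed);
}
```

With a threshold like 5%, a handful of misdetected Chinese fields in an otherwise Japanese dataset no longer makes Chinese a candidate language at search time.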
Limits
This PR introduces 2 arbitrary thresholds:
This PR only partially fixes these issues:
- The query 東京 now finds Japanese documents if less than 5% of documents are detected as Chinese.
- The document 105, containing the Japanese field desc but the misdetected field name, is now completely detected and tokenized as Japanese and is found with the query 東京.
- The document 4 no longer breaks the search language detection, but it continues to be detected as a Chinese document and can't be found during search.
Related issue
Fixes #3565
Possible future enhancements