
Enhance Japanese language detection #3569

Merged
merged 10 commits from enhance-indexing-language-detection into release-v1.1.0 on Mar 9, 2023

Conversation

@ManyTheFish (Member) commented Mar 7, 2023

Pull Request

This PR is a prototype and can be tested by downloading the dedicated docker image:

$ docker pull getmeili/meilisearch:prototype-better-language-detection-0

Context

Some Languages are harder to detect than others, and this misdetection leads to bad tokenization, making some words or even entire documents unsearchable. Japanese is the main Language affected: it can be detected as Chinese, which is tokenized in a completely different way.

A first iteration was implemented for v1.1.0, but it is an insufficient enhancement to make Japanese work. That first implementation detected the Language during the indexing in order to avoid bad detections during the search.
Unfortunately, some documents (the shorter ones) can still be wrongly detected as Chinese, which runs a bad tokenization for these documents and makes it possible for Chinese to be detected during the search, because it was detected during the indexing.

For instance, a Japanese document {"id": 1, "name": "東京スカパラダイスオーケストラ"} is detected as Japanese during indexing; during the search, the query 東京 will then be detected as Japanese, because only Japanese documents were detected during indexing, despite the fact that v1.0.2 would detect it as Chinese.
However, this breaks if the dataset contains at least one document with a field made only of Kanji, like the examples below.

A document with only one field, containing only Kanji:

{
  "id": 4,
  "name": "東京特許許可局"
}

A document with one field containing only Kanji and one field containing longer Japanese text:

{
  "id": 105,
  "name": "東京特許許可局",
  "desc": "日経平均株価は26日に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。"
}

Then, in both cases, the field name will be detected as Chinese during indexing, allowing the search to detect Chinese in queries. Therefore, the query 東京 will be detected as Chinese, and only the last two documents will be retrieved by Meilisearch.

Technical Approach

The current PR partially fixes these issues by:

  1. Adding a check over potential misdetections and rerunning the extraction of the document, forcing the tokenization over the main Languages detected in it (a sketch follows this list):
     1. run a first extraction allowing the tokenizer to detect any Language in any Script
     2. generate a distribution of tokens by Script and Language (script_language)
     3. if, for a Script, the token share of one of the Languages is under the threshold, rerun the extraction forbidding the tokenizer to detect these marginal Languages
     4. the tokenizer then falls back on the other available Languages to tokenize the text. For example, if Chinese was marginally detected compared to Japanese in the CJ Script, the second extraction forces Japanese tokenization for the CJ text in the document; text in another Script, like Latin, is not impacted by this restriction
  2. Adding a filtering threshold during the search over Languages that have been marginally detected in documents
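Here is a minimal sketch of the indexing-side check described in point 1, assuming simplified stand-ins for the tokenizer's Script and Language types. The names Script, Language, INDEXING_THRESHOLD, and allowed_languages are illustrative only, not the identifiers actually used in the PR:

use std::collections::HashMap;

// Hypothetical stand-ins for the tokenizer's Script/Language types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Script { Cj, Latin }

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Language { Cmn, Jpn, Eng }

// Share of tokens under which a Language is considered misdetected
// within its Script (the 10% indexing threshold described in Limits below).
const INDEXING_THRESHOLD: f64 = 0.1;

// Given the (Script, Language) token counts collected during the first
// extraction, return, per Script, the Languages frequent enough to keep.
// If a Script lost at least one Language to the threshold, the caller
// reruns the extraction allowing only the returned Languages.
fn allowed_languages(
    distribution: &HashMap<(Script, Language), usize>,
) -> HashMap<Script, Vec<Language>> {
    // Total number of tokens per Script.
    let mut totals: HashMap<Script, usize> = HashMap::new();
    for ((script, _), count) in distribution {
        *totals.entry(*script).or_default() += *count;
    }

    // Keep only the Languages reaching the threshold in their Script.
    let mut allowed: HashMap<Script, Vec<Language>> = HashMap::new();
    for ((script, language), count) in distribution {
        let share = *count as f64 / totals[script] as f64;
        if share >= INDEXING_THRESHOLD {
            allowed.entry(*script).or_default().push(*language);
        }
    }
    allowed
}

With this shape, a document producing 95 Japanese tokens and 5 Chinese tokens on the CJ Script keeps only Japanese for that Script, so the second extraction forces Japanese tokenization for the CJ text while the Latin text is left untouched.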

Limits

This PR introduces 2 arbitrary thresholds:

  1. during the indexing, a Language is considered misdetected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" Script, "CJK").
  2. during the search, a Language is considered marginal if less than 5% of the documents are detected as this Language.
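As an illustration of the second threshold, here is a hedged sketch of the search-side filter, reusing the hypothetical Language type and HashMap import from the sketch above (searchable_languages and SEARCH_THRESHOLD are again illustrative names, not the PR's actual identifiers):

// Share of documents under which a Language is considered marginal
// at search time (the 5% threshold described above).
const SEARCH_THRESHOLD: f64 = 0.05;

// Keep only the Languages detected in at least 5% of the indexed
// documents, so the query tokenizer cannot pick a Language that was
// only marginally seen during indexing. Assumes total_documents > 0.
fn searchable_languages(
    documents_per_language: &HashMap<Language, usize>,
    total_documents: usize,
) -> Vec<Language> {
    documents_per_language
        .iter()
        .filter(|&(_, &count)| count as f64 / total_documents as f64 >= SEARCH_THRESHOLD)
        .map(|(&language, _)| language)
        .collect()
}

A query like 東京 is then tokenized only with the Languages that pass this filter, which is why Japanese wins as long as fewer than 5% of the documents were detected as Chinese.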

This PR only partially fixes these issues:

  • ✅ the query 東京 now finds Japanese documents if less than 5% of the documents are detected as Chinese.
  • ✅ the document with the id 105, containing the Japanese field desc but the misdetected field name, is now completely detected and tokenized as Japanese and is found with the query 東京.
  • ❌ the document with the id 4 no longer breaks the search's Language detection, but it is still detected as a Chinese document and can't be found during the search.

Related issue

Fixes #3565

Possible future enhancements

@ManyTheFish ManyTheFish changed the base branch from main to release-v1.1.0 March 7, 2023 17:45
github-actions bot commented Mar 7, 2023

Uffizzi Preview deployment-18362 was deleted.

@curquiza curquiza added this to the v1.1.0 milestone Mar 7, 2023
@ManyTheFish ManyTheFish changed the title Enhance indexing language detection Enhance Japanese language detection Mar 8, 2023
@ManyTheFish (Member, Author) commented:

Hey @dureuill,
I'd like your POV on this PR; thank you if you have the time to look at it!

@dureuill (Contributor) commented Mar 9, 2023

Hello @ManyTheFish, thanks for thinking of me to review your PR!

Before I dive into the code, a question: did we consider the option of leaving some documents as "language-ambiguous" and tokenizing them as many times as necessary, once in each of the candidate languages? Similarly, for search queries that could match several languages, tokenizing them in all of these languages and returning results for all the resulting queries?

Such a change could require structural changes and so be a large undertaking. Still, if it has been discussed somewhere, I'd love to be pointed to that discussion.

Meanwhile, the change proposed here appears like a nice heuristic to prevent the misdetection of languages using frequency.

I'm not 100% sure of what the scope of this sentence is:

rerunning the extraction of the document forcing the tokenization over the main Languages detected

Can you maybe rephrase it a bit?

@ManyTheFish (Member, Author) commented:

Hello @dureuill, I added a section in Technical Approach detailing the indexing process and giving an example at the end. Let me know if you need more! 😄

@dureuill (Contributor) commented Mar 9, 2023

Thank you Many, it is indeed clearer with the added explanation!

@dureuill (Contributor) left a review comment:

I have a couple of suggestions and questions.

I'm OK with the approach :-)

Review threads (outdated, resolved) on:
  • meilisearch/src/routes/indexes/mod.rs
  • meilisearch/src/search.rs
  • milli/src/index.rs
@ManyTheFish ManyTheFish marked this pull request as ready for review March 9, 2023 10:30
@dureuill (Contributor) previously approved these changes Mar 9, 2023, leaving a comment:
Approving. There remain unresolved inline comments above that you can take or leave :-).

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
dureuill previously approved these changes Mar 9, 2023
@ManyTheFish (Member, Author) commented:

@dureuill, I added a small piece of documentation for the buffers; don't hesitate to bors merge if everything is good for you!

@dureuill (Contributor) left a review comment:

Perfect, thank you ❤️

@dureuill (Contributor) commented Mar 9, 2023

bors merge

bors bot commented Mar 9, 2023

Build succeeded.

bors bot merged commit fb1260e into release-v1.1.0 Mar 9, 2023
bors bot deleted the enhance-indexing-language-detection branch March 9, 2023 16:23
meili-bot added the v1.1.0 label (PRs/issues solved in v1.1.0, released on 2023-04-03) Apr 6, 2023
miiton mentioned this pull request Jun 6, 2023
Labels: v1.1.0 (PRs/issues solved in v1.1.0, released on 2023-04-03)

Successfully merging this pull request may close the following issue:

  • [v1.1.0-rc.0] Japanese documents cannot be searched properly with Kanji-only documents (#3565)

4 participants