Can distinguish between Simplified Chinese and Japanese Kanji? #122

miiton · 2022-08-26T13:38:37Z

I'm from the Meilisearch community. ( related: meilisearch/meilisearch/issues/2403 )

Would it be possible to distinguish between Simplified Chinese (Mandarin) and Japanese Kanji when a string consists only of Hanzi/Kanji?

For example

whatlang v0.16.0 detects...

Word	Cmn	Jpn	Meaning in English
東京	⭕	❌	"Tokyo" in Kanji
东京	⭕	❌	"Tokyo" in Simplified Chinese
大阪	⭕	❌	"Osaka" in both Kanji and Simplified Chinese
会員	⭕	❌	"member, customer" in Kanji
会员	⭕	❌	"member, customer" in Simplified Chinese
関西国際空港	⭕	❌	"Kansai International Airport" in Kanji
関西国际空港	⭕	❌	"Kansai International Airport" in Simplified Chinese

My expected result is...

Word	Cmn	Jpn	Meaning in English
東京	❌	⭕	"Tokyo" in Kanji
东京	⭕	❌	"Tokyo" in Simplified Chinese
大阪	⭕	⭕	"Osaka" in both Kanji and Simplified Chinese
会員	❌	⭕	"member, customer" in Kanji
会员	⭕	❌	"member, customer" in Simplified Chinese
関西国際空港	❌	⭕	"Kansai International Airport" in Kanji
関西国际空港	⭕	❌	"Kansai International Airport" in Simplified Chinese

References

Hanzi and Kanji: Differences in the Chinese and Japanese Character Sets Today | East Asia Student

greyblake · 2022-08-27T12:21:05Z

@miiton
Hi, it's not the first time this issue has been raised regarding Japanese vs. Chinese (Mandarin).
It was improved to some extent in #45.

I'll be honest with you: I have very little knowledge of the Chinese and Japanese languages, and I would not be able to develop good heuristics to distinguish them.

I will take a look at the link you added to see if it helps.
On your side: if you know Japanese and Chinese, you can contribute by providing a bigger set of examples, which we could use for unit tests, in the following format:

Phrase or sentence
Expected result (language)
Explanation, why this is an expected result (with refs to some rules or heuristics)

Thank you.

miiton · 2022-08-27T12:32:12Z

Thank you for your reply.

I'm not familiar with Chinese at all, but I'll give it some thought.

polm · 2022-08-28T12:40:55Z

As a speaker of Japanese but not Chinese, one thing about the strings here is that in the ones that should be Chinese, there are simplified Chinese characters that are never used in normal Japanese, like 东, 际, or 员, and even not knowing Chinese I can recognize them immediately. Unfortunately it looks like there's no Unicode property like "this is a simplified character" (there's something that looks similar but isn't useful for this purpose).

One simple way to make a list of these would be to check if the characters are present in older encodings like Shift-JIS / EUC-JP (Japanese) or GB 2312 (Simplified Chinese). You could also take a big chunk of each language (like Wikipedia) and make some cutoff for character occurrence.
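The encoding check suggested above can be sketched in Python using the standard-library codecs. This is only an illustration of the idea, not code from whatlang; the helper names `encodable` and `classify` are made up for this example, and the sketch assumes that Python's `shift_jis` and `gb2312` codecs are acceptable stand-ins for the character sets mentioned.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the character can be encoded with the given legacy codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def classify(text: str) -> dict[str, str]:
    """Label each character by which legacy character sets contain it."""
    labels = {}
    for ch in text:
        in_jp = encodable(ch, "shift_jis")  # Japanese (JIS X 0208 repertoire)
        in_cn = encodable(ch, "gb2312")     # Simplified Chinese
        if in_jp and in_cn:
            labels[ch] = "shared"
        elif in_jp:
            labels[ch] = "japanese-only"
        elif in_cn:
            labels[ch] = "simplified-only"
        else:
            labels[ch] = "neither"
    return labels


if __name__ == "__main__":
    for word in ["東京", "东京", "会員", "会员"]:
        print(word, classify(word))
```

A string containing at least one "simplified-only" character and no "japanese-only" character would be a candidate for Chinese, and vice versa. Characters labeled "shared" carry no signal, which is why a word like 大阪 stays ambiguous under this approach.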

miiton · 2022-09-11T03:45:26Z

As @polm wrote, one option is to check whether a character is included in Shift-JIS or EUC-JP.
That works well for accuracy when focusing on Japanese, but when I checked, too many of those characters overlap with Traditional Chinese characters, so the share of Chinese text misidentified as Japanese would increase.

So I focused on the "Joyo kanji" instead, and I think I got a pretty good result, so I'm trying that out.

If it looks fine, I'll create a PR.

miiton · 2022-10-11T07:58:14Z

After various investigations, I have almost given up.
The reason is shown in the image posted at the link below: Chinese hanzi and Japanese kanji overlap too much, so I think handling this properly is unrealistic.

Unless someone else comes up with a very good idea, I think you can close this issue for now.

meilisearch/product#532 (comment)

OuOu2021 · 2023-02-08T15:49:46Z

This is not a big problem, since a slightly longer Japanese text is likely to contain kana, which helps distinguish Japanese from Chinese. Still, it is incorrect to classify kanji that are unmistakably used only in Japanese as Chinese. 😥
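The kana observation can be illustrated with a minimal Python sketch. The function name `contains_kana` is hypothetical, and the check assumes that only the Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) Unicode blocks matter; half-width katakana is ignored here.

```python
def contains_kana(text: str) -> bool:
    """Return True if any character is in the Hiragana or Katakana block."""
    # The two blocks are adjacent, so one range check covers both.
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


if __name__ == "__main__":
    for sample in ["東京スカパラダイスオーケストラ", "東京", "东京"]:
        print(sample, contains_kana(sample))
```

Any string with kana can be treated as Japanese with high confidence. A kanji-only string such as 東京 returns False, which is exactly the case this issue is about.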

ManyTheFish · 2023-03-08T09:25:04Z

@OuOu2021,
it depends on your needs. As a search engine developer, I have to detect the language of a small string like a search query.

3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza

Fixes #3563

Main change
- add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container.

Small additional changes
- remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...)
- Remove useless step in job

Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882

3569: Enhance Japanese language detection r=dureuill a=ManyTheFish

# Pull Request

This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore):

```bash
$ docker pull getmeili/meilisearch:prototype-better-language-detection-0
```

## Context

Some Languages are harder to detect than others, this miss-detection leads to bad tokenization making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese which has a completely different way of tokenization.

A [first iteration has been implemented for v1.1.0](#3347) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search.
Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing.

For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing, during the search the query `東京` will be detected as Japanese because only Japanese documents have been detected during indexing despite the fact that v1.0.2 would detect it as Chinese.
However if in the dataset there is at least one document containing a field with only Kanjis like:

_A document with only 1 field containing only Kanjis:_

```json
{
  "id":4,
  "name": "東京特許許可局"
}
```

_A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_

```json
{
  "id":105,
  "name": "東京特許許可局",
  "desc": "日経平均株価は26日に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。"
}
```

Then, in both cases, the field `name` will be detected as Chinese during indexing allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch.

## Technical Approach

The current PR partially fixes these issues by:

1) Adding a check over potential miss-detections and rerunning the extraction of the document forcing the tokenization over the main Languages detected in it.
   > 1) run a first extraction allowing the tokenizer to detect any Language in any Script
   > 2) generate a distribution of tokens by Script and Languages (`script_language`)
   > 3) if for a Script we have a token distribution of one of the Language that is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages
   > 4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if the Chinese were marginally detected compared to the Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document.
   > however, the text on another script like Latin will not be impacted by this restriction.
2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents

## Limits

This PR introduces 2 arbitrary thresholds:

1) during the indexing, a Language is considered miss-detected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK").
2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language.

This PR only partially fixes these issues:

- ✅ the query `東京` now find Japanese documents if less than 5% of documents are detected as Chinese.
- ✅ the document with the id `105` containing the Japanese field `desc` but the miss-detected field `name` is now completely detected and tokenized as Japanese and is found with the query `東京`.
- ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search.

## Related issue

Fixes #3565

## Possible future enhancements

- Change or contribute to the Library used to detect the Language
  - the related issue on Whatlang: greyblake/whatlang-rs#122

Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
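The indexing-time threshold described in the PR text can be sketched roughly as follows. This is not Meilisearch's implementation (which is written in Rust); the function name `marginal_languages`, the input shape, and the script/language labels are assumptions made for illustration. Only the 10% figure comes from the PR description.

```python
from collections import defaultdict

# From the PR text: a Language is considered miss-detected when it accounts
# for under 10% of the tokens detected in the same Script.
INDEXING_THRESHOLD = 0.10


def marginal_languages(
    token_counts: dict[tuple[str, str], int],
    threshold: float = INDEXING_THRESHOLD,
) -> set[tuple[str, str]]:
    """Return (script, language) pairs whose share of their script's tokens is under the threshold."""
    per_script: dict[str, int] = defaultdict(int)
    for (script, _lang), count in token_counts.items():
        per_script[script] += count
    return {
        (script, lang)
        for (script, lang), count in token_counts.items()
        if per_script[script] > 0 and count / per_script[script] < threshold
    }


if __name__ == "__main__":
    # Hypothetical token distribution for one document.
    counts = {("Cj", "Jpn"): 95, ("Cj", "Cmn"): 5, ("Latin", "Eng"): 20}
    print(marginal_languages(counts))
```

In this hypothetical document, Chinese accounts for 5 of the 100 tokens in the shared script, so it falls under the threshold and would be forbidden on the second extraction pass, while the Latin-script text is unaffected. A document whose only CJ field is kanji-only has no dominant Japanese signal to fall back on, which matches the remaining ❌ case for document `4`.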

miiton added a commit to miiton/whatlang-rs that referenced this issue Sep 11, 2022

Add common-use kanji detection for Japanese. greyblake#122

508ae36

miiton added a commit to miiton/whatlang-rs that referenced this issue Sep 11, 2022

Add comment to common_use_kanji.rs greyblake#122

90c05d5

miiton mentioned this issue Sep 14, 2022

Bug with Japanese Kanji support meilisearch/meilisearch#2403

Closed

OuOu2021 mentioned this issue Feb 9, 2023

Add Kanji support pemistahl/lingua-rs#152

Open

This was referenced Mar 9, 2023

[v1.1.0-rc.0] Japanese documents cannot be searched properly with Kanji-only documents meilisearch/meilisearch#3565

Closed

Enhance Japanese language detection meilisearch/meilisearch#3569

Merged
