Bug with Japanese Kanji support #2403

Closed
curquiza opened this issue May 17, 2022 Discussed in #2391 · 10 comments · Fixed by #3347
Labels: bug (Something isn't working as expected), language (Anything related to languages), tokenizer (Related to the tokenizer repo: https://github.com/meilisearch/tokenizer/), v1.1.0 (PRs/issues solved in v1.1.0, released on 2023-04-03)

Comments

@curquiza
Member

curquiza commented May 17, 2022

Discussed in #2391

Originally posted by toomozoo May 13, 2022
Thank you for supporting Japanese in version 0.27.0!
I tried several Kanji searches and did not get the results I expected.
Do I need to configure any settings to search in Japanese?

% brew install meilisearch
% meilisearch -V
meilisearch-http 0.27.0
% irb
require 'json'
require 'meilisearch'

MeiliSearch::VERSION
=> "0.18.3"

client = MeiliSearch::Client.new('http://127.0.0.1:7700')
json = JSON.parse('[{"id": "1","name": "東京バナナ"},{"id": "2","name": "東京 ポテチ"}]')
client.index('test').add_documents(json)

# NG
client.index('test').search("東")
=> {"hits"=>[], "nbHits"=>0, "exhaustiveNbHits"=>false, "query"=>"東", "limit"=>20, "offset"=>0, "processingTimeMs"=>0}

# NG
client.index('test').search("東京")
=> {"hits"=>[], "nbHits"=>0, "exhaustiveNbHits"=>false, "query"=>"東京", "limit"=>20, "offset"=>0, "processingTimeMs"=>0}

# OK
client.index('test').search("バ")
=> {"hits"=>[{"id"=>"1", "name"=>"東京バナナ"}], "nbHits"=>1, "exhaustiveNbHits"=>false, "query"=>"バ", "limit"=>20, "offset"=>0, "processingTimeMs"=>0}

---

## TODO

- [ ] Implement changes in [Milli](https://github.com/meilisearch/milli/): https://github.com/meilisearch/meilisearch/issues/3357
- [ ] Release a Milli version containing these changes
- [ ] Bump this new Milli version in Meilisearch and merge it into `main`
@curquiza added the bug and tokenizer labels May 17, 2022
@curquiza
Member Author

For people following this issue: we are focusing on our v0.28.0 sprint (see the Milestones), and we will investigate once the prioritized tasks are done.

@curquiza changed the title from "Is Japanese Kanji supported?" to "Bug with Japanese Kanji support" May 18, 2022
@miiton

miiton commented Aug 19, 2022

Yes, this is a problem with "Kanji-only strings" and I know whatlang-rs needs to do something about it.
However, I don't know of a way to fully distinguish Japanese Kanji from Simplified Chinese.
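
To illustrate why Kanji-only strings are hard to classify, here is a minimal sketch based on Unicode script properties. It is not how whatlang-rs or the Meilisearch tokenizer work; `script_guess` is a hypothetical helper written for this example. It only shows that a string made solely of Han characters carries no script-level signal that separates Japanese from Chinese, whereas any kana character does.

```ruby
# Hypothetical helper for illustration only; not part of whatlang-rs or Meilisearch.
# Returns :japanese when the text contains kana, :ambiguous_cjk when it contains
# only Han characters (shared by Chinese and Japanese), and :other otherwise.
def script_guess(text)
  has_kana = text.match?(/[\p{Hiragana}\p{Katakana}]/)
  has_han  = text.match?(/\p{Han}/)

  if has_kana
    :japanese
  elsif has_han
    :ambiguous_cjk
  else
    :other
  end
end

# The queries from the original report, plus one full document name.
%w[東 東京 バ 東京バナナ].each do |query|
  puts "#{query}: #{script_guess(query)}"
end
```

With this sketch, the two queries that returned no hits in the report (東 and 東京) are the ones classified as `:ambiguous_cjk`, while バ is classified as `:japanese`.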

@miiton

miiton commented Sep 14, 2022

Here is a write-up of what I have found out.

I hope it will be useful to the team.

@ManyTheFish
Member

ManyTheFish commented Sep 15, 2022

Hello @miiton, you're right!
The issue comes from the normalization process.
Currently, we are discussing the redesign of the Chinese normalization to be phonologically oriented.
Because Chinese languages and Japanese share some characters, it is difficult to detect the language when the provided string is short (like a query), so we could think of unifying the normalization for these languages to avoid this kind of issue. 🤔 We just need to find an accurate phonological representation of Japanese, like Hanyu Pinyin for Mandarin.

I've created a dedicated discussion on the product repository to centralize feedback about Japanese support.
I will feed this discussion with the current behavior, a link to this issue, and the possible enhancements we can make in charabia and its dependencies, like whatlang or Lindera.

I've read your issue on whatlang; it's an interesting enhancement, by the way.

@miiton

miiton commented Sep 16, 2022

@ManyTheFish

Thank you, I'll watch that discussion 😊

@curquiza
Member Author

curquiza commented Nov 3, 2022

#3357 needs to be done to fix this.

Because of the heavy workload during the previous sprint (plus Hacktoberfest), we will not be able to integrate the new charabia version into milli, and then into Meilisearch, for v0.30.0. It will be integrated in the next version (v1).

Sorry for the inconvenience, everyone!

@curquiza
Member Author

This will be fixed in v1.1.0; there was not enough time during the v1.0 sprint, sorry!
The fix is already on the way, see: meilisearch/milli#749

@curquiza curquiza modified the milestones: v1.0.0, v1.1.0 Jan 16, 2023
@bors bors bot closed this as completed in 3940788 Feb 21, 2023
@curquiza
Member Author

curquiza commented Mar 7, 2023

Hello everyone following this issue 👋

We have just released the first RC (release candidate) of Meilisearch containing this new fix!
You can test it by running:

docker run -it --rm -p 7700:7700 -v $(pwd)/meili_data:/meili_data getmeili/meilisearch:v1.1.0-rc.0

If you still have the bug, please let us know!
Thanks in advance for your help and your involvement in Meilisearch ❤️

🎉 Official and stable release containing this change will be available on 3rd April 2023

⚠️ RC (release candidates) are not recommended for production
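
To re-check the original report against the RC, a small helper along these lines may be useful. This is a sketch under stated assumptions: `missing_hits` and `EXPECTED` are hypothetical names, and the expected ids are an assumption about what the reporter wanted to see (both documents for 東 and 東京, document 1 for バ). The only client call shown is the `client.index('test').search(...)` call used in the report, and the helper accepts any callable so it can also be tried without a server.

```ruby
# Assumed expectations for the two documents from the original report.
EXPECTED = {
  "東"   => %w[1 2],
  "東京" => %w[1 2],
  "バ"   => %w[1]
}.freeze

# `search` is any callable that takes a query string and returns a response
# hash shaped like the ones in the report: {"hits" => [{"id" => ...}, ...]}.
# Returns a hash of query => missing ids; an empty hash means nothing is missing.
def missing_hits(search, expected = EXPECTED)
  expected.each_with_object({}) do |(query, ids), failures|
    found = search.call(query).fetch("hits", []).map { |hit| hit["id"] }
    missing = ids - found
    failures[query] = missing unless missing.empty?
  end
end

# Stub that replays the responses reported on 0.27.0 (no server needed).
reported = lambda do |query|
  hits = query == "バ" ? [{ "id" => "1", "name" => "東京バナナ" }] : []
  { "hits" => hits }
end

p missing_hits(reported)

# Against a live instance, assuming the 'test' index from the report is populated:
#   client = MeiliSearch::Client.new('http://127.0.0.1:7700')
#   p missing_hits(->(q) { client.index('test').search(q) })
```

With the replayed 0.27.0 responses, the helper flags 東 and 東京 as missing documents 1 and 2. An empty result against the RC would suggest the reported behavior is gone for these queries.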

@Rhilip

Rhilip commented Mar 7, 2023

#3508
See my test case: Japanese also cannot be searched properly in an index with Chinese documents. Will v1.1.0 fix this bug?

@curquiza
Member Author

curquiza commented Mar 7, 2023

Hello @Rhilip, sorry for the delay on the discussion you linked.
Can you test the RC and let us know if the bug you have in #3508 is still there?

@meili-bot meili-bot added the v1.1.0 PRs/issues solved in v1.1.0 released on 2023-04-03 label Apr 6, 2023