
Chinese search support: Improve query segmentation #3915

Closed · 5 tasks done · justin5267 opened this issue May 13, 2022 · 28 comments

Labels: change request (Issue requests a new feature or improvement), resolved (Issue is resolved, yet unreleased if open)

Comments
@justin5267

Contribution guidelines

I've found a bug and checked that ...

  • ... the problem doesn't occur with the mkdocs or readthedocs themes
  • ... the problem persists when all overrides are removed, i.e. custom_dir, extra_javascript and extra_css
  • ... the documentation does not mention anything about my problem
  • ... there are no open or closed issues that are related to my problem

Description

With the new search plugin, many Chinese results are missing, and the highlighting includes irrelevant content.

Expected behaviour

Chinese queries should be segmented correctly, so that all results containing the keyword are found and only the keyword is highlighted.

Actual behaviour

With the configuration below, entering the Chinese keyword “起诉” finds only 10+ results, and the entire paragraph where the keyword is located is highlighted.

plugins:
  - search:
        separator: '[\\s\\u200b\\-]'

When I remove the above configuration, the number of hits increases to 200+ and the highlighting is correct, but I cannot find any results for keywords longer than two Chinese characters.

Steps to reproduce

  1. Clone the Insiders repo to a local folder
  2. Enter the folder and run pip install -e .
  3. Run mkdocs serve
  4. Try a search
  5. Modify the separator parameter and search again

Package versions

  • Python: 3.9
  • MkDocs: 1.3.0
  • mkdocs-material: 8.2.14+insiders.4.15.0

Configuration

site_name: DIGLAWS
site_url: https://example.com/
theme: 
  name: material
  language: zh
  logo: assets/logo.png
  features:
    - navigation.instant
    - navigation.tabs
    - toc.follow
    - navigation.top
markdown_extensions:
  - toc:
      toc_depth: 6
plugins:
  - search:
        separator: '[\\s\\u200b\\-]'
  - awesome-pages

System information

  • Windows 10
  • Edge
@squidfunk (Owner) commented May 13, 2022

Thanks for reporting. Could you please provide some example phrases and explain precisely what you would expect and what is happening? Note that we use jieba for segmentation, so this might be worth reporting upstream.

> When I remove the above configuration, the number of hits increases to 200+ and the highlighting is correct, but I cannot find any results for keywords longer than two Chinese characters.

This might be related to the fact that segmentation is currently done during the build and not in the browser because Chinese support for lunr-languages does not work in the browser. Thus, we cannot segment the search phrase. This is something we can fix once lunr-languages Chinese language support works in the browser.

squidfunk added the needs input label on May 13, 2022
@justin5267 (Author)

Thank you for your prompt reply. I hope that when searching with two or more Chinese characters, all results containing the keywords can be found and the keywords are highlighted correctly.
(screenshots of search results attached)

@squidfunk (Owner) commented May 13, 2022

Thanks for coming back. I'm unable to understand or write Chinese, so I'll need actual text, not images. Please attach a *.zip file with a minimal reproducible example to this issue. Otherwise, I'm afraid I'm unable to help.

@squidfunk (Owner)

Side note: the explicit separator configuration was not working because there were redundant backslashes, which I added to normalize escape sequences. However, this was wrong, as those escapes are not necessary for single quotes. I've corrected all instances in 33e65f7 and also improved the search separator for the community edition for Chinese.
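For reference, the snippet from the issue with the redundant escapes removed would read as follows (the exact separator recommended for Chinese may have evolved since, so treat this as an illustration rather than the canonical value):

```yaml
plugins:
  - search:
      separator: '[\s\u200b\-]'
```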

@justin5267 (Author) commented May 14, 2022

After applying the newly modified parameters, there is still the problem of missing results. The results of reproducing the problem with the attached file are as follows:

| Keyword | Ctrl+F matches in the source file | Results reported by the search plugin | Keyword occurrences actually hit | Remarks |
| --- | --- | --- | --- | --- |
| 产品 | 22 | 1+7 | 16 | “产品” means “product” |
| 起诉 | 155 | 1+44 | 85 | “起诉” means “prosecute”, a high-frequency word in the legal field |
| 违法行为 | 5 | 1+2 | 5 | “违法行为” means “illegal act”, a high-frequency word in the legal field |
| 合同纠纷 | 105 | 1+16 | 35 | “合同纠纷” means “contract dispute”, a high-frequency word in the legal field |
| 诉讼标的 | 57 | 0 | 0 | “诉讼标的” means “litigation subject”, a low-frequency word in the legal field |
| 无因管理 | 2 | 0 | 0 | “无因管理” means “negotiorum gestio”, a low-frequency word in the legal field |

My requirement is precise search. That is to say, when entering "合同纠纷" as a query, there is no need to segment "合同纠纷"; instead, find all consecutive occurrences of the string "合同纠纷" in all documents, return the paragraph (or context) where the keyword is located, and highlight "合同纠纷" in it.

You may refer to [teedoc](https://github.com/teedoc/teedoc); its search plugin can be used as a reference. It can find all input keywords in all documents, and the search is very fast. However, when it returns the hits, all the different results are mixed into one paragraph and the extracted context is relatively short, which is inconvenient to read. I filed an issue under that project, but the author replied that he has no plans to update the search plugin in the near future.

I am not specialized in computer-related work, and this is my first time using GitHub, so please forgive any mistakes.

Attachment: 第一编 总则.zip

@squidfunk (Owner)

Thanks for providing further information. The thing is – without segmenting, the search will treat non-whitespace-separated content as whole words. You can still use a prefix wildcard and find words containing the characters, but it's probably not ideal. I'm not sure this can be fixed, because we cannot segment in the browser for the reasons stated. Somebody needs to port jieba to JavaScript. Sadly, I don't have the resources or expertise (remember, I'm not proficient in Chinese) to do it.

For this reason, you can do two things:

  1. When searching, segment the search query manually with whitespace
  2. Prepend a wildcard (*) to all searches by extending the theme and using query transformation
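A hypothetical sketch of the second option, illustrating only the string transformation (in Material for MkDocs this would be done in JavaScript through a theme override; the Python function below is not the theme's API):

```python
# Illustration only: prepend a leading wildcard to every whitespace-separated term,
# so that lunr also matches tokens that merely end with (rather than exactly equal)
# the entered characters.
def add_leading_wildcards(query: str) -> str:
    terms = query.split()
    return " ".join(term if term.startswith("*") else f"*{term}" for term in terms)

print(add_leading_wildcards("起诉"))       # -> "*起诉"
print(add_leading_wildcards("执行 异议"))  # -> "*执行 *异议"
```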

I know that this is not optimal, but please note that the new Chinese search is also still considered experimental. I'm very interested in improving it, and if we can find a way to segment in the browser, we can probably close the last usability gaps.

@justin5267 (Author)

Thank you for your explanation. Segmenting the search query manually with whitespace did solve the problem, but as you said, it is not optimal.

I guess the author of teedoc either did not segment the documents or chose to split them into single Chinese characters (the index.json structure is attached below), so that when you search for "abcd", teedoc finds all consecutive occurrences of "abcd", and then a fixed number of characters before and after every "abcd" is extracted as the result output.

In this way there will be some spurious results, such as hitting "abc de", but it can guarantee a 100% recall rate. On some occasions, such as finding regulations or contract terms, this is very important.
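A minimal sketch of this exact-substring idea, assuming plain-text documents held in memory (this is not teedoc's actual implementation, just an illustration of the approach described above):

```python
# Find every occurrence of the query string and return a fixed-length context
# window around each hit, guaranteeing 100% recall at the cost of some noise.
def substring_search(query: str, documents: dict[str, str], context: int = 30) -> list[tuple[str, str]]:
    hits = []
    for title, text in documents.items():
        start = text.find(query)
        while start != -1:
            lo = max(0, start - context)
            hi = start + len(query) + context
            hits.append((title, text[lo:hi]))
            start = text.find(query, start + len(query))
    return hits

# Example document text (illustrative only).
docs = {"example": "因合同纠纷提起的诉讼,由被告住所地或者合同履行地人民法院管辖。"}
print(substring_search("合同纠纷", docs))
```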

I like the Material theme very much, and I hope the search plugin can support this search mode; I don't know whether it can be achieved.

Some of the above ideas may be naive or inaccurate, but I have tried my best to understand the technical issues. Thank you again for your patience.

(screenshot of teedoc's index.json structure attached)

@squidfunk (Owner) commented May 22, 2022

So, thanks again for the detailed explanation, especially in #3915 (comment) – it was very helpful in troubleshooting.

I've improved the search recall rate by implementing a new segmentation approach that is based on the data of the search index. The idea is to train the segmenter with the segmentation markers that are present in the search index as a result of the build-time segmentation, in order to learn the different ways a search query can be segmented. While the recall rate should now be close to optimal (meaning that all possible segmentations should now be present in the query that is sent to lunr), accuracy might have suffered. I'm still trying to learn the best way to tackle this problem, but I think the new solution is already a step in the right direction, and it should almost certainly be better than segmenting the query at every character. For example, here are the segmentations for the examples you provided:

  • 产品 -> 产品
  • 起诉 -> 起, 起诉, 诉
  • 违法行为 -> 违法, 违法行为, 行为
  • 合同纠纷 -> 合同, 合同纠纷, 纠纷
  • 诉讼标的 -> 诉, 诉讼, 标的
  • 无因管理 -> 无, 无因, 因, 管理

As already noted, accuracy might not be optimal. I'm still trying to understand whether it might be better to always use the longest match and throw away prefixes, e.g. 无, which is a prefix of 无因. What would be a good strategy from your point of view? 79285fe2b includes the latest changes, so if you update Insiders again, you will get the new Chinese search.

Chinese search is still experimental. Let's improve it together!
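A minimal sketch of the index-trained segmentation idea, assuming the vocabulary is simply the set of tokens produced by build-time segmentation (the theme's actual implementation is in JavaScript and differs in detail; this only illustrates why the lists above contain every known sub-token):

```python
# Collect tokens emitted by build-time segmentation, then report every known
# token that occurs as a substring of an otherwise unsegmented query.
def segment_query(query: str, vocabulary: set[str], max_len: int = 8) -> list[str]:
    segments = []
    for start in range(len(query)):
        for end in range(start + 1, min(start + max_len, len(query)) + 1):
            candidate = query[start:end]
            if candidate in vocabulary:
                segments.append(candidate)
    return segments

# Tokens that build-time segmentation might have produced for the attached documents.
vocabulary = {"违法", "行为", "违法行为", "合同", "纠纷", "合同纠纷"}
print(segment_query("违法行为", vocabulary))  # ['违法', '违法行为', '行为']
print(segment_query("合同纠纷", vocabulary))  # ['合同', '合同纠纷', '纠纷']
```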

squidfunk changed the title from "Chinese segmentation doesn't work well" to "Chinese search support: Improve query segmentation" on May 22, 2022
squidfunk added the change request, needs help, and resolved labels and removed the needs input label on May 22, 2022
@justin5267 (Author)

Thank you for your hard work. For your question, I believe returning the longest match is sufficient in most cases. In my experience, a single Chinese character may express different meanings in different words and cannot help accurately locate the desired results.

In my opinion, the optimal behavior when searching for 违法行为 is that the search engine returns all 5 results containing 违法行为 (now done) and does not return results that contain no 违法行为 but only 违 (disobey), 法 (law), 违法 (illegal), 行 (move), 为 (for), or 行为 (conduct). That is to say, the search experience should be just the same as searching in Microsoft Word or Acrobat Reader.

@squidfunk (Owner) commented May 22, 2022

Thanks for your feedback. Just to be clear, you're asking for the following behavior:

  • 产品 -> 产品
  • 起诉 -> 起, 起诉, 诉
  • 违法行为 -> 违法, 违法行为, 行为
  • 合同纠纷 -> 合同, 合同纠纷, 纠纷
  • 诉讼标的 -> 诉, 诉讼, 标的
  • 无因管理 -> 无, 无因, 因, 管理

The tokens that are bold should be included, the others not, correct?

@justin5267 (Author)

The tokens that are bold should be included, the others not, but like this:
产品 -> 产品
起诉 -> 起, 起诉, 诉
违法行为 -> 违法, 违法行为, 行为
合同纠纷 -> 合同, 合同纠纷, 纠纷
诉讼标的 -> 诉, 诉讼, 标的, 诉讼标的
无因管理 -> 无, 无因, 因, 管理, 无因管理
今天天气不错 -> 今天天气不错

@squidfunk (Owner) commented May 22, 2022

> 诉讼标的 -> 诉, 诉讼, 标的, 诉讼标的
> 无因管理 -> 无, 无因, 因, 管理, 无因管理

The problem with these examples is that jieba seems to always split these sequences. At least in the document, there's no unsplit instance of them. For this reason, we'd tokenize into 无因 and 管理 here. I understand that this is not optimal, but I'm afraid we're going to have to make some trade-offs.

@squidfunk (Owner)

I've pushed the segment prefix omission logic in the last Insiders commit. I'll issue a release today, so the new Chinese search query segmentation can be tested appropriately 😊 The limitations from my last comment still apply, but I think the omission of prefixes greatly improves the accuracy.

@squidfunk (Owner)

Released as part of 8.2.15+insiders-4.15.2!

squidfunk removed the needs help label on May 22, 2022
@squidfunk (Owner) commented May 22, 2022

I just pushed another improvement to master, in which I refactored the implementation. The segmenter now segments with minimum overlap. This should further improve the recall rate, as overlapping tokens are now added to the query:

  • 中华人民共和国 -> 中华人民共和国
  • 中华人民共和国民 -> 中华人民共和国
  • 中华人民共和国民事 -> 中华人民共和国, 民事
  • 中华人民共和国民事诉 -> 中华人民共和国, 民事, 诉
  • 中华人民共和国民事诉讼 -> 中华人民共和国, 民事诉讼
  • 中华人民共和国民事诉讼法 -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法
  • 中华人民共和国民事诉讼法| -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法, |
  • 中华人民共和国民事诉讼法|总 -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法, |, 总
  • 中华人民共和国民事诉讼法|总则 -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法, |, 总则

The additional overlapping tokens are included because the same span can be segmented in more than one way, i.e. 民事诉讼法 could be segmented as 民事诉讼 and 法, or as 民事 and 诉讼法.

@justin5267 (Author)

I tried the latest version; it seems to widen the range of matches, and for me there is now too much noise to find what I'm looking for. Meilisearch's "phrase search" is exactly what I want, but due to jieba it occasionally misses some results (meilisearch/meilisearch#1714). Thanks for your explanation; it made me realize that Chinese search is not as easy as I thought. I'll keep watching and participating in testing, looking forward to a better experience.

@squidfunk (Owner)

We could revert the last change (adding overlapping matches), but I'd like to collect some feedback on whether other Chinese users see it the same way 😊 We could also make it configurable, but I'd be interested in other opinions.

@justin5267 (Author) commented May 27, 2022

When using Google, people usually do not know exactly what they are looking for, so it is important to find all relevant documents from vague keywords. In other cases, such as searching technical or legal documents, what the user wants most is to distinguish specific tokens from similar ones, so the search engine should not (unless it is smart enough) expand the query tokens.

Generally speaking, the current Chinese search is good for the first need on small document sets, but not suitable for the second. In order to do accurate Chinese full-text search, I tried the following things:

  1. I followed your suggestion and removed the query transformation, but when searching for "执行异议" it also hits "执行" and "异议". I noticed the author of lunr.js mentioned that it is possible to query exactly the tokens entered by deleting a certain piece of code (Exact phrase matching? olivernn/lunr.js#62, https://github.com/olivernn/lunr.js/blob/master/lib/index.js#L301), but I could not find such code in the new search plugin.

  2. I turned to the default MkDocs search and used spaces to separate each Chinese character when indexing, hoping that a search for "执行异议" would match the content "执 行 异 议", but the space-separated characters do not match.

Could you kindly let me know whether the above two directions are correct?

@squidfunk (Owner)

Thanks for your input. Regarding exact phrase matching – this is supported in Material for MkDocs. Try searching for:

+执行 +异议

or

"执行异议"

Other than that, could you provide a small Chinese text that shows your issue? You could put content in different sections and explain which query terms should find which section and which section should not be found by the search.

@justin5267 (Author)

Take the following content as an example:
执行权是人民法院依法采取各类执行措施以及对执行异议、复议、申诉等事项进行审查的权力,包括执行实施权和执行审查权。

  1. Its segmentation result with jieba.cut is as follows (a snippet to reproduce it is included after this list):
    '执行权', '是', '人民法院', '依法', '采取', '各类', '执行', '措施', '以及', '对', '执行', '异议', '、', '复议', '、', '申诉', '等', '事项', '进行', '审查', '的', '权力', ',', '包括', '执行', '实施', '权', '和', '执行', '审查', '权', '。'

  2. When searching for 执行, it hits 执行 and 执行权; this is as expected.

  3. When searching for 执行权, it hits 执行权 exactly; this is as expected.

  4. When searching for 执行异议, a fixed phrase not recognized by jieba, the query is split into 执行 and 异议 and the parts are queried separately, so the results hit 执行, 执行权 and 执行异议. This is OK if the phrase search function mentioned below works.

  5. When searching for "执行异议" (with quotes), I hoped it would hit only 执行异议, but the result is no different from searching without quotes: 执行权 and 执行 are also hit. This is not as expected.
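For reference, a minimal snippet to reproduce the segmentation in item 1 (exact output depends on the jieba version and dictionary in use):

```python
import jieba

text = "执行权是人民法院依法采取各类执行措施以及对执行异议、复议、申诉等事项进行审查的权力,包括执行实施权和执行审查权。"
print(list(jieba.cut(text)))  # jieba's default (accurate) mode
```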

@squidfunk (Owner)

Thanks! I'll see how we can improve it. Reopening.

squidfunk reopened this on May 28, 2022
@squidfunk (Owner)

Okay, now I understand what the problem is. If the text contains 执行, 异议 and 执行异议, and jieba doesn't recognize 执行异议 as a separate token that should not be cut, Material for MkDocs will only see 执行 and 异议. The issue you linked regarding lunr.js is actually the problem – no span query support. This means that we cannot formulate a search that says 执行 immediately followed by 异议. If lunr.js supports it in the future, we can add support as well.

Sadly, nothing we can fix right now. AFAICT, this should have no drawback on recall, but accuracy is slightly degraded.

@justin5267 (Author)

I'm disappointed to hear this, but thank you for your patience and clear explanation.

@Fusyong commented Jun 14, 2022

You can get as many words as possible from a string with jieba.cut_for_search(); see the jieba README, like this:

```python
import jieba

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode (搜索引擎模式)
print(", ".join(seg_list))
```

@squidfunk (Owner) commented Jun 15, 2022

Yes, we could theoretically use that part of the API, but the problem is that the search results will then contain all those words, as we use the search_index.json not only for searching, but also to render the results. This would mean that words would be repeated multiple times in the search results.

@Fusyong commented Jun 15, 2022

> This would mean that words would be repeated multiple times in the search results.

Yes. However, on the search results page we are not reading flowing text; we are mainly concerned with the matched words.

On my blog I use jieba.cut_for_search() for the body, and the results are acceptable. For titles I use jieba.cut(title, cut_all=False); the results are more readable, essentially the original titles with some spaces, which is good for identifying the article, but they often fail to match (which is bad in cases where accuracy is required). As follows:

(screenshot of the blog's search results attached)

Perhaps it would be more appropriate to give users two toggle switches?

@squidfunk (Owner) commented Jun 15, 2022

Yes, we can add a flag to switch between cut and cut_for_search. Please create a new issue explaining the problem, with a minimal reproducible example, so we have something to test against. Note, however, that this will completely alter the search results. If that's fine, we can do it, but I'm not proficient enough in Chinese to estimate whether it makes sense.

Edit

> Perhaps it would be more appropriate to give users two toggle switches?

You mean the user that is using your site? In the browser? That would mean we would need two search indexes, one which is cut for search, one which isn't. I'm sorry, but this is not practical, so I'm not for adding this functionality. It's also something that would only be needed for Chinese (AFAIK), so a very limited use case.

@squidfunk (Owner)

BTW, note that now custom dictionaries for jieba are supported, which you can use to adjust segmentation.
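For example, a user dictionary can teach jieba domain terms such as 执行异议 or 无因管理 so they are kept as single tokens. A hedged sketch, assuming the plugin option is called jieba_dict_user (check the plugin documentation for the exact option name and format):

```yaml
plugins:
  - search:
      separator: '[\s\u200b\-]'
      jieba_dict_user: user_dict.txt
```

Here user_dict.txt would use jieba's standard user dictionary format, one entry per line with an optional frequency and part-of-speech tag:

```
执行异议 5 n
无因管理 5 n
诉讼标的 5 n
```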
