
Chinese search support: Improve query segmentation #3915

Closed · 5 tasks done · justin5267 opened this issue May 13, 2022 · 28 comments

Labels: change request (Issue requests a new feature or improvement), resolved (Issue is resolved, yet unreleased if open)

Comments
@justin5267

Contribution guidelines

I've found a bug and checked that ...

  • ... the problem doesn't occur with the mkdocs or readthedocs themes
  • ... the problem persists when all overrides are removed, i.e. custom_dir, extra_javascript and extra_css
  • ... the documentation does not mention anything about my problem
  • ... there are no open or closed issues that are related to my problem

Description

With the new search plugin, many Chinese results are missing, and the highlighting includes irrelevant content.

Expected behaviour

Chinese queries should be segmented correctly, so that all results containing the keyword are found and only the keyword is highlighted.

Actual behaviour

With the configuration below, entering the Chinese keyword “起诉” finds only 10+ results, and the entire paragraph where the keyword is located is highlighted.

plugins:
  - search:
        separator: '[\\s\\u200b\\-]'

When I remove the above configuration, the number of hits increases to 200+ and the highlighting is correct, but I cannot find any results for keywords longer than two Chinese characters.

Steps to reproduce

  1. Clone the Insiders repo to a local folder
  2. Enter the folder and run pip install -e .
  3. Run mkdocs serve
  4. Try a search
  5. Modify the separator parameter and search again

Package versions

  • Python: 3.9
  • MkDocs: 1.3.0
  • mkdocs-material: 8.2.14+insiders.4.15.0

Configuration

site_name: DIGLAWS
site_url: https://example.com/
theme: 
  name: material
  language: zh
  logo: assets/logo.png
  features:
    - navigation.instant
    - navigation.tabs
    - toc.follow
    - navigation.top
markdown_extensions:
  - toc:
      toc_depth: 6
plugins:
  - search:
        separator: '[\\s\\u200b\\-]'
  - awesome-pages

System information

  • Windows 10
  • Edge
@squidfunk (Owner) commented May 13, 2022

Thanks for reporting. Could you please provide some example phrases and explain precisely what you would expect and what is happening? Note that we use jieba for segmentation, so this might be worth reporting upstream.

> When I remove the above configuration, the number of hits increases to 200+ and the highlighting is correct, but I cannot find any results for keywords longer than two Chinese characters.

This might be related to the fact that segmentation is currently done during the build and not in the browser because Chinese support for lunr-languages does not work in the browser. Thus, we cannot segment the search phrase. This is something we can fix once lunr-languages Chinese language support works in the browser.

squidfunk added the needs input label on May 13, 2022
@justin5267 (Author)

Thank you for your prompt reply. I hope that when searching with two or more Chinese characters, all results containing the keywords can be found and the keywords are highlighted correctly.
(screenshots of search results attached)

@squidfunk (Owner) commented May 13, 2022

Thanks for coming back. I'm unable to understand or write Chinese, so I'll need actual text, not images. Please attach a *.zip file with a minimal reproducible example to this issue. Otherwise, I'm afraid I'm unable to help.

@squidfunk (Owner)

Side note: the explicit separator configuration was not working because there were redundant backslashes, which I added to normalize escape sequences. However, this was wrong, as those escapes are not necessary for single quotes. I've corrected all instances in 33e65f7 and also improved the search separator for the community edition for Chinese.
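For reference, the snippet from the issue with the redundant escapes removed would read as follows (the exact separator recommended for Chinese may have evolved since, so treat this as an illustration rather than the canonical value):

```yaml
plugins:
  - search:
      separator: '[\s\u200b\-]'
```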

@justin5267 (Author) commented May 14, 2022

After applying the newly modified parameters, there is still the problem of missing results. The results of reproducing the problem with the attached file are as follows:

| Keyword | Ctrl+F matches in the source file | Results reported by the search plugin | Keyword occurrences actually hit | Remarks |
| --- | --- | --- | --- | --- |
| 产品 | 22 | 1+7 | 16 | “产品” means “product” |
| 起诉 | 155 | 1+44 | 85 | “起诉” means “prosecute”, a high-frequency word in the legal field |
| 违法行为 | 5 | 1+2 | 5 | “违法行为” means “illegal act”, a high-frequency word in the legal field |
| 合同纠纷 | 105 | 1+16 | 35 | “合同纠纷” means “contract dispute”, a high-frequency word in the legal field |
| 诉讼标的 | 57 | 0 | 0 | “诉讼标的” means “litigation subject”, a low-frequency word in the legal field |
| 无因管理 | 2 | 0 | 0 | “无因管理” means “negotiorum gestio”, a low-frequency word in the legal field |

My requirement is precise search. That is to say, when entering "合同纠纷" as a query, there is no need to segment "合同纠纷"; instead, find all consecutive occurrences of the string "合同纠纷" in all documents, return the paragraph (or context) where the keyword is located, and highlight "合同纠纷" in it.

You may refer to [teedoc](https://github.com/teedoc/teedoc); its search plugin can be used as a reference. It can find all input keywords in all documents, and the search is very fast. However, when it returns the hits, all the different results are mixed into one paragraph and the extracted context is relatively short, which is inconvenient to read. I filed an issue under that project, but the author replied that he has no plans to update the search plugin in the near future.

I am not specialized in computer-related work, and this is my first time using GitHub, so please forgive any mistakes.

Attachment: 第一编 总则.zip

@squidfunk (Owner)

Thanks for providing further information. The thing is – without segmenting, the search will treat non-whitespace-separated content as whole words. You can still use a prefix wildcard and find words containing the characters, but it's probably not ideal. I'm not sure this can be fixed, because we cannot segment in the browser for the reasons stated. Somebody needs to port jieba to JavaScript. Sadly, I don't have the resources or expertise (remember, I'm not proficient in Chinese) to do it.

For this reason, you can do two things:

  1. When searching, segment the search query manually with whitespace
  2. Prepend a wildcard (*) to all searches by extending the theme and using query transformation
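A hypothetical sketch of the second option, illustrating only the string transformation (in Material for MkDocs this would be done in JavaScript through a theme override; the Python function below is not the theme's API):

```python
# Illustration only: prepend a leading wildcard to every whitespace-separated term,
# so that lunr also matches tokens that merely end with (rather than exactly equal)
# the entered characters.
def add_leading_wildcards(query: str) -> str:
    terms = query.split()
    return " ".join(term if term.startswith("*") else f"*{term}" for term in terms)

print(add_leading_wildcards("起诉"))       # -> "*起诉"
print(add_leading_wildcards("执行 异议"))  # -> "*执行 *异议"
```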

I know that this is not optimal, but please note that the new Chinese search is also still considered experimental. I'm very interested in improving it, and if we can find a way to segment in the browser, we can probably close the last usability gaps.

@justin5267 (Author)

Thank you for your explanation. Segmenting the search query manually with whitespace did solve the problem, but as you said, it is not optimal.

I guess the author of teedoc either did not segment the documents or chose to split them into single Chinese characters (the index.json structure is attached below), so that when you search for "abcd", teedoc finds all consecutive occurrences of "abcd", and then a fixed number of characters before and after every "abcd" is extracted as the result output.

In this way there will be some spurious results, such as hitting "abc de", but it can guarantee a 100% recall rate. On some occasions, such as finding regulations or contract terms, this is very important.
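A minimal sketch of this exact-substring idea, assuming plain-text documents held in memory (this is not teedoc's actual implementation, just an illustration of the approach described above):

```python
# Find every occurrence of the query string and return a fixed-length context
# window around each hit, guaranteeing 100% recall at the cost of some noise.
def substring_search(query: str, documents: dict[str, str], context: int = 30) -> list[tuple[str, str]]:
    hits = []
    for title, text in documents.items():
        start = text.find(query)
        while start != -1:
            lo = max(0, start - context)
            hi = start + len(query) + context
            hits.append((title, text[lo:hi]))
            start = text.find(query, start + len(query))
    return hits

# Example document text (illustrative only).
docs = {"example": "因合同纠纷提起的诉讼,由被告住所地或者合同履行地人民法院管辖。"}
print(substring_search("合同纠纷", docs))
```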

I like the Material theme very much, and I hope the search plugin can support this search mode; I don't know whether it can be achieved.

Some of the above ideas may be naive or inaccurate, but I have tried my best to understand the technical issues. Thank you again for your patience.

(screenshot of teedoc's index.json structure attached)

@squidfunk (Owner) commented May 22, 2022

So, thanks again for the detailed explanation, especially in #3915 (comment) – it was very helpful in troubleshooting.

I've improved the search recall rate by implementing a new segmentation approach that is based on the data of the search index. The idea is to train the segmenter with the segmentation markers that are present in the search index as a result of the build-time segmentation, in order to learn the different ways a search query can be segmented. While the recall rate should now be close to optimal (meaning that all possible segmentations should now be present in the query that is sent to lunr), accuracy might have suffered. I'm still trying to learn the best way to tackle this problem, but I think the new solution is already a step in the right direction, and it should almost certainly be better than segmenting the query at every character. For example, here are the segmentations for the examples you provided:

  • 产品 -> 产品
  • 起诉 -> 起, 起诉, 诉
  • 违法行为 -> 违法, 违法行为, 行为
  • 合同纠纷 -> 合同, 合同纠纷, 纠纷
  • 诉讼标的 -> 诉, 诉讼, 标的
  • 无因管理 -> 无, 无因, 因, 管理

As already noted, accuracy might not be optimal. I'm still trying to understand whether it might be better to always use the longest match and throw away prefixes, e.g. 无, which is a prefix of 无因. What would be a good strategy from your point of view? 79285fe2b includes the latest changes, so if you update Insiders again, you will get the new Chinese search.

Chinese search is still experimental. Let's improve it together!
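A minimal sketch of the index-trained segmentation idea, assuming the vocabulary is simply the set of tokens produced by build-time segmentation (the theme's actual implementation is in JavaScript and differs in detail; this only illustrates why the lists above contain every known sub-token):

```python
# Collect tokens emitted by build-time segmentation, then report every known
# token that occurs as a substring of an otherwise unsegmented query.
def segment_query(query: str, vocabulary: set[str], max_len: int = 8) -> list[str]:
    segments = []
    for start in range(len(query)):
        for end in range(start + 1, min(start + max_len, len(query)) + 1):
            candidate = query[start:end]
            if candidate in vocabulary:
                segments.append(candidate)
    return segments

# Tokens that build-time segmentation might have produced for the attached documents.
vocabulary = {"违法", "行为", "违法行为", "合同", "纠纷", "合同纠纷"}
print(segment_query("违法行为", vocabulary))  # ['违法', '违法行为', '行为']
print(segment_query("合同纠纷", vocabulary))  # ['合同', '合同纠纷', '纠纷']
```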

squidfunk changed the title from "Chinese segmentation doesn't work well" to "Chinese search support: Improve query segmentation" on May 22, 2022
squidfunk added the change request, needs help, and resolved labels and removed the needs input label on May 22, 2022
@justin5267 (Author)

Thank you for your hard work. For your question, I believe returning the longest match is sufficient in most cases. In my experience, a single Chinese character may express different meanings in different words and cannot help accurately locate the desired results.

In my opinion, the optimal behavior when searching for 违法行为 is that the search engine returns all 5 results containing 违法行为 (now done) and does not return results that contain no 违法行为 but only 违 (disobey), 法 (law), 违法 (illegal), 行 (move), 为 (for), or 行为 (conduct). That is to say, the search experience should be just the same as searching in Microsoft Word or Acrobat Reader.

@squidfunk (Owner) commented May 22, 2022

Thanks for your feedback. Just to be clear, you're asking for the following behavior:

  • 产品 -> 产品
  • 起诉 -> 起, 起诉, 诉
  • 违法行为 -> 违法, 违法行为, 行为
  • 合同纠纷 -> 合同, 合同纠纷, 纠纷
  • 诉讼标的 -> 诉, 诉讼, 标的
  • 无因管理 -> 无, 无因, 因, 管理

The tokens that are bold should be included, the others not, correct?

@justin5267 (Author)

The tokens that are bold should be included, the others not, but like this:
产品 -> 产品
起诉 -> 起, 起诉, 诉
违法行为 -> 违法, 违法行为, 行为
合同纠纷 -> 合同, 合同纠纷, 纠纷
诉讼标的 -> 诉, 诉讼, 标的, 诉讼标的
无因管理 -> 无, 无因, 因, 管理, 无因管理
今天天气不错 -> 今天天气不错

@squidfunk (Owner) commented May 22, 2022

> 诉讼标的 -> 诉, 诉讼, 标的, 诉讼标的
> 无因管理 -> 无, 无因, 因, 管理, 无因管理

The problem with these examples is that jieba seems to always split these sequences. At least in the document, there's no unsplit instance of them. For this reason, we'd tokenize into 无因 and 管理 here. I understand that this is not optimal, but I'm afraid we're going to have to make some trade-offs.

@squidfunk (Owner)

I've pushed the segment prefix omission logic in the last Insiders commit. I'll issue a release today, so the new Chinese search query segmentation can be tested appropriately 😊 The limitations from my last comment still apply, but I think the omission of prefixes greatly improves the accuracy.

@squidfunk (Owner)

Released as part of 8.2.15+insiders-4.15.2!

squidfunk removed the needs help label on May 22, 2022
@squidfunk (Owner) commented May 22, 2022

I just pushed another improvement to master, in which I refactored the implementation. The segmenter now segments with minimum overlap. This should further improve the recall rate, as overlapping tokens are now added to the query:

  • 中华人民共和国 -> 中华人民共和国
  • 中华人民共和国民 -> 中华人民共和国
  • 中华人民共和国民事 -> 中华人民共和国, 民事
  • 中华人民共和国民事诉 -> 中华人民共和国, 民事, 诉
  • 中华人民共和国民事诉讼 -> 中华人民共和国, 民事诉讼
  • 中华人民共和国民事诉讼法 -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法
  • 中华人民共和国民事诉讼法| -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法, |
  • 中华人民共和国民事诉讼法|总 -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法, |, 总
  • 中华人民共和国民事诉讼法|总则 -> 中华人民共和国, 民事, 民事诉讼, 诉讼法, 法, |, 总则

The additional overlapping tokens are included because the same span can be segmented in more than one way, i.e. 民事诉讼法 could be segmented as 民事诉讼 and 法, or as 民事 and 诉讼法.

@justin5267 (Author)

I tried the latest version; it seems to widen the range of matches, and for me there is now too much noise to find what I'm looking for. Meilisearch's "phrase search" is exactly what I want, but due to jieba it occasionally misses some results (meilisearch/meilisearch#1714). Thanks for your explanation; it made me realize that Chinese search is not as easy as I thought. I'll keep watching and participating in testing, looking forward to a better experience.

@squidfunk (Owner)

We could revert the last change (adding overlapping matches), but I'd like to collect some feedback on whether other Chinese users see it the same way 😊 We could also make it configurable, but I'd be interested in other opinions.

@justin5267 (Author) commented May 27, 2022

When using Google, people usually do not know exactly what they are looking for, so it is important to find all relevant documents from vague keywords. In other cases, such as searching technical or legal documents, what the user wants most is to distinguish specific tokens from similar ones, so the search engine should not (unless it is smart enough) expand the query tokens.

Generally speaking, the current Chinese search is good for the first need on small document sets, but not suitable for the second. In order to do accurate Chinese full-text search, I tried the following things:

  1. I followed your suggestion and removed the query transformation, but when searching for "执行异议" it also hits "执行" and "异议". I noticed the author of lunr.js mentioned that it is possible to query exactly the tokens entered by deleting a certain piece of code (Exact phrase matching? olivernn/lunr.js#62, https://github.com/olivernn/lunr.js/blob/master/lib/index.js#L301), but I could not find such code in the new search plugin.

  2. I turned to the default MkDocs search and used spaces to separate each Chinese character when indexing, hoping that a search for "执行异议" would match the content "执 行 异 议", but the space-separated characters do not match.

Could you kindly let me know whether the above two directions are correct?

@squidfunk (Owner)

Thanks for your input. Regarding exact phrase matching – this is supported in Material for MkDocs. Try searching for:

+执行 +异议

or

"执行异议"

Other than that, could you provide a small Chinese text that shows your issue? You could put content in different sections and explain which query terms should find which section and which section should not be found by the search.

@justin5267 (Author)

Take the following content as an example:
执行权是人民法院依法采取各类执行措施以及对执行异议、复议、申诉等事项进行审查的权力,包括执行实施权和执行审查权。

  1. Its segmentation result with jieba.cut is as follows (a snippet to reproduce it is included after this list):
    '执行权', '是', '人民法院', '依法', '采取', '各类', '执行', '措施', '以及', '对', '执行', '异议', '、', '复议', '、', '申诉', '等', '事项', '进行', '审查', '的', '权力', ',', '包括', '执行', '实施', '权', '和', '执行', '审查', '权', '。'

  2. When searching for 执行, it hits 执行 and 执行权; this is as expected.

  3. When searching for 执行权, it hits 执行权 exactly; this is as expected.

  4. When searching for 执行异议, a fixed phrase not recognized by jieba, the query is split into 执行 and 异议 and the parts are queried separately, so the results hit 执行, 执行权 and 执行异议. This is OK if the phrase search function mentioned below works.

  5. When searching for "执行异议" (with quotes), I hoped it would hit only 执行异议, but the result is no different from searching without quotes: 执行权 and 执行 are also hit. This is not as expected.
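For reference, a minimal snippet to reproduce the segmentation in item 1 (exact output depends on the jieba version and dictionary in use):

```python
import jieba

text = "执行权是人民法院依法采取各类执行措施以及对执行异议、复议、申诉等事项进行审查的权力,包括执行实施权和执行审查权。"
print(list(jieba.cut(text)))  # jieba's default (accurate) mode
```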

@squidfunk (Owner)

Thanks! I'll see how we can improve it. Reopening.

squidfunk reopened this on May 28, 2022
@squidfunk (Owner)

Okay, now I understand what the problem is. If the text contains 执行, 异议 and 执行异议, and jieba doesn't recognize 执行异议 as a separate token that should not be cut, Material for MkDocs will only see 执行 and 异议. The issue you linked regarding lunr.js is actually the problem – no span query support. This means that we cannot formulate a search that says 执行 immediately followed by 异议. If lunr.js supports it in the future, we can add support as well.

Sadly, nothing we can fix right now. AFAICT, this should have no drawback on recall, but accuracy is slightly degraded.

@justin5267 (Author)

I'm disappointed to hear this, but thank you for your patience and clear explanation.

@Fusyong commented Jun 14, 2022

You can get as many words as possible from a string with jieba.cut_for_search(); see the jieba README, like this:

```python
import jieba

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # search engine mode (搜索引擎模式)
print(", ".join(seg_list))
```

@squidfunk (Owner) commented Jun 15, 2022

Yes, we could theoretically use that part of the API, but the problem is that the search results will then contain all those words, as we use the search_index.json not only for searching, but also to render the results. This would mean that words would be repeated multiple times in the search results.

@Fusyong commented Jun 15, 2022

> This would mean that words would be repeated multiple times in the search results.

Yes. However, on the search results page we are not reading flowing text; we are mainly concerned with the matched words.

On my blog I use jieba.cut_for_search() for the body, and the results are acceptable. For titles I use jieba.cut(title, cut_all=False); the results are more readable, essentially the original titles with some spaces, which is good for identifying the article, but they often fail to match (which is bad in cases where accuracy is required). As follows:

(screenshot of the blog's search results attached)

Perhaps it would be more appropriate to give users two toggle switches?

@squidfunk (Owner) commented Jun 15, 2022

Yes, we can add a flag to switch between cut and cut_for_search. Please create a new issue explaining the problem, with a minimal reproducible example, so we have something to test against. Note, however, that this will completely alter the search results. If that's fine, we can do it, but I'm not proficient enough in Chinese to estimate whether it makes sense.

Edit

> Perhaps it would be more appropriate to give users two toggle switches?

You mean the user that is using your site? In the browser? That would mean we would need two search indexes, one which is cut for search, one which isn't. I'm sorry, but this is not practical, so I'm not for adding this functionality. It's also something that would only be needed for Chinese (AFAIK), so a very limited use case.

@squidfunk (Owner)

BTW, note that now custom dictionaries for jieba are supported, which you can use to adjust segmentation.
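For example, a user dictionary can teach jieba domain terms such as 执行异议 or 无因管理 so they are kept as single tokens. A hedged sketch, assuming the plugin option is called jieba_dict_user (check the plugin documentation for the exact option name and format):

```yaml
plugins:
  - search:
      separator: '[\s\u200b\-]'
      jieba_dict_user: user_dict.txt
```

Here user_dict.txt would use jieba's standard user dictionary format, one entry per line with an optional frequency and part-of-speech tag:

```
执行异议 5 n
无因管理 5 n
诉讼标的 5 n
```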
