New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sanitizing search entry titles #3560
base: master
Are you sure you want to change the base?
Conversation
I can't make out the issue with the 2 remaining mypy errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no knowledge of the search plugin so I'll mostly just trust you on this one. I did notice several things that stood out to me generally though.
Co-authored-by: Oleh Prypin <oleh@pryp.in>
The current lint errors appear to be related to a new markdownlint rule that was added in the most recent release. Presumably the various tables in the documentation need to be cleanup up and/or the default for the new rule needs to be changed to match the style used. Regardless, that has no effect on the changes here and I am ignoring it in this PR. |
Thanks for the feedback. I have made all of the changes... except for the one where we are meeting lunr.js' API. I think this is ready to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good!
Just to note:
- I intend to merge this myself
- I intend to first check Ensure page titles from rendered Markdown do not contain any markup #3546
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been looking at the "keywords" feature and I'm really having second thoughts on it.
-
I don't see how the changes in this pull request cause a need for this new feature. I don't think they do.
For# `foo(bar=0)`
the behavior both before and after is to indexfoo(bar=0)
.
The fact that data-toc-label no longer affects anything is certainly a change, but I don't think it's a problem. -
It bloats the size of search index files by adding
"keywords": ""
to every heading. -
You're saying that "boost" doesn't really work anyway?
-
When using "material", the most widely used theme, people will be confused why this newly announced "keywords" feature doesn't work at all. That theme replaces the "search" plugin and simply will not have this feature.
Accordingly pushed a commit to revert the "keywords" feature. Let me know what you think. |
The problem is that without keywords the search term
Correct. That has always been the case and is an issue with the upstream project. I had forgotten that when I added the boost. We can either remove it or leave it. Doesn't matter either way to me.
So we can't make MkDocs better because a third party addon doesn't make use of that part of it? That seems like a strange argument to me. |
OK so there are several changes that this causes. They can be observed by building mkdocs' own site before & after. Here's output from diff -U3 <(jq <sitebefore/search/search_index.json) <(jq <siteafter/search/search_index.json) | delta --light These diffs are for the positive: - most importantly, it fixes the failure to escape angle brackets: However, I think this diff is for the negative:
With this, I am not confident that we really understand this change |
Discussion can continue on #3213, I'll post another finding there |
Yes, and that is why I added keywords. Lets take this diff: - documentation_pages
+ \n documentation_pages(*, inclusion: Callable[[InclusionLevel], bool] = InclusionLevel.is_included) -> Sequence[File]\n\n Both the before and after are bad for different reasons. Before is bad because that entry won't show up in a search for |
Here's what you get when searching for That is problematic for sure. Even if it's an abuse of functionality by mkdocstrings. But it will need to be hashed out first. Even though of course as you're saying in the current state it is difficult to find anything if you're searching for |
Let's run with that. Right now, today, without my proposed changes, on mkdocs.org, if you search for My change clearly gives better results. I don't understand why you think that it is worse. In your screenshot, the first result has the heading text which exactly matches the heading text of the linked section in the docs. That is what I expect and what this PR is trying to accomplish. Yet, nowhere have you expressed the idea that you disagree that that is better. So I don't understand what it is you think I'm doing or what you expect the search results should look like. Because, to me, your screen shot proves that my fix does exactly what it is supposed to do. And if mkdocstrings later adds support for keywords (by including search keywords in its output), then the search results would be even better. |
Thanks for tagging me!
I too find that @oprypin's screenshot looks good, even if the
Looks easy (defining the |
@pawamoy I'm not sure I understand what
We'll probably allow both, because the new search will allow to configure what metadata is searchable and eventually, how it is presented. We still have way to go, but we're getting there! Tracked in squidfunk/mkdocs-material#6307. Also note that we are considering moving the search plugin out of Material for MkDocs in the mid-term, and providing a unified search plugin for all themes with a nice UI and UX. This would however be a second step |
The reason for is it all layed out in the second half of the first comment. The keywords are invisible to the user (the visitor to the sight using the search feature). Because the underlying search tool (lunr.js) is not able to properly tokenize code in the search index, the document author can define search keywords which will ensure that the relevant results will be returned for searches.
That is great news. I always intended the existing search plugin to be a filler until a better solution came along. If your plugin can be made standalone, then I would be happy to see this one deprecated. That said, we should probably still go forward with some form of the fix here as this addresses a real problem with the existing plugin and it sounds like yours won't be available for some time. |
@waylan thanks for clarifying. I've read your OP and it was not apparent how (if) they are presented to the user.
Interestingly, this is the reason why I reworked the entire search plugin in the first place
Yes, it'll first be part of Material for MkDocs where we will test it, and we'll move it outside when stable. And yes, I currently can't put an ETA on it, but I'm working hard to make that happen. |
So I just looked at Material's plugin and noticed that you are defining a more sophisticated token separator. I had dismissed that in favor of manually defining keywords. Maybe I was too hasty in that decision. I think maybe I'll try that and see if I can eliminate the need for keywords. |
@waylan correct, but please note that this also requires a completely different tokenizer implementation in JavaScript. I've written this tokenizer from scratch, it is compatible with Lunr.js, and it is actually able to handle multi-character separators (which the default tokenizer does not). The new search has an even more sophisticated tokenizer, which will be open sourced in the coming months. I also recommend reading squidfunk/mkdocs-material#6307, which explains why we consider Lunr.js a dead end. |
Wow, I had no idea you were writing your own search engine. I'm happy to hear a lunr.js replacement is coming. As your new engine will be available for all themes in the future, I don't think it makes sense to get too crazy with the changes here. The keywords feature has already been removed and in my last commit I am cleaning up whitespace in the index. This is still an improvement of the search results, and in my testing, it seems that I can use a more sophisticated separator regex to get better results on my pages. I don't think it makes sense to default to a more complex regex as it is really only useful for API documentation and can be added to the config by users who need it. I think this is ready to go. |
MkDocs' search plugin indexes page sections by iterating over the sections of a page and adding an entry to the index for each section. It currently uses the title of a section's table of contents (TOC) entry as the title of the search index entry. As a reminder, the TOC of the page is provided by Markdown's
toc
extension which sanitizes a heading's content into plain text. In the typical case, it still works fine.However, the Markdown
toc
extension later added support for thedata-toc-label
attribute, which a user could assign to a heading (h1-6) element (presumably via theattr_list
extension). The content of that attribute would then be used as the label/title of the TOC item. The problem is that this allows users to introduce unsanitized (not plain) text into the title of a search index entry, which confuses search in weird ways.It is also possible for a third-party plugin to alter the TOC of a page by inserting markup into a TOC entry. For example, the
show_symbol_type_toc
option of mkdocsstrings-python specifically inserts spans into TOC titles. Again, this confuses search. Not only do search terms not return results as expected, but when the results are returned (due to a match in the body text), the title of the search result contains escaped HTML and is potentially confusing to the user (see Python-Markdown/markdown#1432).Note that the same (or similar) issues exist with the Page title, but that is the subject of #3357 and will presumably be addressed with a fix to that larger issue. Therefore, I am not address that here. Although, it is possible we could run the Page title through
striptags
as extra insurance. So I could add that here if desired.Given the above issues, a change needs to be made to ensure search is passed a sanitized version of the title of each section. There are at least 2 ways this could be done:
striptags
and pass the plain text result to the search index.While option 1 would be more consistent with current behavior, we could be indexing text that is different than what appears in the body of the page, which means section headings which should clearly be included in search results could be omitted (or visa-versa). Therefore, the search index should be passed a sanitized version of the actual heading, not the TOC title.
As it turns out, the existing code already strips tags from the headings of the document when building the search index, but then ignores those sanitized headings and uses the TOC title instead by using a section's
id
to associate the TOC title with the section. However, if we use the sanitized heading text, then we don't need to reference the TOC at all and any reference to it can be removed from the code.Given the above reasoning, I have implemented option 2 in this PR.
Consider the following examples:
# Some text {data-toc-label='different text'}
Both the previous behavior and option 1 above would be to index
different text
, and option 2 would indexSome text
. Clearly indexingSome text
makes more sense in this instance.# Some text {data-toc-label='<span class="foo">Some text</span>'}
The previous behavior would be to index
<span class="foo">Some text</span>
and both options 1 and 2 above would indexSome text
.# `foo(bar=0)` {data-toc-label='foo'}
Both the current behavior and option 1 would index
foo
. The stringbar
would not be indexed at all. With option 2, the stringfoo(bar=0)
would be indexed. But that isn't tokenized into separate words by the search index by default (in my testing the search termfoo(bar=0
was needed to get the entry returned as a search result). Of course, the search.separator config option allows one to define their own set of word separator characters. That could potentially result in better search results, and maintains the unmodified title in the search results displayed to the user. That said, using the word separator alone may not be as effective as one might think. According to its developer, lunr.js is not particularly well suited to tokenizing code (see olivernn/lunr.js#424 for at least one discussion about that).Therefore, I have added an additional new feature to my proposal:
Add an additional
keywords
field to the search index and add support for adata-search-keywords
attribute which can be defined on a heading element. When the attribute is defined on a heading, the contents of that attribute are assigned to thekeywords
field in the entry in the search index. A heading without the attribute would have no keywords assigned to that section and would continue to behave as-is. Consider the following example:# `foo(bar=0)` {data-toc-label='foo' data-search-keywords='foo bar'}
With the proposed change, the string
foo bar
would be indexed as keywords for the above example, which would result in reasonable search results. Initially I used the keywords as the title for the search index entry, but that caused the keywords to be used as the title in returned search results, which would be confusing to a user. Therefore, adding a new keywords field seemed like the best way to address this.I also discovered a minor bug where if a heading contained inline markup, then the entire text of the heading was not retained (which presumably went unnoticed for years because it wasn't actually being used). Specifically, each child element's text content would override/replace its previous sibling's and/or parent's text content. That bug has been fixed and a test added. Each child element's content is now appended to the the existing content of the
title
.Finally, I added
boost=10
to thekeywords
field of the search index so that results with matching keywords would get sorted higher in the results. Although, in my limited testing, the boost doesn't seem to make much difference. I have a vague recollection that similar observations were made in the past about the ineffectiveness of boost. 🤷