Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitizing search entry titles #3560

Open
wants to merge 20 commits into
base: master
Choose a base branch
from
Open

Sanitizing search entry titles #3560

wants to merge 20 commits into from

Conversation

waylan
Copy link
Member

@waylan waylan commented Jan 26, 2024

MkDocs' search plugin indexes page sections by iterating over the sections of a page and adding an entry to the index for each section. It currently uses the title of a section's table of contents (TOC) entry as the title of the search index entry. As a reminder, the TOC of the page is provided by Markdown's toc extension which sanitizes a heading's content into plain text. In the typical case, it still works fine.

However, the Markdown toc extension later added support for the data-toc-label attribute, which a user could assign to a heading (h1-6) element (presumably via the attr_list extension). The content of that attribute would then be used as the label/title of the TOC item. The problem is that this allows users to introduce unsanitized (not plain) text into the title of a search index entry, which confuses search in weird ways.

It is also possible for a third-party plugin to alter the TOC of a page by inserting markup into a TOC entry. For example, the show_symbol_type_toc option of mkdocsstrings-python specifically inserts spans into TOC titles. Again, this confuses search. Not only do search terms not return results as expected, but when the results are returned (due to a match in the body text), the title of the search result contains escaped HTML and is potentially confusing to the user (see Python-Markdown/markdown#1432).

Note that the same (or similar) issues exist with the Page title, but that is the subject of #3357 and will presumably be addressed with a fix to that larger issue. Therefore, I am not address that here. Although, it is possible we could run the Page title through striptags as extra insurance. So I could add that here if desired.

Given the above issues, a change needs to be made to ensure search is passed a sanitized version of the title of each section. There are at least 2 ways this could be done:

  1. Pass the TOC title through striptags and pass the plain text result to the search index.
  2. Retrieve the actual heading (h1-6) element, and pass the plain text result to the search index.

While option 1 would be more consistent with current behavior, we could be indexing text that is different than what appears in the body of the page, which means section headings which should clearly be included in search results could be omitted (or visa-versa). Therefore, the search index should be passed a sanitized version of the actual heading, not the TOC title.

As it turns out, the existing code already strips tags from the headings of the document when building the search index, but then ignores those sanitized headings and uses the TOC title instead by using a section's id to associate the TOC title with the section. However, if we use the sanitized heading text, then we don't need to reference the TOC at all and any reference to it can be removed from the code.

Given the above reasoning, I have implemented option 2 in this PR.

Consider the following examples:

# Some text {data-toc-label='different text'}

Both the previous behavior and option 1 above would be to index different text, and option 2 would index Some text. Clearly indexing Some text makes more sense in this instance.

# Some text {data-toc-label='<span class="foo">Some text</span>'}

The previous behavior would be to index <span class="foo">Some text</span> and both options 1 and 2 above would index Some text.

# `foo(bar=0)` {data-toc-label='foo'}

Both the current behavior and option 1 would index foo. The string bar would not be indexed at all. With option 2, the string foo(bar=0) would be indexed. But that isn't tokenized into separate words by the search index by default (in my testing the search term foo(bar=0 was needed to get the entry returned as a search result). Of course, the search.separator config option allows one to define their own set of word separator characters. That could potentially result in better search results, and maintains the unmodified title in the search results displayed to the user. That said, using the word separator alone may not be as effective as one might think. According to its developer, lunr.js is not particularly well suited to tokenizing code (see olivernn/lunr.js#424 for at least one discussion about that).

Therefore, I have added an additional new feature to my proposal:

Add an additional keywords field to the search index and add support for a data-search-keywords attribute which can be defined on a heading element. When the attribute is defined on a heading, the contents of that attribute are assigned to the keywords field in the entry in the search index. A heading without the attribute would have no keywords assigned to that section and would continue to behave as-is. Consider the following example:

# `foo(bar=0)` {data-toc-label='foo' data-search-keywords='foo bar'}

With the proposed change, the string foo bar would be indexed as keywords for the above example, which would result in reasonable search results. Initially I used the keywords as the title for the search index entry, but that caused the keywords to be used as the title in returned search results, which would be confusing to a user. Therefore, adding a new keywords field seemed like the best way to address this.

I also discovered a minor bug where if a heading contained inline markup, then the entire text of the heading was not retained (which presumably went unnoticed for years because it wasn't actually being used). Specifically, each child element's text content would override/replace its previous sibling's and/or parent's text content. That bug has been fixed and a test added. Each child element's content is now appended to the the existing content of the title.

Finally, I added boost=10 to the keywords field of the search index so that results with matching keywords would get sorted higher in the results. Although, in my limited testing, the boost doesn't seem to make much difference. I have a vague recollection that similar observations were made in the past about the ineffectiveness of boost. 🤷

@waylan
Copy link
Member Author

waylan commented Jan 26, 2024

I can't make out the issue with the 2 remaining mypy errors.

Copy link
Contributor

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no knowledge of the search plugin so I'll mostly just trust you on this one. I did notice several things that stood out to me generally though.

mkdocs/contrib/search/search_index.py Outdated Show resolved Hide resolved
mkdocs/contrib/search/search_index.py Outdated Show resolved Hide resolved
mkdocs/contrib/search/search_index.py Outdated Show resolved Hide resolved
mkdocs/contrib/search/search_index.py Outdated Show resolved Hide resolved
mkdocs/contrib/search/search_index.py Outdated Show resolved Hide resolved
@waylan
Copy link
Member Author

waylan commented Jan 29, 2024

The current lint errors appear to be related to a new markdownlint rule that was added in the most recent release. Presumably the various tables in the documentation need to be cleanup up and/or the default for the new rule needs to be changed to match the style used. Regardless, that has no effect on the changes here and I am ignoring it in this PR.

@waylan
Copy link
Member Author

waylan commented Jan 29, 2024

Thanks for the feedback. I have made all of the changes... except for the one where we are meeting lunr.js' API. I think this is ready to go.

Copy link
Contributor

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good!

Just to note:

Copy link
Contributor

@oprypin oprypin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been looking at the "keywords" feature and I'm really having second thoughts on it.

  1. I don't see how the changes in this pull request cause a need for this new feature. I don't think they do.
    For # `foo(bar=0)` the behavior both before and after is to index foo(bar=0).
    The fact that data-toc-label no longer affects anything is certainly a change, but I don't think it's a problem.

  2. It bloats the size of search index files by adding "keywords": "" to every heading.

  3. You're saying that "boost" doesn't really work anyway?

  4. When using "material", the most widely used theme, people will be confused why this newly announced "keywords" feature doesn't work at all. That theme replaces the "search" plugin and simply will not have this feature.

@oprypin
Copy link
Contributor

oprypin commented Mar 16, 2024

Accordingly pushed a commit to revert the "keywords" feature. Let me know what you think.

@waylan
Copy link
Member Author

waylan commented Mar 18, 2024

  1. I don't see how the changes in this pull request cause a need for this new feature. I don't think they do.
    For # `foo(bar=0)` the behavior both before and after is to index foo(bar=0).
    The fact that data-toc-label no longer affects anything is certainly a change, but I don't think it's a problem.

The problem is that without keywords the search term foo or even foo bar will not include the # `foo(bar=0)` section in the search results. The only search term which will get that section to be included in the search result is the search term foo(bar=0, which is not helpful to users. Keywords were added as a solution to that problem. With the keywords, then the search terms foo or foo bar will result in that section being included in the results.

3. You're saying that "boost" doesn't really work anyway?

Correct. That has always been the case and is an issue with the upstream project. I had forgotten that when I added the boost. We can either remove it or leave it. Doesn't matter either way to me.

4. When using "material", the most widely used theme, people will be confused why this newly announced "keywords" feature doesn't work at all. That theme replaces the "search" plugin and simply will not have this feature.

So we can't make MkDocs better because a third party addon doesn't make use of that part of it? That seems like a strange argument to me.

@oprypin oprypin added this to the 1.6.0 milestone Apr 9, 2024
@oprypin
Copy link
Contributor

oprypin commented Apr 10, 2024

OK so there are several changes that this causes. They can be observed by building mkdocs' own site before & after.

Here's output from

diff -U3 <(jq <sitebefore/search/search_index.json) <(jq <siteafter/search/search_index.json) | delta --light

These diffs are for the positive: - most importantly, it fixes the failure to escape angle brackets:

image


However, I think this diff is for the negative:

image

data-toc-label is indeed no longer used, as was intended by the author, but that affects mkdocstrings' results for the worse. @pawamoy


And there was also this one but it's a pretty easy fix, I'll push it now

image


With this, I am not confident that we really understand this change

@oprypin
Copy link
Contributor

oprypin commented Apr 10, 2024

Discussion can continue on #3213, I'll post another finding there

@oprypin oprypin removed this from the 1.6.0 milestone Apr 10, 2024
@waylan
Copy link
Member Author

waylan commented Apr 10, 2024

However, I think this diff is for the negative:

Yes, and that is why I added keywords. Lets take this diff:

- documentation_pages
+ \n        documentation_pages(*, inclusion: Callable[[InclusionLevel], bool] = InclusionLevel.is_included) -> Sequence[File]\n\n

Both the before and after are bad for different reasons. Before is bad because that entry won't show up in a search for InclusionLevel, Callable etc. After is also bad for the reasons explained in detail in previous comments. However, if a set of keywords are defined, then the entry will be returned for a search on any of they provided keywords. Quite frankly, I am tired of re-explaining this multiple times.

@oprypin
Copy link
Contributor

oprypin commented Apr 10, 2024

Here's what you get when searching for InclusionLevel with this change. (And it is not affected by my removal of the keywords feature, as it's entirely opt-in)

image

That is problematic for sure.

Even if it's an abuse of functionality by mkdocstrings. But it will need to be hashed out first.


Even though of course as you're saying in the current state it is difficult to find anything if you're searching for InclusionLevel

@waylan
Copy link
Member Author

waylan commented Apr 11, 2024

Here's what you get when searching for InclusionLevel with this change.

Let's run with that. Right now, today, without my proposed changes, on mkdocs.org, if you search for InclusionLevel, this is what you get:

MkDocs

My change clearly gives better results. I don't understand why you think that it is worse. In your screenshot, the first result has the heading text which exactly matches the heading text of the linked section in the docs. That is what I expect and what this PR is trying to accomplish. Yet, nowhere have you expressed the idea that you disagree that that is better. So I don't understand what it is you think I'm doing or what you expect the search results should look like. Because, to me, your screen shot proves that my fix does exactly what it is supposed to do.

And if mkdocstrings later adds support for keywords (by including search keywords in its output), then the search results would be even better.

@pawamoy
Copy link
Sponsor Contributor

pawamoy commented Apr 12, 2024

Thanks for tagging me!

My change clearly gives better results.

I too find that @oprypin's screenshot looks good, even if the title values containing whitespace are a bit weird.

And if mkdocstrings later adds support for keywords (by including search keywords in its output), then the search results would be even better.

Looks easy (defining the data-search-keywords HTML attribute), though with both the built-in search plugin and Material for MkDocs own search plugin starting to diverge, I'm not sure how much work it will represent for mkdocstrings to support both. Tagging @squidfunk just to make sure he gets a chance to learn about this keywords feature.

@squidfunk
Copy link
Sponsor Contributor

@pawamoy I'm not sure I understand what keywords is supposed to be for, and how it is used and presented in the search. Something similar was proposed for Material for MkDocs own search plugin in squidfunk/mkdocs-material#3174, and we're considering implementing it, albeit we've not yet spec'd out how we want the DX to be. For instance:

We'll probably allow both, because the new search will allow to configure what metadata is searchable and eventually, how it is presented. We still have way to go, but we're getting there! Tracked in squidfunk/mkdocs-material#6307. Also note that we are considering moving the search plugin out of Material for MkDocs in the mid-term, and providing a unified search plugin for all themes with a nice UI and UX. This would however be a second step ☺️

@waylan
Copy link
Member Author

waylan commented Apr 12, 2024

I'm not sure I understand what keywords is supposed to be for, and how it is used and presented in the search.

The reason for is it all layed out in the second half of the first comment. The keywords are invisible to the user (the visitor to the sight using the search feature). Because the underlying search tool (lunr.js) is not able to properly tokenize code in the search index, the document author can define search keywords which will ensure that the relevant results will be returned for searches.

we are considering moving the search plugin out of Material for MkDocs in the mid-term, and providing a unified search plugin for all themes with a nice UI and UX.

That is great news. I always intended the existing search plugin to be a filler until a better solution came along. If your plugin can be made standalone, then I would be happy to see this one deprecated. That said, we should probably still go forward with some form of the fix here as this addresses a real problem with the existing plugin and it sounds like yours won't be available for some time.

@squidfunk
Copy link
Sponsor Contributor

@waylan thanks for clarifying. I've read your OP and it was not apparent how (if) they are presented to the user.

Because the underlying search tool (lunr.js) is not able to properly tokenize code in the search index, the document author can define search keywords which will ensure that the relevant results will be returned for searches.

Interestingly, this is the reason why I reworked the entire search plugin in the first place ☺️ Material for MkDocs search can parse HTML and render (inline) code blocks without problems. If you feel that this makes sense to integrate into MkDocs, feel free to steal anything you want from Material for MkDocs. If you need assistance where to find things, let me know.

That is great news. I always intended the existing search plugin to be a filler until a better solution came along. If your plugin can be made standalone, then I would be happy to see this one deprecated. That said, we should probably still go forward with some form of the fix here as this addresses a real problem with the existing plugin and it sounds like yours won't be available for some time.

Yes, it'll first be part of Material for MkDocs where we will test it, and we'll move it outside when stable. And yes, I currently can't put an ETA on it, but I'm working hard to make that happen.

@waylan
Copy link
Member Author

waylan commented Apr 12, 2024

So I just looked at Material's plugin and noticed that you are defining a more sophisticated token separator. I had dismissed that in favor of manually defining keywords. Maybe I was too hasty in that decision. I think maybe I'll try that and see if I can eliminate the need for keywords.

@squidfunk
Copy link
Sponsor Contributor

squidfunk commented Apr 12, 2024

@waylan correct, but please note that this also requires a completely different tokenizer implementation in JavaScript. I've written this tokenizer from scratch, it is compatible with Lunr.js, and it is actually able to handle multi-character separators (which the default tokenizer does not). The new search has an even more sophisticated tokenizer, which will be open sourced in the coming months.

I also recommend reading squidfunk/mkdocs-material#6307, which explains why we consider Lunr.js a dead end.

@waylan
Copy link
Member Author

waylan commented Apr 12, 2024

Wow, I had no idea you were writing your own search engine. I'm happy to hear a lunr.js replacement is coming. As your new engine will be available for all themes in the future, I don't think it makes sense to get too crazy with the changes here.

The keywords feature has already been removed and in my last commit I am cleaning up whitespace in the index. This is still an improvement of the search results, and in my testing, it seems that I can use a more sophisticated separator regex to get better results on my pages. I don't think it makes sense to default to a more complex regex as it is really only useful for API documentation and can be added to the config by users who need it.

I think this is ready to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ampersands seen in search results
4 participants