Sanitizing search entry titles #3560

waylan · 2024-01-26T19:47:00Z

Fixes Ampersands seen in search results #3213

MkDocs' search plugin indexes page sections by iterating over the sections of a page and adding an entry to the index for each section. It currently uses the title of a section's table of contents (TOC) entry as the title of the search index entry. As a reminder, the TOC of the page is provided by Markdown's toc extension which sanitizes a heading's content into plain text. In the typical case, it still works fine.

However, the Markdown toc extension later added support for the data-toc-label attribute, which a user could assign to a heading (h1-6) element (presumably via the attr_list extension). The content of that attribute would then be used as the label/title of the TOC item. The problem is that this allows users to introduce unsanitized (not plain) text into the title of a search index entry, which confuses search in weird ways.

It is also possible for a third-party plugin to alter the TOC of a page by inserting markup into a TOC entry. For example, the show_symbol_type_toc option of mkdocsstrings-python specifically inserts spans into TOC titles. Again, this confuses search. Not only do search terms not return results as expected, but when the results are returned (due to a match in the body text), the title of the search result contains escaped HTML and is potentially confusing to the user (see Python-Markdown/markdown#1432).

Note that the same (or similar) issues exist with the Page title, but that is the subject of #3357 and will presumably be addressed with a fix to that larger issue. Therefore, I am not address that here. Although, it is possible we could run the Page title through striptags as extra insurance. So I could add that here if desired.

Given the above issues, a change needs to be made to ensure search is passed a sanitized version of the title of each section. There are at least 2 ways this could be done:

Pass the TOC title through striptags and pass the plain text result to the search index.
Retrieve the actual heading (h1-6) element, and pass the plain text result to the search index.

While option 1 would be more consistent with current behavior, we could be indexing text that is different than what appears in the body of the page, which means section headings which should clearly be included in search results could be omitted (or visa-versa). Therefore, the search index should be passed a sanitized version of the actual heading, not the TOC title.

As it turns out, the existing code already strips tags from the headings of the document when building the search index, but then ignores those sanitized headings and uses the TOC title instead by using a section's id to associate the TOC title with the section. However, if we use the sanitized heading text, then we don't need to reference the TOC at all and any reference to it can be removed from the code.

Given the above reasoning, I have implemented option 2 in this PR.

Consider the following examples:

# Some text {data-toc-label='different text'}

Both the previous behavior and option 1 above would be to index different text, and option 2 would index Some text. Clearly indexing Some text makes more sense in this instance.

# Some text {data-toc-label='<span class="foo">Some text</span>'}

The previous behavior would be to index <span class="foo">Some text</span> and both options 1 and 2 above would index Some text.

# `foo(bar=0)` {data-toc-label='foo'}

Both the current behavior and option 1 would index foo. The string bar would not be indexed at all. With option 2, the string foo(bar=0) would be indexed. But that isn't tokenized into separate words by the search index by default (in my testing the search term foo(bar=0 was needed to get the entry returned as a search result). Of course, the search.separator config option allows one to define their own set of word separator characters. That could potentially result in better search results, and maintains the unmodified title in the search results displayed to the user. That said, using the word separator alone may not be as effective as one might think. According to its developer, lunr.js is not particularly well suited to tokenizing code (see olivernn/lunr.js#424 for at least one discussion about that).

Therefore, I have added an additional new feature to my proposal:

Add an additional keywords field to the search index and add support for a data-search-keywords attribute which can be defined on a heading element. When the attribute is defined on a heading, the contents of that attribute are assigned to the keywords field in the entry in the search index. A heading without the attribute would have no keywords assigned to that section and would continue to behave as-is. Consider the following example:

# `foo(bar=0)` {data-toc-label='foo' data-search-keywords='foo bar'}

With the proposed change, the string foo bar would be indexed as keywords for the above example, which would result in reasonable search results. Initially I used the keywords as the title for the search index entry, but that caused the keywords to be used as the title in returned search results, which would be confusing to a user. Therefore, adding a new keywords field seemed like the best way to address this.

I also discovered a minor bug where if a heading contained inline markup, then the entire text of the heading was not retained (which presumably went unnoticed for years because it wasn't actually being used). Specifically, each child element's text content would override/replace its previous sibling's and/or parent's text content. That bug has been fixed and a test added. Each child element's content is now appended to the the existing content of the title.

Finally, I added boost=10 to the keywords field of the search index so that results with matching keywords would get sorted higher in the results. Although, in my limited testing, the boost doesn't seem to make much difference. I have a vague recollection that similar observations were made in the past about the ineffectiveness of boost. 🤷

waylan · 2024-01-26T21:33:42Z

I can't make out the issue with the 2 remaining mypy errors.

oprypin

I have no knowledge of the search plugin so I'll mostly just trust you on this one. I did notice several things that stood out to me generally though.

mkdocs/contrib/search/search_index.py

Co-authored-by: Oleh Prypin <oleh@pryp.in>

waylan · 2024-01-29T15:40:55Z

The current lint errors appear to be related to a new markdownlint rule that was added in the most recent release. Presumably the various tables in the documentation need to be cleanup up and/or the default for the new rule needs to be changed to match the style used. Regardless, that has no effect on the changes here and I am ignoring it in this PR.

waylan · 2024-01-29T16:35:43Z

Thanks for the feedback. I have made all of the changes... except for the one where we are meeting lunr.js' API. I think this is ready to go.

oprypin

This is good!

Just to note:

I intend to merge this myself
I intend to first check Ensure page titles from rendered Markdown do not contain any markup #3546

oprypin

I've been looking at the "keywords" feature and I'm really having second thoughts on it.

I don't see how the changes in this pull request cause a need for this new feature. I don't think they do.
For # `foo(bar=0)` the behavior both before and after is to index foo(bar=0).
The fact that data-toc-label no longer affects anything is certainly a change, but I don't think it's a problem.
It bloats the size of search index files by adding "keywords": "" to every heading.
You're saying that "boost" doesn't really work anyway?
When using "material", the most widely used theme, people will be confused why this newly announced "keywords" feature doesn't work at all. That theme replaces the "search" plugin and simply will not have this feature.

oprypin · 2024-03-16T13:12:38Z

Accordingly pushed a commit to revert the "keywords" feature. Let me know what you think.

waylan · 2024-03-18T14:50:07Z

I don't see how the changes in this pull request cause a need for this new feature. I don't think they do.
For # `foo(bar=0)` the behavior both before and after is to index foo(bar=0).
The fact that data-toc-label no longer affects anything is certainly a change, but I don't think it's a problem.

The problem is that without keywords the search term foo or even foo bar will not include the # `foo(bar=0)` section in the search results. The only search term which will get that section to be included in the search result is the search term foo(bar=0, which is not helpful to users. Keywords were added as a solution to that problem. With the keywords, then the search terms foo or foo bar will result in that section being included in the results.

3. You're saying that "boost" doesn't really work anyway?

Correct. That has always been the case and is an issue with the upstream project. I had forgotten that when I added the boost. We can either remove it or leave it. Doesn't matter either way to me.

4. When using "material", the most widely used theme, people will be confused why this newly announced "keywords" feature doesn't work at all. That theme replaces the "search" plugin and simply will not have this feature.

So we can't make MkDocs better because a third party addon doesn't make use of that part of it? That seems like a strange argument to me.

oprypin · 2024-04-10T19:45:42Z

OK so there are several changes that this causes. They can be observed by building mkdocs' own site before & after.

Here's output from

diff -U3 <(jq <sitebefore/search/search_index.json) <(jq <siteafter/search/search_index.json) | delta --light

These diffs are for the positive: - most importantly, it fixes the failure to escape angle brackets:

However, I think this diff is for the negative:

data-toc-label is indeed no longer used, as was intended by the author, but that affects mkdocstrings' results for the worse. @pawamoy

And there was also this one but it's a pretty easy fix, I'll push it now

With this, I am not confident that we really understand this change

oprypin · 2024-04-10T19:55:17Z

Discussion can continue on #3213, I'll post another finding there

waylan · 2024-04-10T20:00:50Z

However, I think this diff is for the negative:

Yes, and that is why I added keywords. Lets take this diff:

- documentation_pages
+ \n        documentation_pages(*, inclusion: Callable[[InclusionLevel], bool] = InclusionLevel.is_included) -> Sequence[File]\n\n

Both the before and after are bad for different reasons. Before is bad because that entry won't show up in a search for InclusionLevel, Callable etc. After is also bad for the reasons explained in detail in previous comments. However, if a set of keywords are defined, then the entry will be returned for a search on any of they provided keywords. Quite frankly, I am tired of re-explaining this multiple times.

oprypin · 2024-04-10T20:23:46Z

Here's what you get when searching for InclusionLevel with this change. (And it is not affected by my removal of the keywords feature, as it's entirely opt-in)

That is problematic for sure.

Even if it's an abuse of functionality by mkdocstrings. But it will need to be hashed out first.

Even though of course as you're saying in the current state it is difficult to find anything if you're searching for InclusionLevel

waylan · 2024-04-11T14:06:19Z

Here's what you get when searching for InclusionLevel with this change.

Let's run with that. Right now, today, without my proposed changes, on mkdocs.org, if you search for InclusionLevel, this is what you get:

My change clearly gives better results. I don't understand why you think that it is worse. In your screenshot, the first result has the heading text which exactly matches the heading text of the linked section in the docs. That is what I expect and what this PR is trying to accomplish. Yet, nowhere have you expressed the idea that you disagree that that is better. So I don't understand what it is you think I'm doing or what you expect the search results should look like. Because, to me, your screen shot proves that my fix does exactly what it is supposed to do.

And if mkdocstrings later adds support for keywords (by including search keywords in its output), then the search results would be even better.

pawamoy · 2024-04-12T11:58:12Z

Thanks for tagging me!

My change clearly gives better results.

I too find that @oprypin's screenshot looks good, even if the title values containing whitespace are a bit weird.

And if mkdocstrings later adds support for keywords (by including search keywords in its output), then the search results would be even better.

Looks easy (defining the data-search-keywords HTML attribute), though with both the built-in search plugin and Material for MkDocs own search plugin starting to diverge, I'm not sure how much work it will represent for mkdocstrings to support both. Tagging @squidfunk just to make sure he gets a chance to learn about this keywords feature.

squidfunk · 2024-04-12T13:19:05Z

@pawamoy I'm not sure I understand what keywords is supposed to be for, and how it is used and presented in the search. Something similar was proposed for Material for MkDocs own search plugin in squidfunk/mkdocs-material#3174, and we're considering implementing it, albeit we've not yet spec'd out how we want the DX to be. For instance:

Search: index document by custom metadata squidfunk/mkdocs-material#3174 (comment) – here, keywords are basically like aliases, so not rendered
Search: index document by custom metadata squidfunk/mkdocs-material#3174 (comment) – here, keywords are used in metadata, albeit presentation is not clear

We'll probably allow both, because the new search will allow to configure what metadata is searchable and eventually, how it is presented. We still have way to go, but we're getting there! Tracked in squidfunk/mkdocs-material#6307. Also note that we are considering moving the search plugin out of Material for MkDocs in the mid-term, and providing a unified search plugin for all themes with a nice UI and UX. This would however be a second step ☺️

waylan · 2024-04-12T13:46:29Z

I'm not sure I understand what keywords is supposed to be for, and how it is used and presented in the search.

The reason for is it all layed out in the second half of the first comment. The keywords are invisible to the user (the visitor to the sight using the search feature). Because the underlying search tool (lunr.js) is not able to properly tokenize code in the search index, the document author can define search keywords which will ensure that the relevant results will be returned for searches.

we are considering moving the search plugin out of Material for MkDocs in the mid-term, and providing a unified search plugin for all themes with a nice UI and UX.

That is great news. I always intended the existing search plugin to be a filler until a better solution came along. If your plugin can be made standalone, then I would be happy to see this one deprecated. That said, we should probably still go forward with some form of the fix here as this addresses a real problem with the existing plugin and it sounds like yours won't be available for some time.

squidfunk · 2024-04-12T13:59:01Z

@waylan thanks for clarifying. I've read your OP and it was not apparent how (if) they are presented to the user.

Because the underlying search tool (lunr.js) is not able to properly tokenize code in the search index, the document author can define search keywords which will ensure that the relevant results will be returned for searches.

Interestingly, this is the reason why I reworked the entire search plugin in the first place ☺️ Material for MkDocs search can parse HTML and render (inline) code blocks without problems. If you feel that this makes sense to integrate into MkDocs, feel free to steal anything you want from Material for MkDocs. If you need assistance where to find things, let me know.

That is great news. I always intended the existing search plugin to be a filler until a better solution came along. If your plugin can be made standalone, then I would be happy to see this one deprecated. That said, we should probably still go forward with some form of the fix here as this addresses a real problem with the existing plugin and it sounds like yours won't be available for some time.

Yes, it'll first be part of Material for MkDocs where we will test it, and we'll move it outside when stable. And yes, I currently can't put an ETA on it, but I'm working hard to make that happen.

waylan · 2024-04-12T15:13:35Z

So I just looked at Material's plugin and noticed that you are defining a more sophisticated token separator. I had dismissed that in favor of manually defining keywords. Maybe I was too hasty in that decision. I think maybe I'll try that and see if I can eliminate the need for keywords.

squidfunk · 2024-04-12T15:38:34Z

@waylan correct, but please note that this also requires a completely different tokenizer implementation in JavaScript. I've written this tokenizer from scratch, it is compatible with Lunr.js, and it is actually able to handle multi-character separators (which the default tokenizer does not). The new search has an even more sophisticated tokenizer, which will be open sourced in the coming months.

I also recommend reading squidfunk/mkdocs-material#6307, which explains why we consider Lunr.js a dead end.

waylan · 2024-04-12T17:46:41Z

Wow, I had no idea you were writing your own search engine. I'm happy to hear a lunr.js replacement is coming. As your new engine will be available for all themes in the future, I don't think it makes sense to get too crazy with the changes here.

The keywords feature has already been removed and in my last commit I am cleaning up whitespace in the index. This is still an improvement of the search results, and in my testing, it seems that I can use a more sophisticated separator regex to get better results on my pages. I don't think it makes sense to default to a more complex regex as it is really only useful for API documentation and can be added to the config by users who need it.

I think this is ready to go.

waylan and others added 5 commits January 25, 2024 16:17

Eliminate use of toc for search index

3a2a1a4

update tests

e483a33

add keywords as separate search field

5014795

update docs

05b01df

Merge remote-tracking branch 'origin/master' into search

b46a9c9

waylan mentioned this pull request Jan 26, 2024

API docs are not being properly indexed for search Python-Markdown/markdown#1432

Open

waylan added 3 commits January 26, 2024 15:34

Black cleanup

5685261

lint cleanup

1469b7f

fix typo

6bcceda

Fix lint errors

8b4f1c1

oprypin reviewed Jan 27, 2024

View reviewed changes

waylan and others added 2 commits January 29, 2024 08:35

Update mkdocs/contrib/search/search_index.py

ecfbe0f

Co-authored-by: Oleh Prypin <oleh@pryp.in>

fix permalink identification

ea39290

waylan added 2 commits January 29, 2024 10:56

only call dict(attrs) when needed

e04c5b7

improve comments

60e0aa3

Merge remote-tracking branch 'origin/master' into search

d7cf12f

oprypin approved these changes Jan 29, 2024

View reviewed changes

This was referenced Feb 3, 2024

Header and navigation sidebars display incorrectly arbitrary markup #3357

Closed

Fix MkDocs Search ZeroIntensity/view.py#150

Open

oprypin suggested changes Mar 16, 2024

View reviewed changes

oprypin added 2 commits March 16, 2024 14:06

Merge remote-tracking branch 'origin/master' into search

b7f332f

Revert the "keywords" feature

bb6785d

oprypin added this to the 1.6.0 milestone Apr 9, 2024

Merge remote-tracking branch 'origin/master' into search

b8d487e

Don't add a section with no id

7433028

oprypin mentioned this pull request Apr 10, 2024

Ampersands seen in search results #3213

Open

oprypin removed this from the 1.6.0 milestone Apr 10, 2024

waylan added 2 commits April 12, 2024 13:29

Cleanup whitespace

1350022

Merge remote-tracking branch 'upstream/master' into search

bded183

waylan mentioned this pull request May 7, 2024

Break search plugin out into separate package #3698

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sanitizing search entry titles #3560

Sanitizing search entry titles #3560

waylan commented Jan 26, 2024 •

edited by oprypin

waylan commented Jan 26, 2024

oprypin left a comment

waylan commented Jan 29, 2024

waylan commented Jan 29, 2024

oprypin left a comment •

edited

oprypin left a comment

oprypin commented Mar 16, 2024

waylan commented Mar 18, 2024

oprypin commented Apr 10, 2024

oprypin commented Apr 10, 2024

waylan commented Apr 10, 2024 •

edited

oprypin commented Apr 10, 2024

waylan commented Apr 11, 2024

pawamoy commented Apr 12, 2024 •

edited

squidfunk commented Apr 12, 2024

waylan commented Apr 12, 2024

squidfunk commented Apr 12, 2024

waylan commented Apr 12, 2024

squidfunk commented Apr 12, 2024 •

edited

waylan commented Apr 12, 2024

Sanitizing search entry titles #3560

Are you sure you want to change the base?

Sanitizing search entry titles #3560

Conversation

waylan commented Jan 26, 2024 • edited by oprypin

waylan commented Jan 26, 2024

oprypin left a comment

Choose a reason for hiding this comment

waylan commented Jan 29, 2024

waylan commented Jan 29, 2024

oprypin left a comment • edited

Choose a reason for hiding this comment

oprypin left a comment

Choose a reason for hiding this comment

oprypin commented Mar 16, 2024

waylan commented Mar 18, 2024

oprypin commented Apr 10, 2024

oprypin commented Apr 10, 2024

waylan commented Apr 10, 2024 • edited

oprypin commented Apr 10, 2024

waylan commented Apr 11, 2024

pawamoy commented Apr 12, 2024 • edited

squidfunk commented Apr 12, 2024

waylan commented Apr 12, 2024

squidfunk commented Apr 12, 2024

waylan commented Apr 12, 2024

squidfunk commented Apr 12, 2024 • edited

waylan commented Apr 12, 2024

waylan commented Jan 26, 2024 •

edited by oprypin

oprypin left a comment •

edited

waylan commented Apr 10, 2024 •

edited

pawamoy commented Apr 12, 2024 •

edited

squidfunk commented Apr 12, 2024 •

edited