Towards better documentation search #6307

squidfunk · 2023-11-07T14:07:05Z

Background

As you may have read in one of my recent comments, we're currently revising our search implementation. The current search is based on Lunr.js, which is also the search engine that MkDocs has been using since Material for MkDocs started in 2016. In the beginning, we felt that this was a good fit, as Lunr.js allows searching in the browser without the need for an external service. This makes deploying documentation much simpler, since search is and should always be a central component of each and every good documentation site.

In the past years, we've invested hundreds of hours into making search better. With the help of our awesome sponsors, we were able to ship rich search previews, support for more sophisticated tokenizers, support for Chinese, as well as better highlighting. Additionally, we made search almost twice as fast. However, in order to progress and solve the many open issues that are related to search, we decided to throw out Lunr.js. There are several reasons for that, the most important of which is that it has been unmaintained since 2020. Additionally, Lunr.js only allows ranking with BM25, which is a good basis, but almost all issues that are related to weird rankings are caused by the fact that BM25 is not ideal for stable typeahead search. It was meant for full-word retrieval and is almost impossible to tame for the many different use cases that we've seen in the wild. Again, we've invested a lot of time to improve the situation, but we've reached a point where this doesn't make sense anymore.
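
For readers unfamiliar with BM25, the standard Okapi BM25 formulation is reproduced here as background. Lunr.js implements its own variant, so treat the exact form and constants as an assumption rather than a description of its code.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left( 1 - b + b\,\frac{|D|}{\mathrm{avgdl}} \right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \right)
```

Here $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, $n(q_i)$ is the number of documents containing $q_i$, and $k_1$ and $b$ are tuning parameters. Both factors are defined over whole indexed terms. One plausible reading of the instability described above is that, when a partially typed word is expanded by a wildcard, the set of matching terms (and with it the frequency and IDF values) can change with every keystroke.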

This is the reason why we're currently releasing so few new features: we're putting our entire energy into finishing the new search implementation. We're already almost on par with Lunr.js' functionality, but now have an entirely modular architecture, which will allow us to swap out everything. Yes, I mean everything: the ranking algorithm, wildcard matching, the inverted index implementation, yada, yada, yada. Solving the documentation search problem is a personal affair for me. I really hate that there's not yet a solution that works reliably, can run anywhere, and is modular so it can be easily customized.

This is what we're building.

As you may already suspect, this is a pretty big project, which is why it is taking so long. We feel it is the perfect moment to venture into this problem, because we have gathered a lot of use cases that we can now balance and optimize for. However, please understand that this takes time, so I kindly ask you to be a little more patient. Development on this project is after all 99% done by me, @squidfunk, and we're rewriting something that millions of users are using each and every day. That needs care.

Where we're currently at

First of all: search will be a separate, new project! This means you will be able to use the same engine in your other projects as well. Additionally, here's a non-exhaustive list of things we're planning to ship in the first version:

Modular engines: Search should not only allow searching for text in an inverted index, but also support new use cases like nearest neighbors on vector embeddings. We designed the new search so that multiple engines can be configured for the same set of documents, e.g. store text and title in an inverted index, and store embeddings in a vector store – all from the same document. They should then be searched and ranked together. Additionally, document fields can be tokenized differently, and the tokenizing algorithm can be based on a regular expression or a function, allowing for maximum flexibility.
- Search: Return Results for URLs #5936
Powerful plugin system: Plugins are first-class citizens! The new search is completely modularized. For example, the inverted index itself does not compute scores – scoring is implemented as a plugin. This means alternative ranking plugins can be implemented. The plugin architecture is dead simple, but insanely powerful. To my current knowledge, there is nothing that could not be implemented as a plugin.
- Search: customize behavior with hooks #4980
Document metadata – authors should be able to configure which parts of document metadata should be included with the documents, so that documents can be indexed with custom metadata. Currently, only text, title, location and tags are included. The new search should allow configuring which fields are indexed and how, i.e., how they should render in search results, if they should render at all (think keywords or aliases), etc. This would also allow slicing the search into different sections, e.g. for the blog, API reference, etc., by allowing the author to render those as tabs in the search bar.
Better accuracy – the current implementation uses Lunr.js, which uses OR to combine terms. This is not ideal for document search, as users reported repeatedly that they expect to narrow the number of search results with more terms entered. The new search will make it easy to switch to AND as a default combinator.
Detect misspellings – if a typo is entered, e.g. instlal, the engine should detect the typo and correct it to install. Many engines support this, so we should find a way to do the same.
Offline-first – it goes without saying that one of the highest priorities is that search will keep working offline. The new search implementation will, of course, still not need a server.
Span queries – searches like "single page application" should be ranked higher when those words appear together. This removes the need for exact search within quotes, which is something many non-technical users don't even know is possible in search engines like Google or Bing. The goal is that entering a few words should be enough, no special syntax should be needed.
Compound words – the current search allows indexing words like PascalCase as Pascal and Case by using clever lookaheads, but it also means that searching for the entire term in lowercase (pascalcase) will not return any results. This should be fixed in a way that both can be found.
- Search: find PascalCase, pascalcase and case at the same time #6632
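
To make two of the items above concrete (the AND/OR combinator and compound words), here is a minimal Python sketch of a toy inverted index. It is not the new engine and not Lunr.js: the names `tokenize`, `build_index` and `search`, and the sample documents, are hypothetical and exist only for illustration. The tokenizer indexes a compound word both as a whole (`pascalcase`) and as its parts (`pascal`, `case`), so all three forms can be found.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase tokens; compound words are kept whole AND split into parts."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        tokens.append(word.lower())
        # Split at case boundaries, e.g. "PascalCase" -> "Pascal", "Case"
        parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", word)
        if len(parts) > 1:
            tokens.extend(part.lower() for part in parts)
    return tokens


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str, combinator: str = "AND") -> set[str]:
    """Combine the posting sets of all query terms with AND or OR."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if combinator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


# Hypothetical sample documents
docs = {
    "a": "PascalCase naming",
    "b": "Use case studies",
    "c": "naming conventions",
}
index = build_index(docs)

print(sorted(search(index, "naming case", "AND")))  # ['a']
print(sorted(search(index, "naming case", "OR")))   # ['a', 'b', 'c']
print(sorted(search(index, "pascalcase")))          # ['a']
```

With AND, adding a second term narrows the result set from three documents to one, which is the behavior users reported expecting. Ranking is deliberately left out of this sketch.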

Here's a list of ideas, partially based on open change requests, which we will implement after the first version is out and reached a stable state. We believe all of those features will be great additions:

This list is far from complete. We have so many more ideas, which we'll share when the time has come. We'll keep this issue updated, so feel free to subscribe or check back from time to time. We hope to push out the first candidate before the end of this year! Thank you for your patience and for your trust in Material for MkDocs.


strausmann · 2023-11-09T06:38:34Z

Great ideas and great features for search. The most important thing for us is that search continues to work completely offline and without a web server. We use MkDocs for documentation, and it has to work offline on a plane or on a ship.

squidfunk · 2023-11-09T06:51:26Z

@strausmann that is our priority. It will definitely work offline (our prototype already does), but we'll also add interesting new features like search federation (merging search indexes with other MkDocs sites) for which you obviously need to be online. All of those are optional and will degrade gracefully when offline, of course.

strausmann · 2023-11-09T06:56:37Z

Search federation is of course one of the most interesting features for us too. The documentation also runs on a web server. Several MkDocs instances run side by side for different topics. If the search for one MkDocs instance now also returns the contents of the other instances, that's brilliant. Of course, these instances then run on a web server in a closed environment.

squidfunk · 2023-11-09T06:59:16Z

Thanks for sharing your setup – that sounds like a perfect test case once we have a prototype. If you like, you can subscribe to #5230 and give it a try once we have the first version out ☺️

strausmann · 2023-11-09T07:00:34Z

We would be very happy to test it. We are excited.

squidfunk · 2023-11-12T13:55:42Z

1st search preview is ready in #6321 – We encourage you to try it on your project and give feedback in #6321 ☺️

Note

It's still the same UI/UX, as we're currently focusing on internals. However, this PR fundamentally changes the search results, so we'd be interested to learn if you feel that it works better or worse in your documentation project. We'll be continuing to work on the internals and other parts mentioned in the OP while awaiting your feedback ☺️

squidfunk · 2023-11-20T14:06:07Z

2nd search preview is ready in #6372 – We encourage you to try it on your project and give feedback in #6372 ☺️

Note

It's still the same UI/UX, as we're currently focusing on internals. However, this PR fundamentally changes the search results, so we'd be interested to learn if you feel that it works better or worse in your documentation project. We'll be continuing to work on the internals and other parts mentioned in the OP while awaiting your feedback ☺️

AutonomousCat · 2023-11-30T00:20:38Z

I started using a Python library that uses MkDocs and this theme, and the search experience has been more stressful than what I'm used to with Sphinx, so I'm glad to see that a search overhaul has already started.

My number 1 feature request would be a dedicated search page, and an option for the search bar to take you to it. I feel like the small window approach cannot fit all projects, for example API wrappers, where the results are large, and rightly so. It simply takes too much effort to go through all the "<#> more on this page" entries and scroll through them, only to possibly pass what you're looking for multiple times because of the small area.

That's really my main issue with search.

ctalr-jb · 2023-12-18T15:29:39Z

I'm loving the direction of this new search implementation. With the previews so far, I'm seeing a massive improvement in performance on larger sets of docs. Aside from the return of previous features, I'm definitely interested in seeing the "document metadata" and "federated search" ideas come to fruition for my own use cases.

Aruelius · 2023-12-21T09:21:22Z

I have been using mkdocs-material for over a year. It's good, but the support for Chinese search is not perfect. Both jieba and Lunr.js have very limited support for Chinese, and I know you have also been working on improving Chinese search, thank you very much!

In fact, when I was preparing to write the 2.0 version of my project's documentation, I thought I would need to switch to a new framework, such as Nextra, VitePress, dumi etc., because those frameworks natively support Chinese search.

But then I saw this issue and got excited. It gave me hope, and I'll wait for the better search to be released.

Merry Christmas and Happy New Year!

squidfunk · 2023-12-21T09:51:38Z

@Aruelius could you share some links to SSGs that support better Chinese search than Material for MkDocs? We're very interested in improving support, and checking some existing solutions is always a good idea. As we're writing everything from scratch, now is the best time to investigate. Please don't only share links to the SSGs, but to resources that explain how search works in those SSGs, i.e., documentation pages, blog posts, repositories. Thank you!

Also please understand that "Chinese search is not perfect" is very hard for me to turn into actionable items. I don't speak Chinese. I'm essentially trying to improve search for a language I don't understand. I will need support from Chinese speaking users. Let's create a better search experience together.

squidfunk · 2023-12-21T09:58:29Z

FWIW, a quick search surfaced that VitePress supports two search providers:

Local search, implemented using minisearch, which does not support Chinese
Algolia search, which is a hosted solution (does not work offline or without an Internet connection) that supports Chinese and other languages, but has the drawback of being a third-party service. While their free offering for Open Source projects is nice (DocSearch), their paid solution is ridiculously expensive.

We're aiming to build one of the most powerful search solutions that are Open Source (not like Algolia) and can run in the browser or on the Edge, but in-browser search just cannot compete with a hosted solution. That being said, if you would like to work together on this, I'd be happy to know exactly what you expect from Chinese search, what doesn't work correctly, and maybe if you found any Open Source solutions for this problem, because IMHO, to-date, Material for MkDocs is one of the very, very few SSGs that support Chinese search at all without a third-party service.

Nonetheless, we need to improve it!

Aruelius · 2023-12-21T10:03:27Z

> @Aruelius could you share some links to SSGs that support better Chinese search than Material for MkDocs? We're very interested in improving support, and checking some existing solutions is always a good idea. As we're writing everything from scratch, now is the best time to investigate. Please don't only share links to the SSGs, but to resources that explain how search works in those SSGs, i.e., documentation pages, blog posts, repositories. Thank you!
>
> Also please understand that "Chinese search is not perfect" is very hard for me to turn into actionable items. I don't speak Chinese. I'm essentially trying to improve search for a language I don't understand. I will need support from Chinese speaking users. Let's create a better search experience together.

I'm happy to contribute everything I can.

This is FlexSearch (https://github.com/nextapps-de/flexsearch), which is what Nextra uses. Maybe it can help you.

Aruelius · 2023-12-21T12:49:28Z

> Thanks for sharing! It's safe to say that we will account for this use case as well ☺️

Thank you~

do-me · 2023-12-29T16:08:41Z

Just linking #5483 for some ideas on how to implement semantic search without the need for a vector DB and model server, if loading 10-50 MB of resources is not a problem. If it is, stick to the "proper" setup with the respective powerful infrastructure.
I'm thinking of creating an mkdocs plugin but could use some helping hands in case (comment on linked discussion) :)

squidfunk · 2023-12-29T16:14:58Z

Thanks! Definitely interesting, but likely not possible in the browser alongside documentation that is shipped to users. 38MB download (as mentioned in the linked issue) is a no-no, but we have alternative ideas to explore ☺️

> I'm thinking of creating an mkdocs plugin but could use some helping hands in case (comment on linked discussion) :)

If you want to go ahead, sure! We'll investigate this topic next year. Unfortunately, I have too much to do right now to help you, but once we tackle this, I'll post here, so everybody who is subscribed will be notified.

do-me · 2023-12-29T16:30:43Z

Agreed, both variants (client-only vs. client-server) have their tradeoffs (size/speed vs. cost/overhead). With the current hype and the recent hardware and software developments, I could well imagine something like an on-device inference server (with default pre-trained models) that could easily be hooked up to the browser or apps system-wise. Once such a system is in reach, we could maybe reevaluate, as only the index file would then need to be downloaded (similar to the normal Lunr search at the moment).

> we have alternative ideas to explore ☺️

Excited for any kind of development here! :)

syeda-git · 2024-01-12T13:58:34Z

@squidfunk is there a tentative date range that this new search feature would be available?

squidfunk · 2024-01-12T14:14:14Z

Please be assured that we are working hard on making the new search available as fast as possible, but it is a pretty big fish to fry – I'm essentially writing a search engine from scratch. You can support us in finishing it faster by sponsoring the project, because with more sponsorships, I can delegate more work to other individuals helping out on issues, discussions, questions, etc., and focus on pushing it forward.

Sadly, only a small fraction of the companies that use Material for MkDocs and actually make or save money off our work support it financially. A lot of companies only free-ride. This makes our work more tedious.

lucaong · 2024-01-15T12:37:21Z

> Local search, implemented using minisearch, which does not support Chinese

@squidfunk MiniSearch does support Chinese, although one has to provide a custom tokenizer for it, as explained for example here: lucaong/minisearch#201 (comment)

squidfunk · 2024-01-16T06:38:25Z

Thanks! Note that Intl.Segmenter is not supported in all browsers. Also, according to #6307 (comment), segmenting is not enough to provide a good experience. Infix search seems to be necessary, but we need to investigate.

lucaong · 2024-01-16T08:59:05Z

Understood. I am not personally knowledgeable about supporting full-text search for the Chinese language, but I tried to make MiniSearch as configurable as possible. I would definitely be interested in understanding if there is any gap there that cannot be solved with configuration, as well as in a working MiniSearch configuration for Chinese to suggest to users.

Regarding infix search, that one is in fact a common request from users needing to support Chinese, and it can be done with MiniSearch (although the index will necessarily get larger). Here is a comment explaining how.
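
As a generic illustration of why infix search makes the index larger, here is a minimal Python sketch of one common approach: index every suffix of each term, so that an infix query becomes a prefix lookup. This is an assumption about the general technique, not MiniSearch's API and not necessarily what the linked comment describes; `suffixes`, `build_suffix_index` and `infix_search` are hypothetical names.

```python
from collections import defaultdict


def suffixes(term: str) -> list[str]:
    """All non-empty suffixes of a term, e.g. "abc" -> ["abc", "bc", "c"]."""
    return [term[i:] for i in range(len(term))]


def build_suffix_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map every suffix of every term to the documents containing that term."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            for suffix in suffixes(term):
                index[suffix].add(doc_id)
    return index


def infix_search(index: dict[str, set[str]], query: str) -> set[str]:
    """A query is an infix of a term iff it is a prefix of one of its suffixes."""
    hits = set()
    # Linear scan for clarity; a real index would use a trie or sorted keys.
    for key, doc_ids in index.items():
        if key.startswith(query):
            hits |= doc_ids
    return hits


# Hypothetical sample data: one pre-segmented term per document
docs = {"zh": ["文档搜索"], "en": ["documentation"]}
index = build_suffix_index(docs)

print(infix_search(index, "搜索"))  # {'zh'}
print(infix_search(index, "ment"))  # {'en'}
print(len(index))                   # 17 keys for only 2 terms
```

The two terms produce 17 index keys (4 + 13 suffixes), which shows the size tradeoff mentioned above.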

squidfunk · 2024-01-17T01:03:27Z

Yeah, I'm having my troubles understanding Chinese as well 😅 Thanks for explaining how to implement infix search with MiniSearch. However, as you can see from the OP, we're actively working on a new search engine. The reason is that Chinese search is not the only thing we need to support, but we need a solution that is as modular and flexible as possible, and with 65 supported languages and more than 40k installations, we have a lot of use cases to cater to. We've something close to being in prototype stage, so the decision whether to use an existing solution like MiniSearch is already a done deal. Thank you for your understanding.

Lexachoc · 2024-02-09T19:28:06Z

I am new to Material for MkDocs. The built-in search works well until I have a Markdown page with symbols and LaTeX equations, as below:

Symbol	Description
A_in	absorbance $A_{in}=-\ln[I/I_0]=-\ln\tau_{in}$
$A_{in}$	absorbance $A_{in}=-\ln[I/I_0]=-\ln\tau_{in}$

Using the browser's Ctrl+F function with Ain, I can find the symbol in the first row, but not the one in the second row.
In the search bar, however, neither row can be found by entering Ain. That's not intuitive for me.

So it would be very useful if the search bar had the ability to search for the sub (sup) string, like the built-in Ctrl+F in the browser, or even better, to search for LaTeX.

I would expect to enter Ain and get a result preview with rendered symbols (equations) instead of the LaTeX syntax.
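
A rough Python sketch of what such matching could look like, assuming the indexer normalizes symbols before indexing. The helpers `normalize_symbol` and `matches` are hypothetical and not part of Material for MkDocs. They only strip math delimiters, sub/superscript markers and braces, so LaTeX commands such as `\ln` are left untouched.

```python
import re


def normalize_symbol(text: str) -> str:
    """Reduce A_in, $A_{in}$ and A<sub>in</sub> to the plain string "Ain"."""
    text = re.sub(r"</?su[bp]>", "", text)  # drop HTML <sub>/<sup> tags
    text = text.replace("$", "")            # drop inline math delimiters
    return re.sub(r"[_^{}]", "", text)      # drop sub/superscript markers and braces


def matches(query: str, text: str) -> bool:
    """Case-insensitive substring match on the normalized text."""
    return query.lower() in normalize_symbol(text).lower()


print(normalize_symbol("A_in"))      # Ain
print(normalize_symbol("$A_{in}$"))  # Ain
print(matches("Ain", "$A_{in}$"))    # True
```

Rendering the preview as an equation would still require the original LaTeX source, so an index would need to keep both forms.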

NFanoe · 2024-04-26T09:59:29Z

Have you considered some kind of faceted search? When we search for something, we get a ton of API stuff first. It would be great to be able to filter that away, or filter it in, based on maybe a metadata tag or even just a path.

squidfunk · 2024-04-26T10:29:38Z

Yes, filters (faceted search) will definitely be supported ☺️
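
To illustrate the kind of filtering asked about above, here is a minimal Python sketch that filters results by path prefix or by tag. How the new search will actually expose filters is not described in this thread, so `Result` and `filter_results` are hypothetical names and the sample results are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    location: str                  # page path, e.g. "api/client/"
    title: str
    tags: frozenset = frozenset()  # metadata tags attached to the page


def filter_results(
    results: list[Result],
    exclude_prefix: Optional[str] = None,
    require_tag: Optional[str] = None,
) -> list[Result]:
    """Keep results outside the excluded path that carry the required tag."""
    kept = []
    for result in results:
        if exclude_prefix and result.location.startswith(exclude_prefix):
            continue
        if require_tag and require_tag not in result.tags:
            continue
        kept.append(result)
    return kept


# Hypothetical search results
results = [
    Result("api/client/", "Client", frozenset({"api"})),
    Result("guide/setup/", "Setup", frozenset({"guide"})),
    Result("blog/release/", "Release", frozenset({"blog"})),
]

print([r.title for r in filter_results(results, exclude_prefix="api/")])  # ['Setup', 'Release']
print([r.title for r in filter_results(results, require_tag="api")])      # ['Client']
```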

squidfunk added the announcement Issue announces news or new features label Nov 7, 2023

squidfunk pinned this issue Nov 7, 2023

HonkingGoose mentioned this issue Nov 8, 2023

Weird search result order renovatebot/renovatebot.github.io#337

Open

squidfunk linked a pull request Nov 12, 2023 that will close this issue

Search: Research Preview 1 🧪 #6321

Closed

This was referenced Nov 14, 2023

Search: highlight the first result to indicate current selection #6333

Open

Projects plugin cache not being updated #6306

Closed

HonkingGoose mentioned this issue Nov 25, 2023

Search failures at https://www.eclipse.org/openj9/docs/ eclipse-openj9/openj9-docs#831

Open

squidfunk mentioned this issue Nov 29, 2023

Instant loading: returning to index page breaks search and other features #6275

Closed

4 tasks

squidfunk mentioned this issue Jan 11, 2024

Search: find PascalCase, pascalcase and case at the same time #6632

Open

4 tasks

wofsauge mentioned this issue Jan 11, 2024

search shouldnt be case sensitive TeamREPENTOGON/REPENTOGON#204

Open

squidfunk mentioned this issue Jan 12, 2024

Search: Research Preview 2 🧪 #6372

Closed

squidfunk mentioned this issue Feb 5, 2024

super-high cost when searching for documents apache/pekko#1097

Open

squidfunk mentioned this issue Feb 12, 2024

The info plugin does not includes inherited configurations #6750

Closed

4 tasks

unverbuggt mentioned this issue Feb 22, 2024

mkdocs-material & encrypted search: is it possible to leverage material's custom_dir to deploy your patched index.ts? unverbuggt/mkdocs-encryptcontent-plugin#66

Open

squidfunk mentioned this issue Mar 2, 2024

Search bar location in main page #6858

Closed

4 tasks

etiennebacher mentioned this issue Mar 4, 2024

Improve search on website pola-rs/r-polars#686

Open

StevenMaude mentioned this issue Mar 8, 2024

Search returns "raw" schemas before other schemas opensafely/documentation#1455

Open

AstreaTSS mentioned this issue Mar 9, 2024

[DOCS]: Improve Search Experience [ONGOING] interactions-py/interactions.py#1628

Open

This was referenced Mar 20, 2024

Preview: Live Edit – Feedback wanted! #2110

Open

Using only spaces as search tokenizer fails to process words with '-' character #6958

Closed

squidfunk mentioned this issue Apr 12, 2024

Sanitizing search entry titles mkdocs/mkdocs#3560

Open

squidfunk mentioned this issue Apr 21, 2024

mkdocs validation warnings when referencing sub-project file via absolute link #6879

Closed

4 tasks

squidfunk removed a link to a pull request Apr 26, 2024

Search: Research Preview 1 🧪 #6321

Closed

squidfunk mentioned this issue May 3, 2024

Allow selecting "icons" or "emojis" in the "Icons, Emojis" search box #6628

Closed

4 tasks

waylan mentioned this issue May 7, 2024

Break search plugin out into separate package mkdocs/mkdocs#3698

Open
