Towards better documentation search #6307

Open
8 of 24 tasks
squidfunk opened this issue Nov 7, 2023 · 32 comments
Labels
announcement Issue announces news or new features

Comments

@squidfunk
Owner

squidfunk commented Nov 7, 2023

Background

As you may have read in one of my recent comments, we're currently revising our search implementation. The current search is based on Lunr.js, which is also the search engine that MkDocs has been using since Material for MkDocs started in 2016. In the beginning, we felt that this was a good fit, as Lunr.js allows searching in the browser without the need for an external service. This makes deploying documentation much simpler, since search is and should always be a central component of each and every good documentation site.

In the past years, we've invested hundreds of hours into making search better. With the help of our awesome sponsors, we were able to ship rich search previews, support for more sophisticated tokenizers, support for Chinese, as well as better highlighting. Additionally, we made search almost twice as fast. However, in order to progress, and solve the many open issues that are related to search, we decided to throw out Lunr.js. There are several reasons for that, the most important of which is that it has been unmaintained since 2020. Additionally, Lunr.js only allows ranking with BM25, which is a good basis, but almost all issues that are related to weird rankings are caused by the fact that BM25 is not ideal for stable typeahead search. It was meant for full-word retrieval and is almost impossible to tame for the many different use cases that we've seen in the wild. Again, we've invested a lot of time to improve the situation, but we've reached a point where this doesn't make sense anymore.

This is the reason why we're currently releasing so few new features: we're putting our entire energy into finishing the new search implementation. We're already almost on par with Lunr.js' functionality, but now have an entirely modular architecture, which will allow us to swap out everything. Yes, I mean everything: the ranking algorithm, wildcard matching, the inverted index implementation, yada, yada, yada. Solving the documentation search problem is a personal affair for me. I really hate that there's not yet a solution that works reliably, can run anywhere, and is modular so it can be easily customized.

This is what we're building.

As you may already suspect, this is a pretty big project, which is why it is taking so long. We feel it is the perfect moment to venture into this problem, because we gathered a lot of use cases that we can now balance and optimize for. However, please understand that this takes time, so I kindly ask you to be a little more patient. Development on this project is after all 99% done by me, @squidfunk, and we're rewriting something that millions of users are using each and every day. That needs care.

Where we're currently at

First of all: search will be a separate, new project! This means you will be able to use the same engine in your other projects as well. Additionally, here's a non-exhaustive list of things we're planning to ship in the first version:

  • Modular engines: Search should not only allow searching for text in an inverted index, but also support new use cases like nearest neighbors on vector embeddings. We designed the new search so that multiple engines can be configured for the same set of documents, e.g. store text and title in an inverted index, and store embeddings in a vector store – all from the same document. They should then be searched and ranked together. Additionally, document fields can be tokenized differently, and the tokenizing algorithm can be based on a regular expression, or a function, allowing for maximum flexibility.

  • Powerful plugin system: Plugins are first-class citizens! The new search is completely modularized. For example, the inverted index itself does not compute scores – scoring is implemented as a plugin. This means alternative ranking plugins can be implemented. The plugin architecture is dead simple, but insanely powerful. From my current knowledge, I know of nothing that could not be implemented as a plugin.

  • Document metadata – authors should be able to configure which parts of document metadata should be included with the documents, so that documents can be indexed with custom metadata. Currently, only text, title, location and tags are included. The new search should allow configuring which fields are indexed and how, i.e., how they should render in search results, if they should render at all (think keywords or aliases), etc. This would also allow slicing the search into different sections, e.g. for the blog, API reference, etc., by allowing the author to render those as tabs in the search bar.

  • Better accuracy – the current implementation uses Lunr.js, which uses OR to combine terms. This is not ideal for document search, as users reported repeatedly that they expect to narrow the number of search results with more terms entered. The new search will make it easy to switch to AND as a default combinator.

  • Detect misspellings – if a typo is entered, e.g. instlal, the engine should detect the typo and correct it to install. Many engines support this, so we should find a way to do the same.

  • Offline-first – it goes without saying that one of the highest priorities is that search will keep working offline. The new search implementation will, of course, still not need a server.

  • Span queries – searches like "single page application" should be ranked higher when those words appear together. This removes the need for exact search within quotes, which is something many non-technical users don't even know is possible in search engines like Google or Bing. The goal is that entering a few words should be enough, no special syntax should be needed.

  • Compound words – the current search allows indexing words like PascalCase as Pascal and Case by using clever lookaheads, but it also means that searching for the entire term in lowercase (pascalcase) will not return any results. This should be fixed in a way that both can be found.
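
To make the "Better accuracy" and "Compound words" items above more concrete, here is a minimal Python sketch of a toy inverted index. It is only an illustration under my own assumptions, not the new engine: `tokenize`, `build_index`, `search` and the sample documents are hypothetical names made up for this example. Compound words are indexed both as their parts and as the whole lowercase term, and the combinator can be switched between AND and OR.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase tokens (hypothetical helper for this sketch).

    Compound words like PascalCase are emitted as their parts
    ("pascal", "case") and additionally as the whole term ("pascalcase").
    """
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        # Split at case boundaries, keeping acronyms like "HTML" together
        parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", word)
        tokens.extend(part.lower() for part in parts)
        if len(parts) > 1:
            tokens.append(word.lower())  # keep the entire term searchable
    return tokens

def build_index(docs):
    """Build a term -> set of document ids mapping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query, combinator="AND"):
    """Combine the posting sets of all query terms with AND or OR."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    if combinator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)

# Made-up sample documents
docs = {
    "setup": "How to install the PascalCase plugin",
    "usage": "Plugin usage and configuration",
    "faq": "Install problems",
}
index = build_index(docs)

print(sorted(search(index, "install plugin", "AND")))  # ['setup']
print(sorted(search(index, "install plugin", "OR")))   # ['faq', 'setup', 'usage']
print(sorted(search(index, "pascalcase")))             # ['setup']
print(sorted(search(index, "case")))                   # ['setup']
```

With AND, adding a term narrows the result set, which is what users reported they expect. Indexing the compound in both forms costs a few extra index entries per compound word.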

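For the "Detect misspellings" item in the list above, one simple way to sketch the idea is to look up the closest known term from the index vocabulary. The example below uses `difflib.get_close_matches` from the Python standard library. The vocabulary, the `correct_term` helper and the cutoff of 0.75 are assumptions chosen for illustration, and a real engine would likely use a dedicated edit-distance structure instead.

```python
import difflib

# Hypothetical vocabulary; in practice this would be the set of indexed terms
VOCABULARY = ["install", "instance", "plugin", "search", "configuration"]

def correct_term(term, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest known term, or the input unchanged if none is close."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term

print(correct_term("instlal"))  # install
print(correct_term("xyzzy"))    # xyzzy (no close match, left as is)
```
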
Here's a list of ideas, partially based on open change requests, which we will implement after the first version is out and has reached a stable state. We believe all of those features will be great additions:

  • Document hierarchy – the search index should be organized hierarchically, so that the explicit navigation structure and implicit table of contents hierarchy yield more context to search results, helping to disambiguate repetitive documentation.

  • Stemming and segmentation – of course, search should be multi-lingual and support language-specific stemming and text segmentation for logographic languages like Japanese and Chinese. We should check whether we can use browser-based APIs for text segmentation, or if not available, maybe fall back to a polyfill. Alternatively (or ideally?), segmentation could be done during build time, so that the payload shipped to the user is even smaller. Additionally, authors should be allowed to provide stopwords. Here's an interesting stemmer implementation.

  • Compact summaries – the current search indexes the HTML and divides it into blocks on the top-level. If a long list is contained, and a single word matches in that list, the entire list is rendered as part of the search results. This is not ideal, since the user has to scroll through a lot of irrelevant content. The new search should provide an intelligent summarization algorithm, possibly with a configurable way to detect endings of sentences and paragraphs.

  • Federated search – it should be possible to federate the search with other sites that are built with the same engine, so that a single MkDocs site and a federated search can be built from multiple MkDocs projects. This could also be applied to different versions. The author must be able to influence the rendering of federated results.

  • Caching – since we're re-architecting the entire search implementation, we can leverage caching, so that the index can be completely persisted and restored from memory without the need to rebuild it every time.

  • Deep linking – the entire search must be serializable to a URL query string, so that the query the user entered, as well as all filters that were selected can be directly linked to.

  • Adaptive rendering – The search result list should be much smaller than it currently is, only including text when the search has only a few results, adapting to what the user expects. When the user only enters a few characters, a lot of documents will be returned. The more characters are entered, the fewer results will be returned, and at some threshold, the document text should be shown. This threshold should be configurable and tunable.

  • Fuzzy finder – as opposed to the common tokenization and ranking with BM25, search should support indexing data like file paths, class names, attribute names, etc. with a fuzzy finder approach, similar to what IDEs like VS Code do when you're using autocomplete.

  • Allow using search as a component in Markdown – allow the user to embed search bars at arbitrary locations, possibly re-configured.

  • Rich results – not only code blocks should be renderable, but also Mermaid diagrams and code annotations.

  • Recommend term removal – if a search matches no results, recommend to the user which term can be removed.

  • Search history – we could allow preserving the search history, which means that users have an easy way to go back to previous search results without having to re-enter their queries. Entries in the search history could be cleared by the user one by one.

  • Synonyms – authors should be allowed to provide synonyms for specific words. We need to think of a good way to signal to the user that a synonym was found, or we just replace the word with the synonym in search results.

  • Index non-Markdown sources – It should be possible to index other contents alongside Markdown, including HTML, PDFs, etc., possibly with the help of plugins.

  • Arbitrary sections – authors should be allowed to add custom sections like Admonitions or tabs (or whatever) to the search, in order to provide an even more flexible structure.

  • Search separator testing – we should provide a method for authors to easily test the search separator on their site.
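
As a rough illustration of the "Deep linking" item above, the following Python sketch round-trips a query and its selected filters through a URL query string, using only `urllib.parse` from the standard library. The parameter names `q` and `filter.<name>` as well as both helper functions are assumptions made up for this example. The issue does not specify a URL format.

```python
from urllib.parse import urlencode, parse_qs

def serialize_search(query, filters):
    """Encode the query and all selected filters as a URL query string."""
    params = [("q", query)]
    for name in sorted(filters):  # stable order gives stable, shareable links
        for value in filters[name]:
            params.append((f"filter.{name}", value))
    return urlencode(params)

def deserialize_search(query_string):
    """Restore the query and the filters from a URL query string."""
    parsed = parse_qs(query_string)
    query = parsed.pop("q", [""])[0]
    filters = {
        key[len("filter."):]: values
        for key, values in parsed.items()
        if key.startswith("filter.")
    }
    return query, filters

link = serialize_search(
    "single page application",
    {"section": ["blog"], "tag": ["api", "search"]},
)
print(link)
# q=single+page+application&filter.section=blog&filter.tag=api&filter.tag=search
print(deserialize_search(link))
# ('single page application', {'section': ['blog'], 'tag': ['api', 'search']})
```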

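The "Fuzzy finder" item above can be sketched as subsequence matching: a candidate matches if all characters of the pattern appear in it in order, and matches that are consecutive or start a path segment score higher. This is a deliberately simplified Python sketch with greedy leftmost matching and made-up bonus values. It is not how VS Code or the new engine scores candidates, and `fuzzy_score` and `fuzzy_find` are hypothetical names.

```python
def fuzzy_score(pattern, candidate):
    """Score a candidate, or return None if pattern is not a subsequence of it."""
    pattern, text = pattern.lower(), candidate.lower()
    score, position, previous = 0, 0, -2
    for char in pattern:
        found = text.find(char, position)  # greedy: take the leftmost match
        if found == -1:
            return None
        score += 1
        if found == previous + 1:
            score += 2  # bonus: directly follows the previous match
        if found == 0 or text[found - 1] in "/._-":
            score += 3  # bonus: starts a path segment or word
        previous = found
        position = found + 1
    return score

def fuzzy_find(pattern, candidates):
    """Return all matching candidates, best score first."""
    scored = [(fuzzy_score(pattern, c), c) for c in candidates]
    matches = [(s, c) for s, c in scored if s is not None]
    return [c for s, c in sorted(matches, key=lambda m: (-m[0], m[1]))]

# Made-up file paths
paths = [
    "docs/setup/installation.md",
    "docs/reference/search_index.md",
    "src/search/index.ts",
]
print(fuzzy_find("srchidx", paths))
# ['src/search/index.ts', 'docs/reference/search_index.md']
```
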

This list is far from complete. We have so many more ideas, which we'll share when the time has come. We'll keep this issue updated, so feel free to subscribe or check back from time to time. We hope to push out the first candidate before the end of this year! Thank you for your patience and for your trust in Material for MkDocs.

@strausmann
Sponsor

Great ideas and great features for search. The most important thing for us is that search continues to work completely offline and without a web server. We use MkDocs for documentation that has to work offline on a plane or on a ship.

@squidfunk
Owner Author

@strausmann that is our priority. It will definitely work offline (our prototype already does), but we'll also add interesting new features like search federation (merging search indexes with other MkDocs sites) for which you obviously need to be online. All of those are optional and will degrade gracefully when offline, of course.

@strausmann
Sponsor

Search federation is of course one of the most interesting features for us too. Our documentation also runs on a web server, where several MkDocs instances run side by side for different topics. If the search for one MkDocs instance now also returns the contents of the other instances, that's brilliant. Of course, these instances then run on a web server in a closed environment.

@squidfunk
Owner Author

Thanks for sharing your setup – that sounds like a perfect test case once we have a prototype. If you like, you can subscribe to #5230 and give it a try once we have the first version out ☺️

@strausmann
Sponsor

We'd be very happy to test it. We are excited.

@squidfunk
Owner Author

squidfunk commented Nov 12, 2023

1st search preview is ready in #6321 – We encourage you to try it on your project and give feedback in #6321 ☺️

Note

It's still the same UI/UX, as we're currently focusing on internals. However, this PR fundamentally changes the search results, so we'd be interested to learn if you feel that it works better or worse in your documentation project. We'll be continuing to work on the internals and other parts mentioned in the OP while awaiting your feedback ☺️

@squidfunk
Owner Author

2nd search preview is ready in #6372 – We encourage you to try it on your project and give feedback in #6372 ☺️

Note

It's still the same UI/UX, as we're currently focusing on internals. However, this PR fundamentally changes the search results, so we'd be interested to learn if you feel that it works better or worse in your documentation project. We'll be continuing to work on the internals and other parts mentioned in the OP while awaiting your feedback ☺️

@AutonomousCat

I started using a Python library that uses MkDocs and this theme, and the search experience has been a bit more stressful than what I'm used to with Sphinx, so I'm glad to see a search overhaul has already started.

My number 1 feature request would be a dedicated search page, and an option for the search bar to take you to it. I feel like the small window approach cannot fit all projects, for example API wrappers, where the results are large, and rightly so. It simply takes too much effort to go through all the "<#> more on this page" links and scroll through, only to possibly pass what you're looking for multiple times because of the small area.

That's really my main issue with search.

@ctalr-jb

I'm loving the direction of this new search implementation. With the previews so far, I'm seeing a massive improvement in performance on larger sets of docs. Aside from the return of previous features, I'm definitely interested in seeing the "document metadata" and "federated search" ideas come to fruition for my own use cases.

@Aruelius

I have been using mkdocs-material for over a year, it's good, but the support for Chinese search is not perfect. Both jieba and Lunr.js have very limited support for Chinese, and I know you have also been working on improving Chinese search, thank you very much!

In fact, when I was preparing to write the 2.0 version of my project's documentation, I thought I would need to switch to a different framework, such as Nextra, VitePress, or dumi, because those frameworks natively support Chinese search.

But then I saw this issue and got excited. It gave me hope, and I'll wait for the better search to be released.

Merry Christmas and Happy New Year!

@squidfunk
Owner Author

squidfunk commented Dec 21, 2023

@Aruelius could you share some links to SSGs that support better Chinese search than Material for MkDocs? We're very interested in improving support, and checking some existing solutions is always a good idea. As we're writing everything from scratch, now is the best time to investigate. Please don't only share links to the SSGs, but to resources that explain how search works in those SSGs, i.e., documentation pages, blog posts, repositories. Thank you!

Also please understand that "Chinese search is not perfect" is very hard for me to turn into actionable items. I don't speak Chinese. I'm essentially trying to improve search for a language I don't understand. I will need support from Chinese speaking users. Let's create a better search experience together.

@squidfunk
Owner Author

squidfunk commented Dec 21, 2023

FWIW, a quick search surfaced that Vitepress supports two search providers:

We're aiming to build one of the most powerful search solutions that are Open Source (not like Algolia) and can run in the browser or on the Edge, but in-browser search just cannot compete with a hosted solution. That being said, if you would like to work together on this, I'd be happy to know exactly what you expect from Chinese search, what doesn't work correctly, and maybe if you found any Open Source solutions for this problem, because IMHO, to-date, Material for MkDocs is one of the very, very few SSGs that support Chinese search at all without a third-party service.

Nonetheless, we need to improve it!

@Aruelius

I'm happy to contribute everything I can do.

This is FlexSearch (https://github.com/nextapps-de/flexsearch), which is what Nextra uses. Maybe it can help you.


@Aruelius

Thanks for sharing! It's safe to say that we will account for this use case as well ☺️

Thank you~

@do-me

do-me commented Dec 29, 2023

Just linking #5483 for some ideas on how to implement semantic search without the need for a vector DB and model server, if loading 10-50 MB of resources is not a problem. If it is, stick to the "proper" setup with respective powerful infrastructure.
I'm thinking of creating an mkdocs plugin but could use some helping hands in case (comment on linked discussion) :)

@squidfunk
Owner Author

Thanks! Definitely interesting, but likely not possible in the browser alongside documentation that is shipped to users. 38MB download (as mentioned in the linked issue) is a no-no, but we have alternative ideas to explore ☺️

I'm thinking of creating an mkdocs plugin but could use some helping hands in case (comment on linked discussion) :)

If you want to go ahead, sure! We'll investigate this topic next year. Unfortunately, I have too much to do right now to help you, but once we tackle this, I'll post here, so everybody who is subscribed will be notified.

@do-me

do-me commented Dec 29, 2023

Agreed, both variants (client-only vs. server-client) have their tradeoffs (size/speed vs. cost/overhead). With the current hype and the recent hardware and software developments, I could well imagine something like an on-device inference server (with default pre-trained models) that could easily be hooked up to the browser or apps system-wide. Once such a system is in reach, we could maybe reevaluate, since then only the index file would need to be downloaded (similar to the normal Lunr search at the moment).

we have alternative ideas to explore ☺️

Excited for any kind of development here! :)

@syeda-git

@squidfunk is there a tentative date range for when this new search feature will be available?

@squidfunk
Owner Author

Please be assured that we are working hard on making the new search available as fast as possible, but it is a pretty big fish to fry – I'm essentially writing a search engine from scratch. You can help us finish it faster by sponsoring the project, because with more sponsorships, I can delegate more work to other individuals helping out on issues, discussions, questions, etc., and focus on pushing it forward.

Sadly, only a small fraction of the companies that use Material for MkDocs and actually make or save money off our work support it financially. A lot of companies only free-ride. This makes our work more tedious.

@lucaong

lucaong commented Jan 15, 2024

Local search, implemented using minisearch, which does not support Chinese

@squidfunk MiniSearch does support Chinese, although one has to provide a custom tokenizer for it, as explained for example here: lucaong/minisearch#201 (comment)

@squidfunk
Owner Author

squidfunk commented Jan 16, 2024

Thanks! Note that Intl.Segmenter is not supported in all browsers. Also, according to #6307 (comment), segmenting is not enough to provide a good experience. Infix search seems to be necessary, but we need to investigate.

@lucaong

lucaong commented Jan 16, 2024

Understood. I am not personally knowledgeable about supporting full-text search for the Chinese language, but I tried to make MiniSearch as configurable as possible. I would definitely be interested in understanding if there is any gap there that cannot be solved with configuration, as well as in a working MiniSearch configuration for Chinese to suggest to users.

Regarding infix search, that one is in fact a common request from users needing to support Chinese, and it can be done with MiniSearch (although the index will necessarily get larger). Here is a comment explaining how.

@squidfunk
Owner Author

Yeah, I'm having my troubles understanding Chinese as well 😅 Thanks for explaining how to implement infix search with MiniSearch. However, as you can see from the OP, we're actively working on a new search engine. The reason is that Chinese search is not the only thing we need to support, but we need a solution that is as modular and flexible as possible, and with 65 supported languages and more than 40k installations, we have a lot of use cases to cater to. We've something close to being in prototype stage, so the decision whether to use an existing solution like MiniSearch is already a done deal. Thank you for your understanding.

@Lexachoc

Lexachoc commented Feb 9, 2024

I am new to Material for MkDocs. The built-in search worked well for me until I had a Markdown page with symbols and LaTeX equations, as below:

| Symbol | Description |
| --- | --- |
| Ain | absorbance $A_{in}=-\ln[I/I_0]=-\ln\tau_{in}$ |
| $A_{in}$ | absorbance $A_{in}=-\ln[I/I_0]=-\ln\tau_{in}$ |

Using the browser's Ctrl+F with Ain, I can find the symbol in the first row but not in the second row.
In the search bar, however, neither row can be found by entering Ain. That's not intuitive to me.

So it would be very useful if the search bar had the ability to search for the sub (sup) string, like the built-in Ctrl+F in the browser, or even better, to search for LaTeX.

I would expect to enter Ain and get a preview with the rendered symbols (equations) instead of the LaTeX syntax.

@NFanoe

NFanoe commented Apr 26, 2024

Have you considered some kind of faceted search? When we search for something, we get a ton of API stuff first. It would be great to be able to filter that away, or filter it in, based on maybe a metadata tag or even just a path.

@squidfunk
Owner Author

Yes, filters (faceted search) will definitely be supported ☺️
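
To illustrate what filtering by a metadata tag or by path could look like, here is a small Python sketch that post-filters a list of search results. The shape of a result (`title`, `location`, `tags`) and the `filter_results` function are assumptions for this example only. The actual filter design of the new search has not been specified in this thread.

```python
def filter_results(results, include_tags=None, exclude_prefixes=()):
    """Keep results with at least one wanted tag, outside all excluded paths."""
    filtered = []
    for result in results:
        if any(result["location"].startswith(p) for p in exclude_prefixes):
            continue  # filter away, e.g. everything under "api/"
        if include_tags is not None and not set(result["tags"]) & set(include_tags):
            continue  # filter in: none of the requested tags present
        filtered.append(result)
    return filtered

# Made-up search results
results = [
    {"title": "Client.connect", "location": "api/client/", "tags": ["api"]},
    {"title": "Getting started", "location": "guide/start/", "tags": ["guide"]},
    {"title": "Release notes", "location": "blog/2024/", "tags": ["blog"]},
]

without_api = filter_results(results, exclude_prefixes=("api/",))
print([r["title"] for r in without_api])  # ['Getting started', 'Release notes']

only_api = filter_results(results, include_tags={"api"})
print([r["title"] for r in only_api])     # ['Client.connect']
```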
