Search: customize behavior with hooks #4980

flynneva · 2023-02-02T18:00:26Z

Context

Since v9.0.0, customizing search tokenization has been limited to changing the separator setting in the config, because support for custom search transform functions was removed.

Also since v9.0.0, we can choose from three lunr.js pipeline functions (stemmer, stopWordFilter, and trimmer) to modify the search pipeline.

Description

Instead of just limiting us to only those 3 options (stemmer, stopWordFilter, and trimmer), we should be able to specify our own pipeline function (ideally just like how we do so with emojis).

So something like this is what I had in mind:

search:
  separator: '\s'
  pipeline:
    - stemmer
    - trimmer
    - stopWordFilter
    - !!python/name:my_cool_package.search.my_cool_pipeline_function

Not sure if that is allowed or if we would be required to write it in ts or js, but you get the idea.

Use Cases

customizing search pipelines to fit different use cases
customizing tokenization for different languages
etc.

Visuals

No response

Before submitting

I have read and followed the change request guidelines.
I have verified that my idea is a change request and not a bug report.
I have ensured that, to the best of my knowledge, my idea will benefit the entire community.
I have included relevant links to the documentation, related issues and discussions to underline the need for my idea.


squidfunk · 2023-02-03T09:59:35Z

Thanks for suggesting. Could you please explain your desired outcome? "Customizing to fit different use cases" is too broad to be actionable. Also, customizing the search tokenizer for different languages can be implemented by using different separators for two different sites. In the discussion you linked, you mentioned:

For example on the mkdocs-material site if you search for test::search I'd expect it to return results about the two tokens (test and search) assuming the regex separator was applied to the query.

This was reported and fixed in #4884 and released in 9.0.7. The functionality for search transformation was removed because the approach was not general enough and did not scale well. I'm happy to add a new extension or transformation hook that allows intercepting the query (and maybe the results) and altering them before they are returned, but we need to collect some use cases before tackling that.

flynneva · 2023-02-06T17:19:38Z

Could you please explain your desired outcome?

@squidfunk sure: be able to modify the lunr.js pipeline to fit other use cases not covered by the default search pipeline - like this one: Two stage tokenization to add full strings to the index.

customizing the search tokenizer for different languages can be implemented by using different separators for two different sites

@squidfunk I don't think you can add tokens found that are separated by spaces and tokens found that follow some separator regex (two-stage tokenization) using the current separator approach, right?

I'm happy to add a new extension or transformation hook that allows intercepting the query (and maybe the results)

@squidfunk I'm confused by this statement. This is the "old way" of doing it. The "new way" would be to take advantage of the features that lunr.js provides (pipelines)...right? Not sure of the effort involved to enable us to customize our own pipelines, but that is what this issue is asking for.
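
For illustration, the two-stage tokenization mentioned above could look roughly like the following. This is a minimal sketch in plain TypeScript with no Lunr.js import; the function name `twoStageTokenize` and the default separator are hypothetical and not part of Material for MkDocs or Lunr.js.

```typescript
// Sketch only: the function name and the default separator are assumptions
// chosen for illustration, not an existing API.
function twoStageTokenize(text: string, separator: RegExp = /[\s:.\-]+/): string[] {
  const tokens: string[] = [];
  // Stage 1: split on whitespace only, so "test::search" stays whole.
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  for (const word of words) {
    const full = word.toLowerCase();
    if (tokens.indexOf(full) === -1) tokens.push(full);
    // Stage 2: split each word again on the configured separator.
    const parts = full.split(separator).filter((p) => p.length > 0);
    for (const part of parts) {
      if (tokens.indexOf(part) === -1) tokens.push(part);
    }
  }
  return tokens;
}
```

With this sketch, both the full string and its parts would end up in the token list, which a single separator regex cannot produce on its own.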

squidfunk · 2023-02-07T06:34:06Z

I'm confused by this statement. This is the "old way" of doing it. The "new way" would be to take advantage of the features that lunr.js provides (pipelines)...right? Not sure of the effort involved to enable us to customize our own pipelines, but that is what this issue is asking for.

I may have phrased that badly. Yes, it's the old way, but I was talking about rethinking that process, as it only allowed for query transformation and nothing else. Transformation is now done as part of the worker, so maybe we could provide hooks into different parts of the search index, possibly exposing one hook to alter Lunr.js before it starts indexing documents. This would effectively allow you to implement your own pipeline functions, with which you should be able to achieve what you were aiming for when we talked about the two-stage tokenization approach.

squidfunk · 2023-02-07T06:40:15Z

To expand on that: we moved transformation into the worker, so the worker is completely self-contained, i.e., defines all behavior. This makes integration of third-party search solutions simpler, as the application itself will apply no processing to the query before sending it to the worker. Before, query transformation was done in the application, then sent to the worker.

All of this made the current approach unfeasible, since it involves defining a function in the global scope that is called by the application if defined. We need a new approach for search transformation / extension, but before I start working on that, I want to verify that this is still something that is needed 😊 We'll add it back shortly. If you have other ideas that we should consider and requirements we need to fulfill, please share them here. So far we collected:

1. Transform the query before searching (e.g. fooBar -> foobar foo bar)
2. Register custom pipeline functions for expanding or filtering tokens before indexing
3. Add a "Hello World" guide on how to write a custom pipeline function
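
As a rough illustration of the first item, a query transformation of the form fooBar -> foobar foo bar might be sketched as below. The helper `expandCamelCase` is hypothetical and is not an existing hook in Material for MkDocs; it also ignores Lunr.js query operators such as field references, wildcards and boosts.

```typescript
// Hypothetical helper for illustration only; not an existing API.
function expandCamelCase(query: string): string {
  const out: string[] = [];
  for (const word of query.split(/\s+/)) {
    if (word.length === 0) continue;
    // Insert a space at every lower-to-upper boundary: "fooBar" -> "foo Bar".
    const parts = word
      .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
      .toLowerCase()
      .split(" ");
    // Always keep the lowercased full word first.
    out.push(word.toLowerCase());
    // Only append the parts if the word was actually split.
    if (parts.length > 1) {
      for (const part of parts) out.push(part);
    }
  }
  return out.join(" ");
}
```

Words without a case boundary pass through unchanged apart from lowercasing, so a plain query like "search" stays "search".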

flynneva · 2023-02-08T17:32:12Z

Transform the query before searching (e.g. fooBar -> foobar foo bar)

@squidfunk so according to the lunr.js docs I think pipelines do this as well, no?

From the lunr.js docs:

lunr.Pipelines maintain an ordered list of functions to be applied to all tokens in documents entering the search index and queries being ran against the index.

So I think only points 2 and 3 would need to be implemented:

Register custom pipeline functions for expanding or filtering tokens before indexing

Add a "Hello World" guide on how to write a custom pipeline function

And for point 3, you might just be able to link to the lunr.js docs, like you do for the pymdownx stuff.
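
To make the quoted description concrete, here is a minimal model of an ordered pipeline that is applied to both document tokens and query tokens. It is a stand-in that uses plain strings instead of lunr.Token objects, and every name in it (`PipelineFn`, `runPipeline`, `dropStopWords`, `splitOnColons`) is hypothetical; it only loosely mirrors the idea and is not Lunr.js code.

```typescript
// Stand-in for the idea behind lunr.Pipeline, using plain strings as tokens.
// A function may return one token, several tokens, or undefined to drop it.
type PipelineFn = (token: string) => string | string[] | undefined;

function runPipeline(fns: PipelineFn[], tokens: string[]): string[] {
  let current = tokens;
  // Functions run in order; each one sees the output of the previous one.
  for (const fn of fns) {
    const next: string[] = [];
    for (const token of current) {
      const result = fn(token);
      if (result === undefined) continue; // token is filtered out
      if (Array.isArray(result)) {
        for (const r of result) next.push(r); // token is expanded
      } else {
        next.push(result); // token is kept or replaced
      }
    }
    current = next;
  }
  return current;
}

// Example functions: a tiny stop word filter and a "::" splitter.
const dropStopWords: PipelineFn = (t) => (t === "the" ? undefined : t);
const splitOnColons: PipelineFn = (t) => {
  const parts = t.split("::").filter((p) => p.length > 0);
  return parts.length > 1 ? [t].concat(parts) : t;
};
```

Running the same list of functions over document text and over query terms is what would keep the two sides consistent, which is the property the quote describes.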

squidfunk · 2023-02-08T18:23:31Z

I'm not sure whether pipelines allow changing the entirety of the syntax, that is, Lunr.js field references and operators for boosting, as well as inclusion and exclusion. I think pipelines only allow removing, replacing, expanding or adding tokens. Thus, I believe that in the following query, only the terms in brackets are moved through the pipeline:

+title:[fooBar]* [fooBar]^2

This would not allow splitting or replacing meta characters, or introducing additional prefix or suffix wildcards. However, more research is needed. If you wish to dig into this, it would be awesome to get some intel. Otherwise, I'll do that later.
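
The assumption above, that only the bare term reaches the pipeline while operators stay untouched, can be modelled with a simplified sketch. This is not Lunr's query parser: the regular expression, the `Clause` shape and the function names are hypothetical, and the model handles only a single clause with an optional presence sign, field, trailing wildcard and boost.

```typescript
// Simplified stand-in for one query clause; not Lunr's actual parser.
interface Clause {
  presence: string; // "+", "-" or ""
  field: string;    // e.g. "title", or "" if none
  term: string;     // the only part a pipeline function would see
  suffix: string;   // trailing wildcard and/or boost, e.g. "*" or "^2"
}

function parseClause(clause: string): Clause | undefined {
  const pattern = /^([+-]?)(?:([A-Za-z_]+):)?([^*^~]+)((?:\*)?(?:[\^~]\d+)*)$/;
  const match = pattern.exec(clause);
  if (match === null) return undefined;
  return {
    presence: match[1] || "",
    field: match[2] || "",
    term: match[3] || "",
    suffix: match[4] || "",
  };
}

// Apply fn to the term only and reassemble the clause around it.
function mapTerm(clause: string, fn: (term: string) => string): string {
  const parsed = parseClause(clause);
  if (parsed === undefined) return clause; // leave unparsed clauses as-is
  const field = parsed.field.length > 0 ? parsed.field + ":" : "";
  return parsed.presence + field + fn(parsed.term) + parsed.suffix;
}
```

In this model a term-level function can lowercase or replace the term, but it has no way to add or remove the surrounding operators, which matches the limitation described above.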

squidfunk · 2023-11-07T14:10:35Z

Please see the announcement in #6307.

squidfunk · 2024-02-23T06:51:41Z

I've reopened #6632 which specifically requests to make PascalCase searchable as PascalCase, pascalcase and case – a shortcoming of the current implementation that was reported several times. I'm confident that this will make it into the next iteration of search, as I was able to quickly throw together a prototype. I'm leaving this issue open, since we also want to allow users to easily change the behavior of search with custom hooks.

squidfunk added the change request and needs input labels Feb 3, 2023

squidfunk changed the title ~~Modifiable search pipelines with custom pipeline function~~ Add customization hooks for search to alter behavior Feb 7, 2023

squidfunk removed the needs input label Feb 7, 2023

squidfunk changed the title ~~Add customization hooks for search to alter behavior~~ Search: customize behavior with hooks Aug 10, 2023

squidfunk mentioned this issue in Towards better documentation search #6307 Nov 7, 2023

squidfunk closed this as completed Nov 7, 2023

squidfunk reopened this Nov 7, 2023
