Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML Search: omit anchor reference from document titles in the search index. #12047

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

jayaddison
Copy link
Contributor

Feature or Bugfix

  • Bugfix

Purpose

  • Adjusts the contents of the search-index provided to the client so that the browser correctly de-duplicates search results that link to the top-level title in documents.

Detail

Relates

@jayaddison jayaddison changed the title Draft: [HTML search] omit anchor reference from document titles in the search index. [HTML search] omit anchor reference from document titles in the search index. Mar 4, 2024
@jayaddison jayaddison marked this pull request as ready for review March 4, 2024 14:03
@wlach
Copy link
Contributor

wlach commented Mar 9, 2024

This seems preferable to my solution in #11942, just tested and it has the same results while making the search index smaller. 👍 from me.

@jayaddison jayaddison changed the title [HTML search] omit anchor reference from document titles in the search index. HTML Search: omit anchor reference from document titles in the search index. Mar 14, 2024
@jayaddison jayaddison added type:bug awaiting:response Waiting for a response from the author of this issue awaiting:review PR waiting for a review by a maintainer. and removed awaiting:response Waiting for a response from the author of this issue labels Mar 14, 2024
@picnixz
Copy link
Member

picnixz commented Mar 15, 2024

Fine for me (I'm not an expert in that search-related aspect so I'll leave it to you guys). Should I understand that this PR would replace #11942 ? @jayaddison ping me when you want it to be merged

@jayaddison
Copy link
Contributor Author

Fine for me (I'm not an expert in that search-related aspect so I'll leave it to you guys). Should I understand that this PR would replace #11942 ?

That's correct. I think there are three things that I don't like about this pull request, in order of priority:

  1. It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.
  2. I haven't been able to think of a way that this would happen yet, but in theory I expect some downstream tools that use the current index format could be broken by this - in particular the possibility of null values where previously everything was a string.
  3. Related to (2), the index format isn't as small as it could be. In raw text, the JavaScript empty string can be represented by either '' or "" (two characters), whereas a null value is represented by the token null (four characters). It may not matter a lot, but over a long enough duration of time the cumulative cost could be significant (or not! maybe it's a waste of time considering it).

@jayaddison
Copy link
Contributor Author

It doesn't offer JavaScript test coverage to demonstrate the fix -- only Python test coverage that confirms the expected index format.

I think I'll begin work on a small JavaScript refactor PR to make test coverage easier to add.

@picnixz
Copy link
Member

picnixz commented Mar 15, 2024

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

@jayaddison
Copy link
Contributor Author

particular the possibility of null values where previously everything was a string.

Is it possible to keep a string or is the null type needed?

If I remember correctly, this conditional needs to be adjusted if we're using empty strings, but otherwise it should be fine.

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

@jayaddison
Copy link
Contributor Author

What I would really like is a way to generate the indexes used in the JavaScript tests from the same Python code that builds searchindex.js when projects are built. It feels fragile that the test indexes are written by hand.

Moved into #12099.

@jayaddison
Copy link
Contributor Author

I'd like to check whether we can get #12102 in place before progressing this pull request further. If that can be added, then I think adding test coverage here will be much easier and more reliable (I'll be able to create a sample Sphinx project that returns duplicate search results, and add test coverage against that).

@jayaddison jayaddison marked this pull request as draft March 16, 2024 11:08
@jayaddison
Copy link
Contributor Author

(and maybe do the null to '' refactoring at the same time and more safely, given the test coverage)

@picnixz
Copy link
Member

picnixz commented Mar 17, 2024

Removing the "awaiting review" label until this PR is ready

@picnixz picnixz removed the awaiting:review PR waiting for a review by a maintainer. label Mar 17, 2024
@jayaddison
Copy link
Contributor Author

Removing the "awaiting review" label until this PR is ready

Oops, thanks. I forgot about that.

@jayaddison jayaddison mentioned this pull request Apr 9, 2024
@jayaddison
Copy link
Contributor Author

While I think it would be sensible to merge #12102 first (if-and-when that pull request is considered acceptable!), I'm going to remove the blocker label and draft status from this pull request.

@jayaddison jayaddison marked this pull request as ready for review May 9, 2024 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML Search: Contains duplicates based on title and content search
3 participants