HTML Search: Fix duplicate results #11942

wlach · 2024-02-04T18:47:34Z

Subject: Fix duplicate search results

Feature or Bugfix

Bugfix

Purpose

We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.

Fix this by not using the anchor in the search result if it's an exact match on the document title. This ensures that each search result will appear at most one time.

Detail

An easy way to reproduce the bug is to do a title search: the first result will link to the section, the second will link to the page in general. It appears that the "remove duplicates" feature was added some time ago, perhaps before there were multiple ways in which documents were searched.

https://pradyunsg.me/furo/search/?q=quickstart&check_keywords=yes&area=default

Note that it appears the main sphinx documentation uses readthedocs' custom search, so this bug wouldn't be reproducible there.

Relates

Closes #11961

wlach · 2024-02-04T18:52:48Z

@AA-Turner Sorry for the direct ping again, but here's another hopefully obvious search fix in the vein of #11770

sphinx/themes/basic/static/searchtools.js

jayaddison · 2024-02-04T21:54:50Z

sphinx/themes/basic/static/searchtools.js

-            titles[file] !== title ? `${titles[file]} > ${title}` : title,
-            id !== null ? "#" + id : "",
+            isDocumentTitle ? title : `${titles[file]} > ${title}`,
+            isDocumentTitle ? "" : sectionId, // don't use the section id if we matched on the exact document title (creates duplicates below)


Is there always an entry in foundTitles that is the title of the document? (and does it have a special/fixed id value if so?)

Is there always an entry in foundTitles that is the title of the document?

Should be, assuming that every document has a title and this code is doing what it says it should:

https://github.com/wlach/sphinx/blob/27f4d3605989ea2216c5f370b07833a501de7267/sphinx/search/__init__.py#L397

(and does it have a special/fixed id value if so?)

The id will just be a regular section id (i.e. something like to3-automated-python-2-to-3-code-translation in https://docs.python.org/3/library/2to3.html)

Ok. Basically I'm wondering whether all of the title elements from foundTitles are already in the foundEntries index -- and if so, whether we could omit the foundTitles loop entirely.

Ok. Basically I'm wondering whether all of the title elements from foundTitles are already in the foundEntries index -- and if so, whether we could omit the foundTitles loop entirely.

It seems that it is not possible to omit that loop. The only way to find nested sections (A > B) seems to be from results in foundTitles (I tested with query quick guide for the Sphinx self-build doc dir).

Yeah index entries are their own concept, different from titles:

https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#index-generating-markup
https://coderefinery.github.io/sphinx-lesson/indexing/#indexing

So if I understand: titles have a two-level hierarchy -- top-level titles, and then items found 'within' that title. That index is created from the section headings in RST documents.

Meanwhile entries are flat / have no hierarchy. That index is created from directives, options, domain entities and so on.

And in cases where a title and an entry overlap, we want to de-duplicate.

And in cases where a title and an entry overlap, we want to de-duplicate.

To clarify that: in cases where a top-level title and an entry without an anchor overlap (?).

Hrm, something's off about my understanding here.

✅ I can search the Sphinx documentation for translating with sphinx-intl, and I see a second-level title result: Internationalization > Translating with sphinx-intl.

❌ But I cannot seem to search for using weblate -- if I scroll I eventually find a single, top-level Internationalization ~~entry~~ result -- but I would have expected to find Internationalization > Using Weblate service for team translation.

I'm going to take a break from reviewing for a while @wlach - I'll be back within the next few days. Thank you for looking into this.

❌ But I cannot seem to search for using weblate -- if I scroll I eventually find a single, top-level Internationalization entry result -- but I would have expected to find Internationalization > Using Weblate service for team translation.

"using weblate" is too short a search term for it to match (since the full title is 42 characters long, more than double the size of the section item):

https://github.com/wlach/sphinx/blob/8daf8c997340e67aa77db3a78a2140eaebb8a3ab/sphinx/themes/basic/static/searchtools.js#L293

If you search for e.g. "Using weblate service" you'd get the result you expected above.

I'm going to take a break from reviewing for a while @wlach - I'll be back within the next few days. Thank you for looking into this.

No problem, thank you! I may have limited time/opportunity to revise this PR for a few weeks after this Wednesday, but I won't forget about it. Also welcome maintainer changes / updates.

sphinx/themes/basic/static/searchtools.js

Fix a number of small bugs that this picked up along the way. Intended to eventually be updated after sphinx-doc#11942 is landed (but seperating out for now for ease-of-review).

wlach · 2024-02-05T15:20:44Z

Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.

jayaddison · 2024-02-05T17:49:32Z

Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.

About the test case: it's OK -- in fact much preferred! -- to invert the logic in that so that it fails until your fix from #11942 is applied to it. That helps people to understand the nature of the bug, and demonstrates how the fix addresses it.

Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while:

Ensure that each bug has a description somewhere (I count four you've uncovered, with one issue opened so far).
Open a refactor PR to make the Search.query function testable - based on your work so far that seems to be a requirement. That PR shouldn't change any behaviour, and it (hopefully!) should be uncontroversial and easy for us to review and accept.
Open either a combined PR, or one PR to resolve each bug (as you prefer -- as a reviewer, I'd like one per bug, but am OK with either). There should be at least one test case corresponding to each bug, and they should fail without the fix, and succeed with the fix in place.

I'm more-or-less up-to-date with your changes and I think they'll make a big improvement to Sphinx search quality -- all the more reason to get them included and credited in the changelog. Even so, I think total-time-to-merge may be minimized if we spend a little more time up-front to describe and separate each bugfix.

Let me know if I can help to explain anything about my recommended bugfix workflow or the reasoning behind it. There could also be things in the contribution documentation for us to update (we're not strict about bugreports before patches, but I do think they'll be helpful here).

wlach · 2024-02-06T13:57:00Z

Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while:
...

Hi yeah, that more or less makes sense. I didn't realize this would be quite the rabbit hole it was when I filed this initial PR otherwise I would have been a bit more systematic in my approach. I'll do something approximately like what you suggest (I think I might not bother with the seperate refactoring PR). One of the issues is that this issue kind of depends on the other two I found yesterday being fixed to be reasonably testable.

sphinx/themes/basic/static/searchtools.js

tests/js/searchtools.js

wlach · 2024-02-06T15:31:56Z

@jayaddison Hey, revised this PR to reference a specific issue as you suggested. It includes changes from other PRs for ease-of-testing, but that should go away once those are merged. Hopefully those are uncontroversial.

jayaddison · 2024-02-06T22:09:39Z

sphinx/themes/basic/static/searchtools.js

-            titles[file] !== title ? `${titles[file]} > ${title}` : title,
-            id !== null ? "#" + id : "",
+            isDocumentTitle ? title : `${titles[file]} > ${title}`,
+            isDocumentTitle ? "" : anchor, // don't provide an anchor if we matched on the exact document title (creates duplicates below)


I don't like this approach, because it creates a disparity between the code that queries titles compared to the code that queries entries.

The way I mentally model this is that we have a series of pipeline inputs -- currently two of them -- that produce results, and each of them passes results to a single filtering function (the de-duplication).

In my opinion it's easier to reason about the overall behaviour of the system when all of the pipelines behave as-near-identically as feasible.

So I'd recomment pushing the fix to the de-duplication phase. Your test coverage should still pass regardless of where the fix is applied.

I think this makes sense, just modified my approach to not de-duplicate on anchor, just on filename, title, and description (the latter is just for object definitions). This will consider two entries in the same document with the same section title but different anchors duplicates, but I think that's probably acceptable. If that's the case I don't think the extra result is going to be particularly meaningful.

The alternative would be to pass down whether a result refers to the document title or not and then ignore the anchor if so, but that seemed significantly more complicated for little benefit.

Interesting case where we have identical titles but unique sections:

https://docs.sqlalchemy.org/en/20/search.html?q=sql&check_keywords=yes&area=default

The top results are:

https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.0-sql
https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.4-sql
https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.5-sql

All of them have the same title 0.3 Changelog > sql. With this change, we'd only select the top result. I think that would probably actually be kind of desirable? Prevents the ridiculous proliferation of nearly-identical search results (6551 on that page as of this writing)

I changed my mind on this after thinking about it a bit more. I don't think it's a good idea to pull out valid unique search results out, so I went back and decided to do things the hard way -- I added an "isDocumentTitle" property to the search results, and omitted the anchor id from the de-duplication key if that's the case.

The end result passes the test, though it is a little awkward. Maybe there's a cleaner way of expressing this.

I've opened wlach#1 with approximately what I had in mind for this - it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).

Basically what it does is drop the title from the result de-duplication key.

I've opened wlach#1 with approximately what I had in mind for this - it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).

Hmm, not sure this will work -- see my comment there. I might be missing something, but I don't think there's any way around special casing the "title" result.

it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).

Might be worth filing another issue for this. 😄

wlach · 2024-02-27T14:19:41Z

Update: @jayaddison and I had a back and forth about different approaches in wlach#1.

After all that discussion, I feel pretty ok about the approach in this PR and feel like it's reasonable to land as-is (as stated in the header, #11960 and #11958 should go in first, then this can be rebased on top of them) though I'm still willing to make minor adjustments if necessary.

We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.

wlach · 2024-03-03T19:42:34Z

Ok, I rebased this on top of all the work that landed so I think this is ready for review/landing on its own now.

In addition to the unit test, I also tested this end-to-end by running make docs target=html after installing Sphinx locally in a virtualenv.

Example output before:

(note double "Welcome")

Example output after:

(note only one "Welcome" result, pointing at the header)

jayaddison · 2024-03-04T01:57:28Z

@wlach could you try searching for configuration and check what kind of results you get?

When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org

sphinx-doc.org (baseline)

current branch

It's possible that this is a configuration setting related to the way that the sphinx-docs.org site is deployed. Either way, it'd be nice to confirm.

jayaddison · 2024-03-04T09:41:05Z

sphinx/themes/basic/static/searchtools.js

@@ -304,10 +309,13 @@ const Search = {
      if (title.toLowerCase().trim().includes(queryLower) && (queryLower.length >= title.length/2)) {
        for (const [file, id] of foundTitles) {
          let score = Math.round(100 * queryLower.length / title.length)
+          let anchorIsDocumentTitle = titles[file] === title


Could we pre-calculate this when building the alltitles in Python, so that the client doesn't have to transform each entry every time?

This appears theoretically possible. I've drafted some sample code to illustrate that alternative here: #12047 - however it does adjust the contents of the index, and that could be considered a disadvantage.

jayaddison · 2024-03-05T15:51:53Z

@wlach could you try searching for configuration and check what kind of results you get?

When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org

sphinx-doc.org (baseline)

[trimmed]

It's possible that this is a configuration setting related to the way that the sphinx-docs.org site is deployed. Either way, it'd be nice to confirm.

Well, it turns out that this was an unfair baseline to compare against.

The sphinx-doc.org site seems to be hosted by readthedocs, and the relevant Sphinx extension adds some custom code to use their search API in preference to Sphinx's own search index.

Running a search for configuration from the latest current Sphinx commit at the time of writing here (ae51974) produces:

...so basically identical results, but including many duplicates.

jayaddison

Personally I have a preference for fixing this in the Python code so that the JavaScript clients don't have to do repeat work here. But I'm OK with the code here, and it achieves the intended result, so I'm going to approve this.

wlach · 2024-03-09T15:40:12Z

Personally I have a preference for fixing this in the Python code so that the JavaScript clients don't have to do repeat work here. But I'm OK with the code here, and it achieves the intended result, so I'm going to approve this.

I agree FWIW. My preference would be to merge your solution and close this. Will wait for a maintainer to weigh in though.

picnixz · 2024-03-17T09:43:43Z

I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.

wlach · 2024-03-18T01:24:43Z

I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.

Ok, I'm just going to close this to reduce confusion. The code is still here if we need to try something else later, but that seems pretty unlikely at this point.

wlach changed the title ~~Fix duplicate search results~~ HTML Search: Fix duplicate results Feb 4, 2024

wlach marked this pull request as ready for review February 4, 2024 18:51

jayaddison reviewed Feb 4, 2024

View reviewed changes

sphinx/themes/basic/static/searchtools.js Outdated Show resolved Hide resolved

jayaddison added the html search label Feb 4, 2024

jayaddison mentioned this pull request Feb 4, 2024

HTML search results: identical summaries displayed for different link targets #11943

Closed

wlach mentioned this pull request Feb 4, 2024

HTML Search: Use anchor for search preview #11944

Merged

jayaddison reviewed Feb 4, 2024

View reviewed changes

sphinx/themes/basic/static/searchtools.js Outdated Show resolved Hide resolved

jayaddison reviewed Feb 4, 2024

View reviewed changes

sphinx/themes/basic/static/searchtools.js Show resolved Hide resolved

wlach mentioned this pull request Feb 5, 2024

HTML Search: Add test for querying #11950

Closed

wlach commented Feb 6, 2024

View reviewed changes

sphinx/themes/basic/static/searchtools.js Outdated Show resolved Hide resolved

wlach commented Feb 6, 2024

View reviewed changes

sphinx/themes/basic/static/searchtools.js Outdated Show resolved Hide resolved

wlach commented Feb 6, 2024

View reviewed changes

tests/js/searchtools.js Show resolved Hide resolved

wlach requested a review from jayaddison February 6, 2024 15:32

jayaddison reviewed Feb 6, 2024

View reviewed changes

wlach requested a review from jayaddison February 7, 2024 02:42

This was referenced Feb 7, 2024

Upstream issue #11961: alternative de-duplication approach wlach/sphinx#1

Closed

[HTML search] named captions have empty targets (anchors) in the titles index #11965

Closed

jayaddison mentioned this pull request Feb 27, 2024

HTML Search: Fix partial matches overwriting full matches #11958

Merged

jayaddison added type:bug javascript Pull requests that update Javascript code labels Mar 2, 2024

Fix duplicate search results

8fcfc22

We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.

wlach added 2 commits March 3, 2024 14:07

De-duplicate on document title only

37a2241

Update CHANGES.rst

772119c

wlach force-pushed the fix-duplicate-search-results branch 4 times, most recently from 0446152 to 949e131 Compare March 3, 2024 19:21

Further work to clarify, fix tests

d85c6ca

wlach force-pushed the fix-duplicate-search-results branch from 949e131 to d85c6ca Compare March 3, 2024 19:27

Docs lint

cc0e0ad

One last comment tweak

51e56a0

jayaddison reviewed Mar 4, 2024

View reviewed changes

jayaddison mentioned this pull request Mar 4, 2024

HTML Search: omit anchor reference from document titles in the search index. #12047

Open

jayaddison approved these changes Mar 5, 2024

View reviewed changes

jayaddison requested a review from jakobandersen March 6, 2024 01:21

jayaddison added awaiting:response Waiting for a response from the author of this issue awaiting:review PR waiting for a review by a maintainer. and removed awaiting:response Waiting for a response from the author of this issue labels Mar 14, 2024

Merge branch 'master' into fix-duplicate-search-results

b4b8900

picnixz self-requested a review March 17, 2024 09:43

picnixz approved these changes Mar 17, 2024

View reviewed changes

picnixz added priority:high and removed awaiting:review PR waiting for a review by a maintainer. labels Mar 17, 2024

wlach closed this Mar 18, 2024

github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML Search: Fix duplicate results #11942

HTML Search: Fix duplicate results #11942

wlach commented Feb 4, 2024 •

edited

wlach commented Feb 4, 2024

jayaddison Feb 4, 2024

wlach Feb 4, 2024

jayaddison Feb 4, 2024

jayaddison Feb 4, 2024

wlach Feb 4, 2024

jayaddison Feb 4, 2024

jayaddison Feb 4, 2024

jayaddison Feb 4, 2024 •

edited

jayaddison Feb 4, 2024

wlach Feb 5, 2024

wlach commented Feb 5, 2024

jayaddison commented Feb 5, 2024

wlach commented Feb 6, 2024

wlach commented Feb 6, 2024

jayaddison Feb 6, 2024

wlach Feb 6, 2024 •

edited

wlach Feb 6, 2024 •

edited

wlach Feb 7, 2024

jayaddison Feb 7, 2024

wlach Feb 7, 2024

wlach commented Feb 27, 2024

wlach commented Mar 3, 2024

jayaddison commented Mar 4, 2024

jayaddison Mar 4, 2024

jayaddison Mar 4, 2024

jayaddison commented Mar 5, 2024

sphinx-doc.org (baseline)

jayaddison left a comment

wlach commented Mar 9, 2024

picnixz commented Mar 17, 2024

wlach commented Mar 18, 2024

HTML Search: Fix duplicate results #11942

HTML Search: Fix duplicate results #11942

Conversation

wlach commented Feb 4, 2024 • edited

Feature or Bugfix

Purpose

Detail

Relates

wlach commented Feb 4, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayaddison Feb 4, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wlach commented Feb 5, 2024

jayaddison commented Feb 5, 2024

wlach commented Feb 6, 2024

wlach commented Feb 6, 2024

Choose a reason for hiding this comment

wlach Feb 6, 2024 • edited

Choose a reason for hiding this comment

wlach Feb 6, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wlach commented Feb 27, 2024

wlach commented Mar 3, 2024

jayaddison commented Mar 4, 2024

sphinx-doc.org (baseline)

current branch

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jayaddison commented Mar 5, 2024

sphinx-doc.org (baseline)

jayaddison left a comment

Choose a reason for hiding this comment

wlach commented Mar 9, 2024

picnixz commented Mar 17, 2024

wlach commented Mar 18, 2024

wlach commented Feb 4, 2024 •

edited

jayaddison Feb 4, 2024 •

edited

wlach Feb 6, 2024 •

edited

wlach Feb 6, 2024 •

edited