Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML Search: Fix duplicate results #11942

Closed
wants to merge 11 commits into from

Conversation

wlach
Copy link
Contributor

@wlach wlach commented Feb 4, 2024

Subject: Fix duplicate search results

Feature or Bugfix

  • Bugfix

Purpose

We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.

Fix this by not using the anchor in the search result if it's an exact match on the document title. This ensures that each search result will appear at most one time.

Detail

An easy way to reproduce the bug is to do a title search: the first result will link to the section, the second will link to the page in general. It appears that the "remove duplicates" feature was added some time ago, perhaps before there were multiple ways in which documents were searched.

https://pradyunsg.me/furo/search/?q=quickstart&check_keywords=yes&area=default

Screenshot from 2024-02-04 13-42-21

Note that it appears the main sphinx documentation uses readthedocs' custom search, so this bug wouldn't be reproducible there.

Relates

Closes #11961

@wlach wlach changed the title Fix duplicate search results HTML Search: Fix duplicate results Feb 4, 2024
@wlach wlach marked this pull request as ready for review February 4, 2024 18:51
@wlach
Copy link
Contributor Author

wlach commented Feb 4, 2024

@AA-Turner Sorry for the direct ping again, but here's another hopefully obvious search fix in the vein of #11770

titles[file] !== title ? `${titles[file]} > ${title}` : title,
id !== null ? "#" + id : "",
isDocumentTitle ? title : `${titles[file]} > ${title}`,
isDocumentTitle ? "" : sectionId, // don't use the section id if we matched on the exact document title (creates duplicates below)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there always an entry in foundTitles that is the title of the document? (and does it have a special/fixed id value if so?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there always an entry in foundTitles that is the title of the document?

Should be, assuming that every document has a title and this code is doing what it says it should:

https://github.com/wlach/sphinx/blob/27f4d3605989ea2216c5f370b07833a501de7267/sphinx/search/__init__.py#L397

(and does it have a special/fixed id value if so?)

The id will just be a regular section id (i.e. something like to3-automated-python-2-to-3-code-translation in https://docs.python.org/3/library/2to3.html)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Basically I'm wondering whether all of the title elements from foundTitles are already in the foundEntries index -- and if so, whether we could omit the foundTitles loop entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Basically I'm wondering whether all of the title elements from foundTitles are already in the foundEntries index -- and if so, whether we could omit the foundTitles loop entirely.

It seems that it is not possible to omit that loop. The only way to find nested sections (A > B) seems to be from results in foundTitles (I tested with query quick guide for the Sphinx self-build doc dir).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand: titles have a two-level hierarchy -- top-level titles, and then items found 'within' that title. That index is created from the section headings in RST documents.

Meanwhile entries are flat / have no hierarchy. That index is created from directives, options, domain entities and so on.

And in cases where a title and an entry overlap, we want to de-duplicate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in cases where a title and an entry overlap, we want to de-duplicate.

To clarify that: in cases where a top-level title and an entry without an anchor overlap (?).

Copy link
Contributor

@jayaddison jayaddison Feb 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, something's off about my understanding here.

  • ✅ I can search the Sphinx documentation for translating with sphinx-intl, and I see a second-level title result: Internationalization > Translating with sphinx-intl.
  • ❌ But I cannot seem to search for using weblate -- if I scroll I eventually find a single, top-level Internationalization entry result -- but I would have expected to find Internationalization > Using Weblate service for team translation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to take a break from reviewing for a while @wlach - I'll be back within the next few days. Thank you for looking into this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ But I cannot seem to search for using weblate -- if I scroll I eventually find a single, top-level Internationalization entry result -- but I would have expected to find Internationalization > Using Weblate service for team translation.

"using weblate" is too short a search term for it to match (since the full title is 42 characters long, more than double the size of the section item):

https://github.com/wlach/sphinx/blob/8daf8c997340e67aa77db3a78a2140eaebb8a3ab/sphinx/themes/basic/static/searchtools.js#L293

If you search for e.g. "Using weblate service" you'd get the result you expected above.

I'm going to take a break from reviewing for a while @wlach - I'll be back within the next few days. Thank you for looking into this.

No problem, thank you! I may have limited time/opportunity to revise this PR for a few weeks after this Wednesday, but I won't forget about it. Also welcome maintainer changes / updates.

wlach added a commit to wlach/sphinx that referenced this pull request Feb 5, 2024
Fix a number of small bugs that this picked up
along the way.

Intended to eventually be updated after sphinx-doc#11942 is
landed (but seperating out for now for ease-of-review).
wlach added a commit to wlach/sphinx that referenced this pull request Feb 5, 2024
Fix a number of small bugs that this picked up
along the way.

Intended to eventually be updated after sphinx-doc#11942 is
landed (but seperating out for now for ease-of-review).
@wlach
Copy link
Contributor Author

wlach commented Feb 5, 2024

Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.

@jayaddison
Copy link
Contributor

Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already.

About the test case: it's OK -- in fact much preferred! -- to invert the logic in that so that it fails until your fix from #11942 is applied to it. That helps people to understand the nature of the bug, and demonstrates how the fix addresses it.

Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while:

  1. Ensure that each bug has a description somewhere (I count four you've uncovered, with one issue opened so far).
  2. Open a refactor PR to make the Search.query function testable - based on your work so far that seems to be a requirement. That PR shouldn't change any behaviour, and it (hopefully!) should be uncontroversial and easy for us to review and accept.
  3. Open either a combined PR, or one PR to resolve each bug (as you prefer -- as a reviewer, I'd like one per bug, but am OK with either). There should be at least one test case corresponding to each bug, and they should fail without the fix, and succeed with the fix in place.

I'm more-or-less up-to-date with your changes and I think they'll make a big improvement to Sphinx search quality -- all the more reason to get them included and credited in the changelog. Even so, I think total-time-to-merge may be minimized if we spend a little more time up-front to describe and separate each bugfix.

Let me know if I can help to explain anything about my recommended bugfix workflow or the reasoning behind it. There could also be things in the contribution documentation for us to update (we're not strict about bugreports before patches, but I do think they'll be helpful here).

@wlach
Copy link
Contributor Author

wlach commented Feb 6, 2024

Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while:
...

Hi yeah, that more or less makes sense. I didn't realize this would be quite the rabbit hole it was when I filed this initial PR otherwise I would have been a bit more systematic in my approach. I'll do something approximately like what you suggest (I think I might not bother with the seperate refactoring PR). One of the issues is that this issue kind of depends on the other two I found yesterday being fixed to be reasonably testable.

@wlach
Copy link
Contributor Author

wlach commented Feb 6, 2024

@jayaddison Hey, revised this PR to reference a specific issue as you suggested. It includes changes from other PRs for ease-of-testing, but that should go away once those are merged. Hopefully those are uncontroversial.

@wlach wlach requested a review from jayaddison February 6, 2024 15:32
Comment on lines 298 to 306
titles[file] !== title ? `${titles[file]} > ${title}` : title,
id !== null ? "#" + id : "",
isDocumentTitle ? title : `${titles[file]} > ${title}`,
isDocumentTitle ? "" : anchor, // don't provide an anchor if we matched on the exact document title (creates duplicates below)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this approach, because it creates a disparity between the code that queries titles compared to the code that queries entries.

The way I mentally model this is that we have a series of pipeline inputs -- currently two of them -- that produce results, and each of them passes results to a single filtering function (the de-duplication).

In my opinion it's easier to reason about the overall behaviour of the system when all of the pipelines behave as-near-identically as feasible.

So I'd recomment pushing the fix to the de-duplication phase. Your test coverage should still pass regardless of where the fix is applied.

Copy link
Contributor Author

@wlach wlach Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense, just modified my approach to not de-duplicate on anchor, just on filename, title, and description (the latter is just for object definitions). This will consider two entries in the same document with the same section title but different anchors duplicates, but I think that's probably acceptable. If that's the case I don't think the extra result is going to be particularly meaningful.

The alternative would be to pass down whether a result refers to the document title or not and then ignore the anchor if so, but that seemed significantly more complicated for little benefit.

Copy link
Contributor Author

@wlach wlach Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting case where we have identical titles but unique sections:

https://docs.sqlalchemy.org/en/20/search.html?q=sql&check_keywords=yes&area=default

The top results are:

https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.0-sql
https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.4-sql
https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.5-sql

All of them have the same title 0.3 Changelog > sql. With this change, we'd only select the top result. I think that would probably actually be kind of desirable? Prevents the ridiculous proliferation of nearly-identical search results (6551 on that page as of this writing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed my mind on this after thinking about it a bit more. I don't think it's a good idea to pull out valid unique search results out, so I went back and decided to do things the hard way -- I added an "isDocumentTitle" property to the search results, and omitted the anchor id from the de-duplication key if that's the case.

The end result passes the test, though it is a little awkward. Maybe there's a cleaner way of expressing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened wlach#1 with approximately what I had in mind for this - it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).

Basically what it does is drop the title from the result de-duplication key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened wlach#1 with approximately what I had in mind for this - it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).

Hmm, not sure this will work -- see my comment there. I might be missing something, but I don't think there's any way around special casing the "title" result.

it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).

Might be worth filing another issue for this. 😄

@wlach
Copy link
Contributor Author

wlach commented Feb 27, 2024

Update: @jayaddison and I had a back and forth about different approaches in wlach#1.

After all that discussion, I feel pretty ok about the approach in this PR and feel like it's reasonable to land as-is (as stated in the header, #11960 and #11958 should go in first, then this can be rebased on top of them) though I'm still willing to make minor adjustments if necessary.

@jayaddison jayaddison added type:bug javascript Pull requests that update Javascript code labels Mar 2, 2024
We currently list the same page multiple times if,
for example, both the title and content search
match it. Since the preview is always the same,
this is not very helpful.
@wlach wlach force-pushed the fix-duplicate-search-results branch 4 times, most recently from 0446152 to 949e131 Compare March 3, 2024 19:21
@wlach wlach force-pushed the fix-duplicate-search-results branch from 949e131 to d85c6ca Compare March 3, 2024 19:27
@wlach
Copy link
Contributor Author

wlach commented Mar 3, 2024

Ok, I rebased this on top of all the work that landed so I think this is ready for review/landing on its own now.

In addition to the unit test, I also tested this end-to-end by running make docs target=html after installing Sphinx locally in a virtualenv.

Example output before:

image

(note double "Welcome")

Example output after:

image

(note only one "Welcome" result, pointing at the header)

@jayaddison
Copy link
Contributor

@wlach could you try searching for configuration and check what kind of results you get?

When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org

sphinx-doc.org (baseline)

image

current branch

image

It's possible that this is a configuration setting related to the way that the sphinx-docs.org site is deployed. Either way, it'd be nice to confirm.

@@ -304,10 +309,13 @@ const Search = {
if (title.toLowerCase().trim().includes(queryLower) && (queryLower.length >= title.length/2)) {
for (const [file, id] of foundTitles) {
let score = Math.round(100 * queryLower.length / title.length)
let anchorIsDocumentTitle = titles[file] === title
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pre-calculate this when building the alltitles in Python, so that the client doesn't have to transform each entry every time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears theoretically possible. I've drafted some sample code to illustrate that alternative here: #12047 - however it does adjust the contents of the index, and that could be considered a disadvantage.

@jayaddison
Copy link
Contributor

@wlach could you try searching for configuration and check what kind of results you get?

When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org

sphinx-doc.org (baseline)

[trimmed]

It's possible that this is a configuration setting related to the way that the sphinx-docs.org site is deployed. Either way, it'd be nice to confirm.

Well, it turns out that this was an unfair baseline to compare against.

The sphinx-doc.org site seems to be hosted by readthedocs, and the relevant Sphinx extension adds some custom code to use their search API in preference to Sphinx's own search index.

Running a search for configuration from the latest current Sphinx commit at the time of writing here (ae51974) produces:

image

...so basically identical results, but including many duplicates.

Copy link
Contributor

@jayaddison jayaddison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I have a preference for fixing this in the Python code so that the JavaScript clients don't have to do repeat work here. But I'm OK with the code here, and it achieves the intended result, so I'm going to approve this.

@wlach
Copy link
Contributor Author

wlach commented Mar 9, 2024

Personally I have a preference for fixing this in the Python code so that the JavaScript clients don't have to do repeat work here. But I'm OK with the code here, and it achieves the intended result, so I'm going to approve this.

I agree FWIW. My preference would be to merge your solution and close this. Will wait for a maintainer to weigh in though.

@jayaddison jayaddison added awaiting:response Waiting for a response from the author of this issue awaiting:review PR waiting for a review by a maintainer. and removed awaiting:response Waiting for a response from the author of this issue labels Mar 14, 2024
@picnixz
Copy link
Member

picnixz commented Mar 17, 2024

I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.

@picnixz picnixz self-requested a review March 17, 2024 09:43
@picnixz picnixz added priority:high and removed awaiting:review PR waiting for a review by a maintainer. labels Mar 17, 2024
@wlach
Copy link
Contributor Author

wlach commented Mar 18, 2024

I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week.

Ok, I'm just going to close this to reduce confusion. The code is still here if we need to try something else later, but that seems pretty unlikely at this point.

@wlach wlach closed this Mar 18, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
html search javascript Pull requests that update Javascript code priority:high type:bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML Search: Contains duplicates based on title and content search
3 participants