-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML Search: Fix duplicate results #11942
Conversation
@AA-Turner Sorry for the direct ping again, but here's another hopefully obvious search fix in the vein of #11770 |
titles[file] !== title ? `${titles[file]} > ${title}` : title, | ||
id !== null ? "#" + id : "", | ||
isDocumentTitle ? title : `${titles[file]} > ${title}`, | ||
isDocumentTitle ? "" : sectionId, // don't use the section id if we matched on the exact document title (creates duplicates below) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there always an entry in foundTitles
that is the title of the document? (and does it have a special/fixed id
value if so?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there always an entry in foundTitles that is the title of the document?
Should be, assuming that every document has a title and this code is doing what it says it should:
(and does it have a special/fixed id value if so?)
The id will just be a regular section id (i.e. something like to3-automated-python-2-to-3-code-translation
in https://docs.python.org/3/library/2to3.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Basically I'm wondering whether all of the title
elements from foundTitles
are already in the foundEntries
index -- and if so, whether we could omit the foundTitles
loop entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Basically I'm wondering whether all of the
title
elements fromfoundTitles
are already in thefoundEntries
index -- and if so, whether we could omit thefoundTitles
loop entirely.
It seems that it is not possible to omit that loop. The only way to find nested sections (A > B
) seems to be from results in foundTitles
(I tested with query quick guide
for the Sphinx self-build doc
dir).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah index entries are their own concept, different from titles:
https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#index-generating-markup
https://coderefinery.github.io/sphinx-lesson/indexing/#indexing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand: titles
have a two-level hierarchy -- top-level titles, and then items found 'within' that title. That index is created from the section headings in RST documents.
Meanwhile entries
are flat / have no hierarchy. That index is created from directives, options, domain entities and so on.
And in cases where a title
and an entry
overlap, we want to de-duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And in cases where a title and an entry overlap, we want to de-duplicate.
To clarify that: in cases where a top-level title and an entry without an anchor overlap (?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, something's off about my understanding here.
- ✅ I can search the Sphinx documentation for
translating with sphinx-intl
, and I see a second-level title result:Internationalization > Translating with sphinx-intl
. - ❌ But I cannot seem to search for
using weblate
-- if I scroll I eventually find a single, top-levelInternationalization
entryresult -- but I would have expected to findInternationalization > Using Weblate service for team translation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to take a break from reviewing for a while @wlach - I'll be back within the next few days. Thank you for looking into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ But I cannot seem to search for using weblate -- if I scroll I eventually find a single, top-level Internationalization entry result -- but I would have expected to find Internationalization > Using Weblate service for team translation.
"using weblate" is too short a search term for it to match (since the full title is 42 characters long, more than double the size of the section item):
If you search for e.g. "Using weblate service" you'd get the result you expected above.
I'm going to take a break from reviewing for a while @wlach - I'll be back within the next few days. Thank you for looking into this.
No problem, thank you! I may have limited time/opportunity to revise this PR for a few weeks after this Wednesday, but I won't forget about it. Also welcome maintainer changes / updates.
Fix a number of small bugs that this picked up along the way. Intended to eventually be updated after sphinx-doc#11942 is landed (but seperating out for now for ease-of-review).
Fix a number of small bugs that this picked up along the way. Intended to eventually be updated after sphinx-doc#11942 is landed (but seperating out for now for ease-of-review).
Attempted to add a unit test for this general case in #11950. I filed that as a seperate PR as I found some other minor bugs when working on that and wanted to keep this PR to a reasonable size. It's hard enough to reason about what's going on already. |
About the test case: it's OK -- in fact much preferred! -- to invert the logic in that so that it fails until your fix from #11942 is applied to it. That helps people to understand the nature of the bug, and demonstrates how the fix addresses it. Basically I'd recommend the following to make things easier for us all to comprehend and pick up again if it takes a while:
I'm more-or-less up-to-date with your changes and I think they'll make a big improvement to Sphinx search quality -- all the more reason to get them included and credited in the changelog. Even so, I think total-time-to-merge may be minimized if we spend a little more time up-front to describe and separate each bugfix. Let me know if I can help to explain anything about my recommended bugfix workflow or the reasoning behind it. There could also be things in the contribution documentation for us to update (we're not strict about bugreports before patches, but I do think they'll be helpful here). |
Hi yeah, that more or less makes sense. I didn't realize this would be quite the rabbit hole it was when I filed this initial PR otherwise I would have been a bit more systematic in my approach. I'll do something approximately like what you suggest (I think I might not bother with the seperate refactoring PR). One of the issues is that this issue kind of depends on the other two I found yesterday being fixed to be reasonably testable. |
@jayaddison Hey, revised this PR to reference a specific issue as you suggested. It includes changes from other PRs for ease-of-testing, but that should go away once those are merged. Hopefully those are uncontroversial. |
titles[file] !== title ? `${titles[file]} > ${title}` : title, | ||
id !== null ? "#" + id : "", | ||
isDocumentTitle ? title : `${titles[file]} > ${title}`, | ||
isDocumentTitle ? "" : anchor, // don't provide an anchor if we matched on the exact document title (creates duplicates below) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this approach, because it creates a disparity between the code that queries titles
compared to the code that queries entries
.
The way I mentally model this is that we have a series of pipeline inputs -- currently two of them -- that produce results, and each of them passes results to a single filtering function (the de-duplication).
In my opinion it's easier to reason about the overall behaviour of the system when all of the pipelines behave as-near-identically as feasible.
So I'd recomment pushing the fix to the de-duplication phase. Your test coverage should still pass regardless of where the fix is applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense, just modified my approach to not de-duplicate on anchor, just on filename, title, and description (the latter is just for object definitions). This will consider two entries in the same document with the same section title but different anchors duplicates, but I think that's probably acceptable. If that's the case I don't think the extra result is going to be particularly meaningful.
The alternative would be to pass down whether a result refers to the document title or not and then ignore the anchor if so, but that seemed significantly more complicated for little benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting case where we have identical titles but unique sections:
https://docs.sqlalchemy.org/en/20/search.html?q=sql&check_keywords=yes&area=default
The top results are:
https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.0-sql
https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.4-sql
https://docs.sqlalchemy.org/en/20/changelog/changelog_03.html#change-0.3.5-sql
All of them have the same title 0.3 Changelog > sql
. With this change, we'd only select the top result. I think that would probably actually be kind of desirable? Prevents the ridiculous proliferation of nearly-identical search results (6551 on that page as of this writing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed my mind on this after thinking about it a bit more. I don't think it's a good idea to pull out valid unique search results out, so I went back and decided to do things the hard way -- I added an "isDocumentTitle" property to the search results, and omitted the anchor id from the de-duplication key if that's the case.
The end result passes the test, though it is a little awkward. Maybe there's a cleaner way of expressing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened wlach#1 with approximately what I had in mind for this - it does handle the sqlalchemy
case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql
for example).
Basically what it does is drop the title
from the result de-duplication key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened wlach#1 with approximately what I had in mind for this - it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).
Hmm, not sure this will work -- see my comment there. I might be missing something, but I don't think there's any way around special casing the "title" result.
it does handle the sqlalchemy case you mention, although to be honest those search results would be clearer if we included the subsection headings (0.3 > 0.3.11 > sql for example).
Might be worth filing another issue for this. 😄
Update: @jayaddison and I had a back and forth about different approaches in wlach#1. After all that discussion, I feel pretty ok about the approach in this PR and feel like it's reasonable to land as-is (as stated in the header, #11960 and #11958 should go in first, then this can be rebased on top of them) though I'm still willing to make minor adjustments if necessary. |
We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.
0446152
to
949e131
Compare
949e131
to
d85c6ca
Compare
Ok, I rebased this on top of all the work that landed so I think this is ready for review/landing on its own now. In addition to the unit test, I also tested this end-to-end by running Example output before: (note double "Welcome") Example output after: (note only one "Welcome" result, pointing at the header) |
@wlach could you try searching for When I try it locally, the results seem less relevent than the results from the same query on https://www.sphinx-doc.org sphinx-doc.org (baseline)current branchIt's possible that this is a configuration setting related to the way that the |
@@ -304,10 +309,13 @@ const Search = { | |||
if (title.toLowerCase().trim().includes(queryLower) && (queryLower.length >= title.length/2)) { | |||
for (const [file, id] of foundTitles) { | |||
let score = Math.round(100 * queryLower.length / title.length) | |||
let anchorIsDocumentTitle = titles[file] === title |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we pre-calculate this when building the alltitles
in Python, so that the client doesn't have to transform each entry every time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears theoretically possible. I've drafted some sample code to illustrate that alternative here: #12047 - however it does adjust the contents of the index, and that could be considered a disadvantage.
[trimmed]
Well, it turns out that this was an unfair baseline to compare against. The Running a search for ...so basically identical results, but including many duplicates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I have a preference for fixing this in the Python code so that the JavaScript clients don't have to do repeat work here. But I'm OK with the code here, and it achieves the intended result, so I'm going to approve this.
I agree FWIW. My preference would be to merge your solution and close this. Will wait for a maintainer to weigh in though. |
I've read your comments. AFAICT, everything is ready for that PR to land. Since we merged some related PRs, I'll let you guys tell me whether it's fine to merge (I was a bit confused with the output for "configuration" but my concerns were eventually unfounded). Just ping me next week. |
Ok, I'm just going to close this to reduce confusion. The code is still here if we need to try something else later, but that seems pretty unlikely at this point. |
Subject: Fix duplicate search results
Feature or Bugfix
Purpose
We currently list the same page multiple times if, for example, both the title and content search match it. Since the preview is always the same, this is not very helpful.
Fix this by not using the anchor in the search result if it's an exact match on the document title. This ensures that each search result will appear at most one time.
Detail
An easy way to reproduce the bug is to do a title search: the first result will link to the section, the second will link to the page in general. It appears that the "remove duplicates" feature was added some time ago, perhaps before there were multiple ways in which documents were searched.
https://pradyunsg.me/furo/search/?q=quickstart&check_keywords=yes&area=default
Note that it appears the main sphinx documentation uses readthedocs' custom search, so this bug wouldn't be reproducible there.
Relates
Closes #11961