-
-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
4039 Fix inconsistent results when ordering by date sorting keys. #4041
Conversation
cl/lib/elasticsearch_utils.py
Outdated
midnight_current_date = ( | ||
datetime.datetime.combine(default_current_date, datetime.time()) | ||
if default_current_date | ||
else None | ||
) | ||
default_current_time = ( | ||
int(midnight_current_date.timestamp() * 1000) | ||
if midnight_current_date | ||
else None | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this logic by using a single if statement
midnight_current_date = ( | |
datetime.datetime.combine(default_current_date, datetime.time()) | |
if default_current_date | |
else None | |
) | |
default_current_time = ( | |
int(midnight_current_date.timestamp() * 1000) | |
if midnight_current_date | |
else None | |
) | |
midnight_current_date = None | |
default_current_time = None | |
if default_current_date: | |
midnight_current_date = datetime.datetime.combine( | |
default_current_date, datetime.time() | |
) | |
default_current_time = int(midnight_current_date.timestamp() * 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed, thanks for the suggestion!
cl/api/pagination.py
Outdated
def set_page_request(self, request, search_type) -> datetime.date: | ||
self.base_url = request.build_absolute_uri() | ||
self.request = request | ||
self.search_type = search_type | ||
self.cursor = self.decode_cursor(request) | ||
|
||
# Set the request date from the cursor or provide an initial one if | ||
# this is the first page request. | ||
self.request_date = ( | ||
self.cursor.request_date | ||
if self.cursor | ||
else datetime.datetime.now().date() | ||
) | ||
return self.request_date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name set_page_request doesn't fully capture this method's functionality. While it does set the request
variable, it's doing a bit more than that, like decoding the cursor and setting the cursor variable, and also returning only the request_date
. Maybe we can give it a name that reflects those extra bits? Something like initialize_context_from_request, What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with the name change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
cl/search/tests/tests_es_recap.py
Outdated
with self.subTest( | ||
search_params=search_params, msg="Test stable scores." | ||
): | ||
original_datetime = now().replace(day=1, hour=5, minute=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can get this variable out of the loop, right? We don't need to recompute it for each subtest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, thanks for noticing it. I have moved the variable out of the loop.
cl/search/tests/tests_es_recap.py
Outdated
with time_machine.travel(original_datetime, tick=False): | ||
r = self.client.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can potentially improve code reusability by implementing the shift helper function. This function encapsulates the logic to move the current time by a given offset. By using this function before triggering an API request, we can avoid repeating the "time_machine" contexts throughout the code.
We would need to apply the following refactor to the code:
original_datetime = now().replace(day=1, hour=5, minute=0)
for search_params in all_tests:
with self.subTest(
search_params=search_params, msg="Test stable scores."
), time_machine.travel(original_datetime, tick=False) as traveler:
# first API request
r = self.client.get(
reverse("search-list", kwargs={"version": "v4"}),
search_params,
)
date_score_1 = r.data["results"][0]["date_score"]
traveler.shift(datetime.timedelta(minutes=1))
# 2nd API request
r = self.client.get(
reverse("search-list", kwargs={"version": "v4"}),
search_params,
)
date_score_2 = r.data["results"][0]["date_score"]
...
what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree, this is a perfect use case for shift
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested the code using several cloned dockets. The results within the API response and the corresponding frontend results consistently match. We can merge the PR after addressing the comments
Thanks, @ERosendo I've applied your suggestions! |
This pull request addresses issue #4032, which caused inconsistent sort orders when using
date_filed
for ascending sorts.d
andr
search types.The original query structure was:
After profiling the queries, I realized that when a docket is matched by one of its child documents, the docket doesn't match the query. The docket function score doesn't contribute to the final document score.
The solution was to apply the function score to the whole query, so it applies regardless of the query that matched the document.
The second issue solved in this PR was related to using the same function score to sort documents by
date_filed
orentry_date_filed
in descending order, which had no issue since the document date is used directly to compute the score.However, when sorting in ascending order, the score is computed as
current_time - date_filed_time
, causing older documents to have a greater score. This could lead to pagination inconsistencies over time ascurrent_time
was based on milliseconds. As time passed between requests, scores could change between pages.In the front end, this is not an issue since all documents change their score equally. However, for cursor pagination, this could lead to inconsistencies across pages as these scores are used as a
search_after
parameter, potentially resulting in missing documents across pages.request_date
is retrieved on the first-page request, always set to the current day at midnight, ensuring all requests in a day have the samerequest_date
and scores remain stable throughout the day due todate_filed
andentry_date_filed
are also day granular.request_date
to the cursor and use it to compute scores across pagination using the same cursor. This resolves the problem of score stability if a request is made on two different days, which might seem unusual but can be common for requests made close to day transitions.Let me know what you think.