Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4039 Fix inconsistent results when ordering by date sorting keys. #4041

Merged
merged 5 commits into from
May 14, 2024

Conversation

albertisfu
Copy link
Contributor

This pull request addresses issue #4032, which caused inconsistent sort orders when using date_filed for ascending sorts.

  • The problem was directly related to queries that matched a docket by a RECAPDocument field in the d and r search types.

The original query structure was:

bool
 should
   function_score to nullify the child score
     has_child_query
  function_score to compute the score based on the docket date_filed
     docket_query

After profiling the queries, I realized that when a docket is matched by one of its child documents, the docket doesn't match the query. The docket function score doesn't contribute to the final document score.
The solution was to apply the function score to the whole query, so it applies regardless of the query that matched the document.

function_score to compute the score based on the docket `date_filed`
  bool
    should
      has_child_query
      docket_query
  • The second issue solved in this PR was related to using the same function score to sort documents by date_filed or entry_date_filed in descending order, which had no issue since the document date is used directly to compute the score.
    However, when sorting in ascending order, the score is computed as current_time - date_filed_time, causing older documents to have a greater score. This could lead to pagination inconsistencies over time as current_time was based on milliseconds. As time passed between requests, scores could change between pages.
    In the front end, this is not an issue since all documents change their score equally. However, for cursor pagination, this could lead to inconsistencies across pages as these scores are used as a search_after parameter, potentially resulting in missing documents across pages.

    • To solve the issue, a request_date is retrieved on the first-page request, always set to the current day at midnight, ensuring all requests in a day have the same request_date and scores remain stable throughout the day due to date_filed and entry_date_filed are also day granular.
    • The second tweak was to pass this request_date to the cursor and use it to compute scores across pagination using the same cursor. This resolves the problem of score stability if a request is made on two different days, which might seem unusual but can be common for requests made close to day transitions.

Let me know what you think.

@albertisfu albertisfu requested a review from mlissner May 10, 2024 01:23
Base automatically changed from 4039-fix-v4-search-api-serializer-errors to main May 10, 2024 16:44
Comment on lines 762 to 771
midnight_current_date = (
datetime.datetime.combine(default_current_date, datetime.time())
if default_current_date
else None
)
default_current_time = (
int(midnight_current_date.timestamp() * 1000)
if midnight_current_date
else None
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this logic by using a single if statement

Suggested change
midnight_current_date = (
datetime.datetime.combine(default_current_date, datetime.time())
if default_current_date
else None
)
default_current_time = (
int(midnight_current_date.timestamp() * 1000)
if midnight_current_date
else None
)
midnight_current_date = None
default_current_time = None
if default_current_date:
midnight_current_date = datetime.datetime.combine(
default_current_date, datetime.time()
)
default_current_time = int(midnight_current_date.timestamp() * 1000)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed, thanks for the suggestion!

Comment on lines 98 to 111
def set_page_request(self, request, search_type) -> datetime.date:
self.base_url = request.build_absolute_uri()
self.request = request
self.search_type = search_type
self.cursor = self.decode_cursor(request)

# Set the request date from the cursor or provide an initial one if
# this is the first page request.
self.request_date = (
self.cursor.request_date
if self.cursor
else datetime.datetime.now().date()
)
return self.request_date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name set_page_request doesn't fully capture this method's functionality. While it does set the request variable, it's doing a bit more than that, like decoding the cursor and setting the cursor variable, and also returning only the request_date. Maybe we can give it a name that reflects those extra bits? Something like initialize_context_from_request, What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with the name change!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

with self.subTest(
search_params=search_params, msg="Test stable scores."
):
original_datetime = now().replace(day=1, hour=5, minute=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can get this variable out of the loop, right? We don't need to recompute it for each subtest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, thanks for noticing it. I have moved the variable out of the loop.

Comment on lines 3253 to 3254
with time_machine.travel(original_datetime, tick=False):
r = self.client.get(
Copy link
Contributor

@ERosendo ERosendo May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can potentially improve code reusability by implementing the shift helper function. This function encapsulates the logic to move the current time by a given offset. By using this function before triggering an API request, we can avoid repeating the "time_machine" contexts throughout the code.

We would need to apply the following refactor to the code:

original_datetime = now().replace(day=1, hour=5, minute=0)

for search_params in all_tests:
   with self.subTest(
       search_params=search_params, msg="Test stable scores."
   ), time_machine.travel(original_datetime, tick=False) as traveler:
   
        # first API request 
        r = self.client.get(
            reverse("search-list", kwargs={"version": "v4"}),
            search_params,
        )
        date_score_1 = r.data["results"][0]["date_score"]


       traveler.shift(datetime.timedelta(minutes=1))
       # 2nd API request
       r = self.client.get(
            reverse("search-list", kwargs={"version": "v4"}),
            search_params,
        )
       date_score_2 = r.data["results"][0]["date_score"]

       ...

what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, this is a perfect use case for shift.

Copy link
Contributor

@ERosendo ERosendo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested the code using several cloned dockets. The results within the API response and the corresponding frontend results consistently match. We can merge the PR after addressing the comments

@albertisfu
Copy link
Contributor Author

Thanks, @ERosendo I've applied your suggestions!

@mlissner mlissner merged commit 266620b into main May 14, 2024
13 checks passed
@mlissner mlissner deleted the 4032-fix-v4-search-api-date-fields-sorting branch May 14, 2024 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants