3033 Introduced V4 People Search API #4021

Merged
merged 16 commits into from
May 17, 2024

Contributor

@albertisfu albertisfu commented May 4, 2024

This PR introduces the PEOPLE search type to the V4 Search API (Judges).

The object's structure looks as follows:

{
   "aba_rating":[],
   "absolute_url":"/person/15466/frank-aguilar/",
   "alias":[],
   "alias_ids":[],
   "date_granularity_dob":"",
   "date_granularity_dod":"",
   "dob":null,
   "dob_city":"",
   "dob_state":"",
   "dob_state_id":"",
   "dod":null,
   "fjc_id":"None",
   "gender":"Male",
   "id":15466,
   "meta":{
      "timestamp":"2024-05-15T16:28:47.694956Z",
      "date_created":"2022-06-17T17:44:50.557685Z"
   },
   "name":"Frank Aguilar",
   "political_affiliation":[
      "Democratic"
   ],
   "political_affiliation_id":[
      "d"
   ],
   "positions":[
      {
         "appointer":null,
         "court":"Tex. 228th Jud. Dist. Ct.",
         "court_citation_string":"",
         "court_exact":"texdistct229",
         "court_full_name":"Texas 228th Judicial District Court",
         "date_confirmation":null,
         "date_elected":"2018-11-06",
         "date_granularity_start":"%Y-%m-%d",
         "date_granularity_termination":"%Y-%m-%d",
         "date_hearing":null,
         "date_judicial_committee_action":null,
         "date_nominated":null,
         "date_recess_appointment":null,
         "date_referred_to_judicial_committee":null,
         "date_retirement":null,
         "date_start":"2019-01-01",
         "date_termination":"2023-01-01",
         "job_title":"",
         "judicial_committee_action":"",
         "meta":{
            "timestamp":"2024-05-15T16:28:49.391488Z",
            "date_created":"2022-06-17T17:44:50.626627Z"
         },
         "nomination_process":"",
         "organization_name":null,
         "position_type":"Presiding Judge",
         "predecessor":null,
         "selection_method":"",
         "selection_method_id":"",
         "supervisor":null,
         "termination_reason":""
      }
   ],
   "races":[
      "Hispanic/Latino"
   ],
   "religion":"",
   "school":[
      "The University of Texas at Austin"
   ]
}

It displays PersonDocument as the main document with their nested PositionDocument.

As in RECAP and Opinions, due to PersonDocument fields being indexed into PositionDocument, if a query only involves a PersonDocument field, all the Person Positions will be matched. This also happens with match-all queries, so that each Person will show all their positions. To ensure all of them are shown, the inner_hits size is set to 1000.

By default, the max inner hits that can be queried is 100, so we'd need to update this setting in the people_vectors index to 1000 before merging this PR:

PUT  /people_vectors/_settings

{
  "index": {
    "max_inner_result_window": 1000
  }
}
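
For anyone who prefers to script this change, here is a minimal sketch that builds the same request with the Python standard library. The base URL (http://localhost:9200) and the helper name build_inner_window_request are assumptions for illustration only; the index name and the setting come from the description above. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_inner_window_request(
    base_url: str, index: str, window: int
) -> urllib.request.Request:
    """Build (but do not send) the PUT /<index>/_settings request."""
    body = json.dumps({"index": {"max_inner_result_window": window}})
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# The base URL is an assumption; point it at the real cluster before sending.
req = build_inner_window_request("http://localhost:9200", "people_vectors", 1000)
# urllib.request.urlopen(req) would apply the setting.
```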

If the query matches a position field specifically, only the positions that match the query will be displayed within the Person as nested objects.

Originally, the People search on the frontend and the V3 API was not using the same query approach as other parent-child documents like RECAP and Opinions. This was because People search was not required to show nested documents in the frontend or the V3 API, and it was using a simpler approach that didn't return nested documents. Now, in V4, we need to show nested documents. To centralize the code base for building the People queries, I've migrated the frontend and V3 queries to use build_full_join_es_queries, which is the same approach used in RECAP and Opinions. The difference is that for the frontend and V3, the number of inner hits to return is 0, while in V4 it is 1000.
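
To illustrate the difference in inner hits, here is a rough sketch of how the size could be chosen per API version. This is not the code in this PR: people_inner_hits_size, has_child_clause and the "position" child type name are hypothetical, and only the 0 and 1000 values come from the description above.

```python
from typing import Any, Literal, Optional

# Value from the PR description; the constant name is hypothetical.
V4_PEOPLE_INNER_HITS = 1000


def people_inner_hits_size(api_version: Optional[Literal["v3", "v4"]]) -> int:
    """V4 returns nested positions; the frontend (None) and V3 return none."""
    return V4_PEOPLE_INNER_HITS if api_version == "v4" else 0


def has_child_clause(
    child_query: dict[str, Any], api_version: Optional[Literal["v3", "v4"]]
) -> dict[str, Any]:
    """Wrap a child query in a has_child clause with the right inner_hits size."""
    return {
        "has_child": {
            "type": "position",  # assumed child relation name
            "query": child_query,
            "inner_hits": {"size": people_inner_hits_size(api_version)},
        }
    }
```

With this shape, the frontend and V3 can share the same query builder as V4 and differ only in the inner_hits size.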

Sorting

The supported sorting keys for People are the same as those in the frontend:

"score desc"
"name_reverse asc"
"dob desc,name_reverse asc"
"dob asc,name_reverse asc"
"dod desc,name_reverse asc"

Because dob and dod can be None, it was necessary to apply the same approach as in RECAP (a custom function score) as a workaround to sort documents by these fields and use them as the search_after param.

Also note that the dob and dod sorting keys have a secondary sorting key by default, which is name_reverse asc. This means that in the V4 API, the sorting looks like:

1. function score for dob or dod
2. name_reverse asc
3. id desc (as the tiebreaker key)
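
The three-level ordering can be pictured with a plain-Python analogue. This is not the Elasticsearch function score used in the PR: the Person tuple and dob_desc_sort_key are hypothetical, and placing people without a dob last is an assumption made for the example.

```python
from datetime import date
from typing import NamedTuple, Optional


class Person(NamedTuple):
    id: int
    name_reverse: str
    dob: Optional[date]


def dob_desc_sort_key(person: Person) -> tuple[int, str, int]:
    """Key for "dob desc,name_reverse asc" with id desc as the tiebreaker."""
    # Stand-in for the function score: a missing dob maps to 0, so it sorts
    # after every real date when ordering descending (assumption).
    dob_score = person.dob.toordinal() if person.dob else 0
    return (-dob_score, person.name_reverse, -person.id)


people = [
    Person(1, "aguilar frank", None),
    Person(2, "baker ann", date(1950, 1, 1)),
    Person(3, "aguilar frank", None),
    Person(4, "clark bo", date(1960, 5, 5)),
]
ordered = sorted(people, key=dob_desc_sort_key)
sorted_ids = [p.id for p in ordered]
```

Since every person gets a concrete value for all three keys, the values of the last result on a page can serve as a cursor, which is the role search_after plays here.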

Highlighting

As in the other search types, highlighting is disabled by default. When enabled by passing highlight=on, the HL fields are the same as in the frontend:

name
dob_city
dob_state_id
school
political_affiliation

All of them are parent-level fields.
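
As a rough sketch of the opt-in behavior, the highlight section of the request could be built only when highlight=on is passed. The build_people_highlight helper is hypothetical and the per-field options are left empty; only the field names and the highlight=on parameter come from the description above.

```python
from typing import Any, Mapping, Optional

PEOPLE_HL_FIELDS = [
    "name",
    "dob_city",
    "dob_state_id",
    "school",
    "political_affiliation",
]


def build_people_highlight(params: Mapping[str, str]) -> Optional[dict[str, Any]]:
    """Return an Elasticsearch highlight section, or None when HL is disabled."""
    if params.get("highlight") != "on":
        return None  # highlighting is disabled by default
    # Empty per-field options mean the Elasticsearch defaults apply.
    return {"fields": {field: {} for field in PEOPLE_HL_FIELDS}}
```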

Empty list fields

I noticed that empty list fields in the people_vectors index were being indexed as None:

prepare_political_affiliation
prepare_alias
prepare_aba_rating
prepare_school
prepare_races
prepare_alias_ids
prepare_political_affiliation_id

In other search types, we display empty list fields as []. So I fixed the indexing to index them as [] when empty. This will be corrected in the next re-index of people_vectors.

However, I also found that on partial updates that involve a list field, the field is re-indexed as None even though it's explicitly passed an empty list. The issue is described here: elastic/elasticsearch-dsl-py#1819

Once the fix is released, we can update the client.

In the meantime, as a workaround, we're using the NoneToListField to display these fields as empty lists instead of None.
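
The idea behind the workaround can be sketched in plain Python. This is not the NoneToListField implementation from the PR, which is a serializer field; none_to_list and normalize_list_fields are hypothetical helpers, and the field names are inferred from the prepare_* methods listed above.

```python
from typing import Any

# Inferred from the prepare_* method names; treat the list as an assumption.
PEOPLE_LIST_FIELDS = (
    "political_affiliation",
    "alias",
    "aba_rating",
    "school",
    "races",
    "alias_ids",
    "political_affiliation_id",
)


def none_to_list(value: Any) -> Any:
    """Show a list field that was indexed as None as an empty list."""
    return [] if value is None else value


def normalize_list_fields(document: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the document with None list fields replaced by []."""
    normalized = dict(document)
    for field in PEOPLE_LIST_FIELDS:
        if field in normalized:
            normalized[field] = none_to_list(normalized[field])
    return normalized


doc = {"name": "Frank Aguilar", "alias": None, "races": ["Hispanic/Latino"], "dob": None}
cleaned = normalize_list_fields(doc)
```

Only the known list fields are touched, so nullable scalar fields such as dob keep their None value.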

Let me know what you think.

…_es_queries approach.

- Improved People serializers.
- Fixed a bug related to empty lists values after partial updates.
@albertisfu albertisfu marked this pull request as ready for review May 15, 2024 22:43
@albertisfu albertisfu requested a review from mlissner May 15, 2024 22:43
Member

@mlissner mlissner left a comment

This all sounds and looks good to me at a skim. @ERosendo, over to you for the full review.

Thank you both!

Base automatically changed from 3033-develop-v4-opinions-search-api to main May 16, 2024 15:56
Contributor

@ERosendo ERosendo left a comment

The code looks good. I tested using different filter combinations and it worked properly. There's just one minor suggestion for refactoring the get_child_top_hits_limit method. After that, I think we can merge this PR 👍

Comment on lines 2499 to +2502
 def get_child_top_hits_limit(
-    search_params: QueryDict | CleanData, search_type: str
+    search_params: QueryDict | CleanData,
+    search_type: str,
+    api_version: Literal["v3", "v4"] | None = None,
Contributor

I believe we can combine the match-case statements. They seem to share similar logic.

Contributor Author

Sure! I've applied the suggestion, thanks!

…query.

- The plain_text was not being merged from the database when HL was disabled in the V4 RECAP Search API.
@albertisfu
Contributor Author

Thanks, @ERosendo. I've applied your suggestion.

While working on that, I noticed a bug in the V4 RECAP Search API, related to get_search_query, where match-all queries are built for nested search types. The problem was that when HL was disabled and performing a match-all query, the snippet was being retrieved from ES using the HL no_match_size feature. As a result, these types of queries wouldn't get the performance boost of disabling HL completely. So, I refactored the method to use build_has_child_query to build all the has_child queries with the same properties, allowing HL to be disabled in the V4 API and getting the snippet from the DB.

Additionally, I noticed that the rd type in the frontend (where it is not supported) was throwing a 500 error instead of failing gracefully. So, I applied a fix to show the search error page instead.

If everything looks good, this can be merged. However, before we proceed, we need to apply this setting in production. It is required so that the index accepts an inner hits size of 1000, the maximum number of positions returned per person.

PUT  /people_vectors/_settings

{
  "index": {
    "max_inner_result_window": 1000
  }
}

@mlissner
Member

I applied the setting in prod yesterday, so if you are both happy, let's merge!

@ERosendo
Contributor

ERosendo commented May 17, 2024

@mlissner The latest commit successfully resolved the issue identified by @albertisfu in their comment. Everything is working properly now :shipit:

@mlissner mlissner merged commit 72bdb60 into main May 17, 2024
13 checks passed
@mlissner mlissner deleted the 3033-develop-v4-people-search-api branch May 17, 2024 17:07