3033 Introduced V4 Opinion Search API #4007

Merged
merged 14 commits into from
May 16, 2024

Conversation

albertisfu
Contributor

@albertisfu albertisfu commented May 2, 2024

This PR adds the Opinions (o) search type to the V4 Search API.

It includes the same new V4 features as the RECAP search type described in #3975, with some variations detailed below.

The results structure is as follows:

{
    "absolute_url": "/opinion/1243/howard-v-honda/",
    "attorney": "a bunch of crooks!",
    "caseName": "Howard v. Honda",
    "caseNameFull": "Harvey Howard v. Antonin Honda",
    "citation": [
        "22 AL 339",
        "33 state 1",
        "1 Yeates 1",
        "56 F.2d 9"
    ],
    "citeCount": 6,
    "cluster_id": 1243,
    "court": "Testing Supreme Court",
    "court_citation_string": "Test",
    "court_id": "test",
    "dateArgued": "2015-08-15",
    "dateFiled": "1895-06-09",
    "dateReargued": null,
    "dateReargumentDenied": "2015-08-15",
    "date_created": "2024-05-15T00:27:01.293279Z",
    "docketNumber": "docket number 2",
    "docket_id": 1981,
    "judge": "David",
    "lexisCite": "",
    "neutralCite": "22 AL 339",
    "non_participating_judge_ids": [],
    "opinions": [
        {
            "author_id": 1559,
            "cites": [
                1823
            ],
            "date_created": "2024-05-15T00:27:01.293279Z",
            "download_url": null,
            "id": 1822,
            "joined_by_ids": [
                1559
            ],
            "local_path": "test/search/opinion_pdf_image_based.pdf",
            "per_curiam": false,
            "sha1": "8c99509631108e5909f322258da042f8713afe1d",
            "snippet": "",
            "timestamp": "2024-05-15T00:27:01.293279Z",
            "type": "combined-opinion"
        }
    ],
    "panel_ids": [],
    "panel_names": [],
    "posture": "",
    "procedural_history": "some rando history",
    "scdb_id": "",
    "sibling_ids": [
        1822
    ],
    "source": "C",
    "status": "Published",
    "suitNature": "copyright",
    "syllabus": "some rando syllabus",
    "timestamp": "2024-05-15T00:27:01.293279Z"
}

At the first level, OpinionCluster fields are displayed. Within the opinions key, Opinions matching the query are shown. Up to 5 matched nested opinions are displayed per result; this setting is defined by CHILD_HITS_PER_RESULT.
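
To illustrate the shape described above, here is a minimal sketch of how a client might walk one result and collect the nested opinions. The helper name summarize_result is hypothetical and not part of this PR; the sample dict is a trimmed copy of the example result.

```python
from typing import Any


def summarize_result(result: dict[str, Any]) -> dict[str, Any]:
    """Collect cluster-level fields and the IDs/types of the nested opinions."""
    # "opinions" holds only the child opinions returned for this cluster.
    opinions = result.get("opinions", [])
    return {
        "cluster_id": result["cluster_id"],
        "caseName": result["caseName"],
        "opinion_ids": [opinion["id"] for opinion in opinions],
        "opinion_types": [opinion["type"] for opinion in opinions],
    }


# Trimmed copy of the example result shown in this description.
sample = {
    "cluster_id": 1243,
    "caseName": "Howard v. Honda",
    "opinions": [{"id": 1822, "type": "combined-opinion", "snippet": ""}],
}
summary = summarize_result(sample)
```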

In the frontend, there is no button to show more opinions when a query matches more than 5.
Therefore, my question is whether a more_docs field, similar to the one in the V4 RECAP Search API, is necessary when there are more than 5 Opinions matched?
Perhaps it doesn't make sense since we don't have an op type that users could use to query all Opinions matched by a query.

Count

Screenshot 2024-05-03 at 12 05 15 p m

The count reflects only the matched OpinionClusters. When hits exceed 10,000, it relies on the cardinality query to get an approximate count.
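
As a rough sketch of the idea, the request body for an approximate count can use an Elasticsearch cardinality aggregation. The function names, the aggregation name unique_clusters, and the choice of cluster_id as the aggregated field are assumptions for illustration, not the code in this PR. The sketch also treats reaching the 10,000 cap as exceeding it.

```python
MAX_EXACT_HITS = 10_000  # threshold mentioned in the description above


def build_count_body(query: dict) -> dict:
    """Build a body that returns no hits, only an approximate distinct count."""
    return {
        "size": 0,
        "query": query,
        # Assumption: clusters are counted by their cluster_id field.
        "aggs": {"unique_clusters": {"cardinality": {"field": "cluster_id"}}},
    }


def pick_count(exact_hits: int, approximate_count: int) -> int:
    """Prefer the exact hit count below the cap, else the approximate one."""
    return exact_hits if exact_hits < MAX_EXACT_HITS else approximate_count
```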

Sorting

The supported sorting keys for Opinions are the same as those in the frontend:

"score desc"
"dateFiled desc"
"dateFiled asc"
"citeCount desc"
"citeCount asc"

To support cursor pagination, the secondary sorting key is cluster_id desc.

One sorting difference from the RECAP search type: in Opinions, dateFiled and citeCount do not require a custom function score as a workaround for score computation and search_after on None values. This is because date_filed is a mandatory field in the OpinionCluster model and citation_count defaults to 0 in the model.

Thus, sorting directly relies on the values returned by ES, avoiding the use of the custom function score.
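
A minimal sketch of how the supported keys could map to an Elasticsearch sort list with the cluster_id tie-breaker. The names SORT_KEYS and build_sort are hypothetical, and using _score for "score desc" is an assumption; the actual implementation in this PR may differ.

```python
# Supported sort keys from the list above, mapped to (field, direction).
SORT_KEYS = {
    "score desc": ("_score", "desc"),
    "dateFiled desc": ("dateFiled", "desc"),
    "dateFiled asc": ("dateFiled", "asc"),
    "citeCount desc": ("citeCount", "desc"),
    "citeCount asc": ("citeCount", "asc"),
}


def build_sort(order_by: str) -> list[dict]:
    """Return the primary sort plus the cluster_id desc tie-breaker."""
    if order_by not in SORT_KEYS:
        raise ValueError(f"Unsupported sort key: {order_by!r}")
    field, direction = SORT_KEYS[order_by]
    # The secondary key keeps ordering stable for cursor pagination.
    return [{field: {"order": direction}}, {"cluster_id": {"order": "desc"}}]
```

Because the secondary key is unique per cluster, two results with the same dateFiled or citeCount still have a deterministic order, which is what search_after needs.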

Highlighting

As in the RECAP search type, highlighting is disabled by default and can be enabled by passing highlight=on.
The supported HL fields are the same as in the frontend:

caseName
citation
suitNature
court_citation_string
docketNumber
text (snippet)
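
For illustration, a client could build the query string for an Opinions search with highlighting enabled roughly as follows. The helper name is hypothetical and the q parameter name is an assumption; only type=o and highlight=on come from this description.

```python
from urllib.parse import urlencode


def build_search_query(q: str, highlight: bool = False) -> str:
    """Build the query string for an Opinions (type=o) search request."""
    params = {"type": "o", "q": q}  # "q" is an assumed parameter name
    if highlight:
        # Highlighting is off by default and must be requested explicitly.
        params["highlight"] = "on"
    return urlencode(params)
```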

When highlighting is disabled, the snippet is retrieved from the DB, as in RECAP. For Opinions it is a bit more complex, because the text field can be filled during indexing from different source fields, according to their availability and the following priority:

html_columbia
html_lawbox
xml_harvard
html_anon_2020
html
plain_text

The same prioritization is used within the merge_unavailable_fields_on_parent_document method to extract the snippet from the DB, up to NO_MATCH_HL_SIZE characters. It uses a single query per page, relying on Django's Case/When conditional expressions.
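
The prioritization can be sketched in plain Python as below. This is a stand-in for the logic only, not the DB query used in the PR: the function name pick_snippet is hypothetical and the NO_MATCH_HL_SIZE value is a placeholder.

```python
# Field priority from the list above: the first non-empty field wins.
TEXT_FIELD_PRIORITY = [
    "html_columbia",
    "html_lawbox",
    "xml_harvard",
    "html_anon_2020",
    "html",
    "plain_text",
]
NO_MATCH_HL_SIZE = 500  # placeholder value, the real setting may differ


def pick_snippet(opinion: dict[str, str], size: int = NO_MATCH_HL_SIZE) -> str:
    """Return the first non-empty text field, truncated to `size` characters."""
    for field in TEXT_FIELD_PRIORITY:
        text = opinion.get(field) or ""
        if text:
            return text[:size]
    return ""
```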

@albertisfu albertisfu changed the base branch from main to 3033-develop-v4-recap-search-api May 2, 2024 02:10

semgrep-app bot commented May 2, 2024

Semgrep found 4 avoid-query-set-extra findings:

'QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as People.objects.get(name='Bob').

Ignore this finding from avoid-query-set-extra.

- Also merge the snippet content from DB when highlighting is disabled in the API request.

- Included more V4 Opinions Search API
@albertisfu albertisfu force-pushed the 3033-develop-v4-opinions-search-api branch from 6dc94bb to 21ecfd8 Compare May 3, 2024 18:10
@albertisfu albertisfu marked this pull request as ready for review May 3, 2024 18:10
@albertisfu albertisfu requested a review from mlissner May 3, 2024 18:10
@mlissner
Member

mlissner commented May 3, 2024

Therefore, my question is whether a more_docs field, similar to the one in the V4 RECAP Search API, is necessary when there are more than 5 Opinions matched?

This will always be limited to a few different opinions, so I'd say that both the API and the front end should always show all of them. If you set it to 20 items, that'd surely be enough.

The rest sounds perfect!

@albertisfu
Contributor Author

This will always be limited to a few different opinions, so I'd say that both the API and the front end should always show all of them. If you set it to 20 items, that'd surely be enough.

While working on this, one question came up: Do you mean a cluster should always show all their opinions (up to 20) regardless of whether they were matched by the search query?

Or should it show only the opinions that matched a query (up to 20)?

Currently, only opinions that match are displayed in the frontend. If users perform a match-all query or query by cluster fields, the cluster will show "all" the opinions, up to 5.
However, if a user queries by an opinion field, such as the text field, only the opinions that match the query are shown within the cluster.

@mlissner
Member

mlissner commented May 3, 2024

Do you mean a cluster should always show all their opinions (up to 20) regardless of whether they were matched by the search query?

Ideally, only the opinions that match should show in the results. If a cluster matches (but the opinion doesn't), then showing all or none of the sub-opinions seems fine. Probably best in that case to not show any opinion at all.

@albertisfu
Contributor Author

Ideally, only the opinions that match should show in the results.

Yeah, this is how it currently works.

If a cluster matches (but the opinion doesn't), then showing all or none of the sub-opinions seems fine. Probably best in that case to not show any opinion at all.

Well, due to the cluster fields (except for non_participating_judge_ids and source, but they're not searchable) being indexed into the sub-opinions, every time a cluster matches, at least one sub-opinion will also be matched. The only scenario where a cluster can be matched without matching a sub-opinion is if the cluster doesn't have any sub-opinion.

So, I believe the remaining option is to display all the sub-opinions when the query involves only cluster fields (this will happen automatically) or a match-all query.

@mlissner
Member

mlissner commented May 4, 2024

Displaying all the subopinions in that case is fine too!

@albertisfu
Contributor Author

Displaying all the subopinions in that case is fine too!

Great, I've set the limit for sub-opinions to 20. Hoping this limit is enough to display all the possible sub-opinions when they all match in a query. This applies to both the frontend and the API.

Base automatically changed from 3033-develop-v4-recap-search-api to main May 6, 2024 23:16
@mlissner
Member

Looks like we have some conflicts here, @alberto. Want to get them cleaned up, and then I think we're good to have Eduardo review, right?

@albertisfu
Contributor Author

Sure, I've resolved the conflicts and added the meta key to the Opinions serializers as well. So this is now ready for review!

Comment on lines 537 to 542
and cd["type"]
in [
    SEARCH_TYPES.RECAP,
    SEARCH_TYPES.DOCKETS,
    SEARCH_TYPES.RECAP_DOCUMENT,
]
Contributor

Let's refactor this code to store the membership check in a boolean variable. We can call it is_recap_search and reuse it in both if statements.

def build_sort_results(cd, toggle_sorting, api_version):  # parameter list assumed from the names used below
    ...

    is_recap_search = cd["type"] in [
        SEARCH_TYPES.RECAP,
        SEARCH_TYPES.DOCKETS,
        SEARCH_TYPES.RECAP_DOCUMENT,
    ]

    if api_version == "v4" and is_recap_search:
        ...

    if (
        toggle_sorting
        and api_version == "v4"
        and is_recap_search
    ):
        ...

Contributor Author

Great! I've applied the suggestion and named the variable: require_v4_function_score since it also includes PEOPLE in #4021

.annotate(
    text_to_show=Case(
        When(
            ~QObject(html_columbia__exact=""),
Contributor

We don't need to add the __exact lookup. According to the documentation, it is assumed to be exact if you don’t provide a lookup type.

image

Contributor Author

Thanks, I've removed __exact from the lookups.

Comment on lines 2559 to 2563
  if search_type not in [SEARCH_TYPES.RECAP, SEARCH_TYPES.DOCKETS]:
-     return frontend_hits_limit, query_hits_limit
+     return display_hits_limit, query_hits_limit

  if search_type == SEARCH_TYPES.DOCKETS:
-     frontend_hits_limit = 1
+     display_hits_limit = 1
Contributor

We should refactor these if statements into the pattern matching block.

Contributor Author

Done!

def list(self, request, *args, **kwargs):
    search_form = SearchForm(request.GET, is_es_form=True)
    if search_form.is_valid():
        cd = search_form.cleaned_data
        search_type = cd["type"]
        search_query = self.document_search_classes[search_type].search()
Contributor

I like this approach, but there's potential overlap with the pattern matching block starting at line 276. Both use elements from the document_search_classes dictionary.

Considering this, should we handle a potential KeyError exception here? Line 293 adds a case _ clause to the pattern matching, but a simple dictionary like document_search_classes won't handle unexpected keys.

Contributor Author

Yeah, you're right. There was a potential KeyError for types that are not supported yet, like pa and oa. I've refactored the code and centralized the supported types in a dictionary called supported_search_types, raising the unsupported error earlier. Let me know what you think.
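
Roughly, the idea is a single lookup that fails early with a clear error. The sketch below is illustrative only: the exception class, the function name, and the dictionary values are hypothetical, and the real supported_search_types maps to the project's own objects.

```python
class UnsupportedSearchType(Exception):
    """Hypothetical error for search types the V4 API does not handle yet."""


# Illustrative contents only; "r" as the RECAP key is an assumption.
supported_search_types = {
    "r": "RECAP",
    "o": "Opinions",
}


def resolve_search_type(search_type: str) -> str:
    """Look up a search type, raising a clear error instead of a KeyError."""
    try:
        return supported_search_types[search_type]
    except KeyError:
        raise UnsupportedSearchType(
            f"Search type '{search_type}' is not supported yet."
        ) from None
```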

Contributor

@ERosendo ERosendo left a comment

LGTM. 👍 We can merge this code after addressing the comments.

@albertisfu
Contributor Author

Thanks, @ERosendo! I've applied your suggestions.
Also, I've converted list fields in Opinions and RECAP to NoneToListField and added tests to handle the bug related to ES DSL partial updates.

@mlissner
Member

Sounds like consensus. Merging!

@mlissner mlissner merged commit 56521b0 into main May 16, 2024
13 checks passed
@mlissner mlissner deleted the 3033-develop-v4-opinions-search-api branch May 16, 2024 15:56