-
-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
3033 Introduced V4 Opinion Search API #4007
Conversation
Semgrep found 4
QuerySet.extra' does not provide safeguards against SQL injection and requires very careful use. SQL injection can lead to critical data being stolen by attackers. Instead of using '.extra', use the Django ORM and parameterized queries such as |
- Also merge the snippet content from DB when highlighting is disabled in the API request. - Included more V4 Opinions Search API
6dc94bb
to
21ecfd8
Compare
This will always be limited to a few different opinions, so I'd say that both the API and the front end should always show all of them. If you set it to 20, items, that'd surely be enough. The rest sounds perfect! |
while working on this, I got one question: Do you mean a cluster should always show all their opinions (up to 20) regardless of whether they were matched by the search query? Or should it should show only the opinions that matched a query (up to 20)? Currently, only positions that match are displayed in the fronted. If users perform a match-all query or query by cluster fields, the cluster will show "all" the opinions up to 5. |
Ideally, only the opinions that match should show in the results. If a cluster matches (but the opinion doesn't), then showing all or none of the sub-opinions seems fine. Probably best in that case to not show any opinion at all. |
Yeah, this is how it currently works.
Well, due to the cluster fields (except for So, I believe the remaining option is to display all the sub-opinions when the query involves only cluster fields (this will happen automatically) or a match-all query. |
Displaying all the subopinions in that case is fine too! |
…to 20 in search results.
Great, I've set the limit for sub-opinions to 20. Hoping this limit is enough to display all the possible sub-opinions when they all match in a query. This applies to both the frontend and the API. |
Looks like we have some conflicts here, @alberto. Want to get them cleaned up, and then I think we're good to have Eduardo review, right? |
Sure, I've resolved the conflicts and added the |
cl/lib/elasticsearch_utils.py
Outdated
and cd["type"] | ||
in [ | ||
SEARCH_TYPES.RECAP, | ||
SEARCH_TYPES.DOCKETS, | ||
SEARCH_TYPES.RECAP_DOCUMENT, | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's refactor this code to store the membership
check in a boolean variable. We can call it is_recap_search
and reuse it in both if statements.
def build_sort_results:
...
is_recap_search = cd["type"] in [
SEARCH_TYPES.RECAP,
SEARCH_TYPES.DOCKETS,
SEARCH_TYPES.RECAP_DOCUMENT,
]
if api_version == "v4" and is_recap_search:
...
if (
toggle_sorting
and api_version == "v4"
and is_recap_search
):
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I've applied the suggestion and named the variable: require_v4_function_score
since it also includes PEOPLE in #4021
cl/lib/elasticsearch_utils.py
Outdated
.annotate( | ||
text_to_show=Case( | ||
When( | ||
~QObject(html_columbia__exact=""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks I've removed __exact
from the lookups
cl/lib/elasticsearch_utils.py
Outdated
if search_type not in [SEARCH_TYPES.RECAP, SEARCH_TYPES.DOCKETS]: | ||
return frontend_hits_limit, query_hits_limit | ||
return display_hits_limit, query_hits_limit | ||
|
||
if search_type == SEARCH_TYPES.DOCKETS: | ||
frontend_hits_limit = 1 | ||
display_hits_limit = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should refactor these if statements into the pattern matching block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
cl/search/api_views.py
Outdated
def list(self, request, *args, **kwargs): | ||
search_form = SearchForm(request.GET, is_es_form=True) | ||
if search_form.is_valid(): | ||
cd = search_form.cleaned_data | ||
search_type = cd["type"] | ||
search_query = self.document_search_classes[search_type].search() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach, but there's potential overlap with the pattern matching block starting at line 276
. Both use elements from the document_search_classes
dictionary.
Considering this, should we handle a potential KeyError
exception here? Line 293
adds a case _
clause to the pattern matching, but a simple dictionary like document_search_classes
won't handle unexpected keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right. There was a potential KeyError
for types that are not supported yet, like pa
and oa
. I've refactored the code and centralized the supported types in a dictionary called supported_search_types
, raising the unsupported error earlier. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. 👍 We can merge this code after addressing the comments
thanks! @ERosendo I've applied your suggestions. |
Sounds like consensus. Merging! |
This PR adds the Opinions
o
search type to the V4 Search API.It includes the same new V4 features as the RECAP search type described in #3975, with some variations detailed below.
The results structure is as follows:
At the first level, OpinionCluster fields are displayed. Within the
opinions
key, Opinions matching the query are shown. Up to 5 matched nested opinions are displayed per result; this setting is defined byCHILD_HITS_PER_RESULT
.In the frontend, we don't have a button to display if more than 5 opinions are matched by the query.
Therefore, my question is whether a
more_docs
field, similar to the one in the V4 RECAP Search API, is necessary when there are more than 5 Opinions matched?Perhaps it doesn't make sense since we don't have an
op
type that users could use to query allOpinions
matched by a query.Count
It only has the
count
that matchesOpinionCluster
and also relies on the cardinality query to get the approximate count when hits exceed 10,000.Sorting
The supported sorting keys for Opinions are the same as those in the frontend:
To support cursor pagination, the secondary sorting key is
cluster_id desc
.One difference to note regarding sorting from the RECAP search type is that in Opinions,
dateFiled
andciteCount
do not require the use of a custom function score as a workaround for score computation and search_after on None values. This is becausedate_filed
is a mandatory field in theOpinionCluster
model andcitation_count
defaults to0
in the model.Thus, sorting directly relies on the values returned by ES, avoiding the use of the custom function score.
Highlighting
As in the RECAP search type, highlighting is disabled by default and can be enabled by passing
highlight=on
.The supported HL fields are the same as in the frontend:
When highlighting is disabled, the snippet is retrieved from the DB similar to RECAP. However, for Opinions, it is a bit more complex, as the text field during indexing can be filled with different values according to their availability and prioritization, as follows:
So the same prioritization is used within the
merge_unavailable_fields_on_parent_document
method to extract the snippet from the DB, up toNO_MATCH_HL_SIZE
characters. It uses a single query per page relying onCase
When
queries' conditional expressions.