
Opinion / Opinion Cluster Webhook #3779

Open · wants to merge 22 commits into base: main

Conversation

@jordanparker6 commented Feb 14, 2024

Adding webhooks to allow users to subscribe to updates to the Opinion and OpinionCluster entities.

  1. Webhook Events have been created for CREATE, UPDATE and DELETE events.
  2. Functions to send the webhook event have been created for each new event.
  3. The create, delete, and update webhook triggers have been added to the Opinion and OpinionCluster models' methods so that they fire on save/delete calls.

Notes:

  1. Should we look at adding the event firing at the database layer instead with something like django-pgtrigger?
  2. Are there any updates to the UI that need to be made?
  3. There is no delete method on the Opinion model? Is this intended?

sentry-io bot commented Feb 14, 2024

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/search/models.py

Function: save
Unhandled Issue: ValidationError: 'docket_number' cannot be Null or empty in RECAP dockets. ...
Event Count: 12


@CLAassistant commented Feb 14, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ jordanparker6
❌ dwelsh-clouru
You have signed the CLA already but the status is still pending? Let us recheck it.

@mlissner (Member)

Jordan, to make things explicit, we usually don't do reviews until a review has been requested, but I suspect you don't have the rights to do that, so before we start on this, can I just ask: Is this ready or a draft?

@jordanparker6 (Author)

Jordan, to make things explicit, we usually don't do reviews until a review has been requested, but I suspect you don't have the rights to do that, so before we start on this, can I just ask: Is this ready or a draft?

Feedback on the notes may change this PR.

Besides that, the PR is ready for review.

@albertisfu (Contributor)

Thanks @jordanparker6 for the work on this. Here is some feedback:

Triggering Webhooks

Your current webhook triggering logic resides within custom save and delete model methods. Here are some thoughts about why we should use post_save or post_delete signals instead:

  • Soon, we will remove the Solr indexing tasks from these model methods, so it would be better to keep the save/delete methods limited to logic that directly affects the model data.
  • As you have noticed, we do not have a custom delete method for Opinion because we do not currently need it. Using a post_delete signal will avoid adding this method to Opinion.
  • Your code uses update_fields in the save method, but unfortunately that argument is usually empty unless the caller explicitly passes it, so we can't count on it. Instead, we should use the field_tracker if we want to include only the fields that change on updates. Also, to detect whether an instance is newly created or just updated, we can use the created argument of the post_save signal.
  • We shouldn't trigger webhooks synchronously during the save/delete process, since that would slow down the operation and is also vulnerable to errors: if the webhook request fails for some reason, it would prevent the action from completing, so the instance wouldn't be saved/updated/deleted. This can also happen with signals, so we should decouple webhook sending from the DB operations; using a Celery task seems the way to go here (see the sketch after this list). Alternatively, doing the batching at this stage and only adding the event to a queue, then sending the webhook later, would also solve the problem.
  • The current implementation of send_opinion_created_webhook and send_opinion_cluster_created_webhook is failing due to the use of OpinionClusterSerializer and OpinionSerializer. These serializers use HyperlinkedRelatedField, which requires a request object that is not available when sending webhooks, so we might need to create new serializers for the Opinion and OpinionCluster webhooks that exclude HyperlinkedRelatedField.
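
To illustrate, here is a minimal sketch of that signal-plus-Celery wiring. The @receiver registration, module paths, and the send_opinion_deleted_webhook task name are assumptions; only send_opinion_created_webhook is named elsewhere in this PR:

from django.db.models.signals import post_delete, post_save
from django.dispatch import receiver

from cl.search.models import Opinion

# Task import path is an assumption based on this PR's layout.
from cl.api.tasks import send_opinion_created_webhook, send_opinion_deleted_webhook


@receiver(post_save, sender=Opinion)
def opinion_saved(sender, instance: Opinion, created: bool, **kwargs) -> None:
    # Decouple webhook delivery from the DB write: hand off to Celery and
    # pass only the ID so the task can re-fetch the freshest data.
    if created:
        send_opinion_created_webhook.delay(instance.pk)


@receiver(post_delete, sender=Opinion)
def opinion_deleted(sender, instance: Opinion, **kwargs) -> None:
    # After post_delete the row is gone, so the ID is all we can send.
    send_opinion_deleted_webhook.delay(instance.pk)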

Frontend

After you create your webhook in /profile/webhooks/ you can see a “Test” button that users can use to simulate a webhook event for testing purposes.
[Screenshot: the “Test” button on the /profile/webhooks/ page]

So, we’d need to add those event templates for each new webhook. The view that handles it is cl.users.api_views.WebhooksViewSet.test_webhook.

Tests

If you prefer, once this is completed I'm willing to help add the required tests for these new webhooks, to ensure each webhook contains the payload we expect and is properly triggered.

@jordanparker6 (Author)

@albertisfu Thanks for the feedback.

I have done the following:

  1. I have moved the webhook triggers to the Django post_save and post_delete signals.
  2. I have moved the webhook functions from cl.api.webhooks to cl.api.tasks, where I have wrapped them in a Celery task to offload the execution of all the events to Celery workers. Each task execution fires one event to all subscribers. I wonder if batching of events is actually needed given the Celery implementation.
  3. I have created an OpinionSerializerOffline and an OpinionClusterSerializerOffline that implement serialisation without the request context (a minimal sketch of the idea follows this list). Please review: there are some key decisions on what is and isn't included. There is also a circular import with the PersonSerializer. I think the Person data would be nice to include, to avoid clients having to send requests to the People API, but some refactoring may be needed to make it work.
  4. I have created the template files for the test webhook functions. These need to be updated with real test data, and I don't have my hands on any at the moment.
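
For reference, a minimal sketch of the request-free serializer idea: a plain DRF ModelSerializer already represents relations by primary key rather than hyperlink, so no request context is needed. Field choices and the module path here are assumptions, not the final design:

from rest_framework import serializers

from cl.search.models import OpinionCluster


class OpinionClusterSerializerOffline(serializers.ModelSerializer):
    """Serializer usable outside a request/response cycle."""

    # Expose the docket by ID; note that fields = "__all__" also adds a
    # `docket` primary-key field, so one of the two should eventually be dropped.
    docket_id = serializers.ReadOnlyField()
    panel = serializers.PrimaryKeyRelatedField(many=True, read_only=True)
    non_participating_judges = serializers.PrimaryKeyRelatedField(
        many=True, read_only=True
    )

    class Meta:
        model = OpinionCluster
        fields = "__all__"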

Things that need to be done / need clarification.

I am yet to implement the update webhook for the OpinionCluster and Opinion models. I can add it to the post_save signal where created=False; however, I am unsure how the FieldTracker comes into it and whether it should be mounted on the model or not. Thoughts?

I can't seem to run tests locally as the tests are linked to someone's personal files ("/home/elliott/freelawmachine/flp/columbia_data/opinions"). Any tips for getting this running locally so I can test the UI?

@albertisfu (Contributor) left a comment

Thanks @jordanparker6 this is getting closer!

I've left some comments in the code.

And following up on your comments.

I wonder if batching of events is actually needed given the celery implementation.

I believe we can benefit from batching in situations where we add or update thousands of Opinions/OpinionClusters at a time, as we typically do, to avoid triggering an equivalent number of tasks/webhooks. On the other hand, batching could slow down the process of sending webhooks if our ingestion rate is not too high during a given period.

This approach would necessitate additional logic to send the accumulated webhooks even if the batch size has not been reached after a certain threshold time.
Do you have any thoughts about this? @mlissner

I have created an OpinionSerializerOffline and an OpinionClusterSerializerOffline that implement serialisation without the request context. Please review: there are some key decisions on what is and isn't included.

Mike, do you have any suggestions on which fields should be included in the webhook/client context?

  • Should we add the same fields as in the API?
  • Should we directly include nested fields of related instances like the Docket? If so, which fields from Docket would be worthwhile to add directly?
  • Jordan mentioned that it would be nice to include the Person data nested within every Person related field. To do this, we would need to create a new serializer for Person since PersonSerializer also uses HyperlinkedRelatedField for some fields. Therefore, we need to define which fields are worth including in this new serializer.

I have created the template files for the test webhook functions. These need to be updated with real test data, and I don't have my hands on any at the moment.

Great, thank you. I can update these templates with a real test example once we have defined the structure of the serializers.

I am yet to implement the update webhook for the OpinionCluster and Opinion models. I can add it to the post_save signal where created=False; however, I am unsure how the FieldTracker comes into it and whether it should be mounted on the model or not. Thoughts?

Yes, as I remember, we wanted the field_tracker to send a payload only with the fields that changed during an update. This approach has some benefits:

  • It allows for smaller payloads by only sending the fields that changed.
  • It simplifies the logic for the client to determine which fields changed and take appropriate action.

We could also send the same payload as the created webhook and not use the field_tracker. Do you have an opinion on this, Mike?

In a follow-up comment, I'm posting the process we could follow to use the field tracker in case we decide to go with that approach.

I can't seem to run tests locally as the tests are linked to someone's personal files ("/home/elliott/freelawmachine/flp/columbia_data/opinions"). Any tips for getting this running locally so I can test the UI?

I think you're looking to generate some OpinionClusters/Opinions? I did see that the file you mentioned is linked to html_test.py, which seems like a command to import opinions.

If that's the case, instead of using that importer, you could generate fake Opinions/OpinionClusters using the make_dev_data command:

To generate 1 OpinionWithParentsFactory, use the following command:
manage.py make_dev_data --make-objects 103 --count 1

To generate 1 OpinionClusterWithParentsFactory, use the following command:
manage.py make_dev_data --make-objects 102 --count 1

You can also generate some instances using the admin panel in case you want to create Opinions/OpinionClusters with specific data.

If you're interested in real data, maybe Mike can suggest which command you can use to import some real opinions.

judges = PersonSerializer(many=True, read_only=True)
docket_id = serializers.ReadOnlyField()
# To Delete: CONFIRM THIS
# docket = serializers.PrimaryKeyRelatedField(

By default, when using fields = "__all__", a docket field is generated that shows the related docket ID. Therefore, we need to decide which name makes more sense: it can either be docket_id or docket.

    Send a webhook to the webapp when an opinion is created.
    """
    if created:
        return send_opinion_created_webhook(instance)

This works, but it doesn't process the method asynchronously as a Celery task.

It should be invoked as:
send_opinion_created_webhook.delay(instance.id)

Additionally, we prefer to only send the instance ID to the task and retrieve the instance within the task for two reasons: it's more lightweight, and in case the instance is modified in the meantime, the data to be sent is the most recent one from the database.

Regarding the delete webhooks, I notice that you're sending a list of IDs. I assume this is in preparation for sending multiple IDs when we implement batching? If so, would it be a good idea to also prepare the creation/update methods to support batching in the future? This way, the JSON structure would be ready to support multiple instances.

For instance, for OpinionCluster we can have an array like:

{
   "payload":{
      "clusters":[
         {
            
         }
      ]
   }
}

And something similar for Opinions too.
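
To make that concrete, a rough sketch of the task side under these conventions: it receives only the ID, re-fetches the instance, and wraps the serialized data in the batch-ready structure suggested above. The serializer import path is an assumption, and the actual delivery step is elided:

from celery import shared_task

from cl.search.models import OpinionCluster
# Import path for this PR's offline serializer is assumed here.
from cl.search.api_serializers import OpinionClusterSerializerOffline


@shared_task
def send_opinion_cluster_created_webhook(cluster_id: int) -> None:
    """Re-fetch the cluster so the payload reflects the latest DB state."""
    cluster = OpinionCluster.objects.filter(pk=cluster_id).first()
    if cluster is None:
        # Deleted before the task ran; nothing to send.
        return
    post_content = {
        "payload": {
            "clusters": [OpinionClusterSerializerOffline(cluster).data],
        },
    }
    # Delivery of post_content to each subscribed endpoint is elided here.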

Comment on lines +1 to +52
{
"payload":{
"docket_id": 1,
"panel": [
{ }
] ,
"non_participating_judges": [
{}
],
"judges": "J Gallegar",
"date_filed": "",
"date_filed_is_approximate": True,
"slug": "",
"case_name_short": "",
"case_name": "",
"case_name_full": "",
"scdb_id": "",
"scdb_decision_direction": "",
"scdb_votes_majority": "",
"scdb_votes_majority": "",
"scdb_votes_minority": "",
"source": "",
"procedural_history": "",
"attorneys": "",
"nature_of_suit": "",
"posture": "",
"syllabus": "",
"headnotes": "",
"summary": "",
"disposition": "",
"history": "",
"other_dates": "",
"cross_reference": "",
"correction": "",
"citations": [
{}
]
"citation_count": "",
"precedential_status": "",
"date_blocked": "",
"blocked": "",
"filepath_json_harvard": "",
"arguments": "",
"headmatter": "",
}
"webhook":{
"version":1,
"event_type":3,
"date_created":"2024-01-06T14:21:40.855097-07:00",
"deprecation_date":null
}
}

Suggested change
{
"payload":{
"docket_id":1,
"panel":[
],
"non_participating_judges":[
],
"judges":"J Gallegar",
"date_filed":"",
"date_filed_is_approximate":true,
"slug":"",
"case_name_short":"",
"case_name":"",
"case_name_full":"",
"scdb_id":"",
"scdb_decision_direction":"",
"scdb_votes_majority":"",
"source":"",
"procedural_history":"",
"attorneys":"",
"nature_of_suit":"",
"posture":"",
"syllabus":"",
"headnotes":"",
"summary":"",
"disposition":"",
"history":"",
"other_dates":"",
"cross_reference":"",
"correction":"",
"citations":[
],
"citation_count":"",
"precedential_status":"",
"date_blocked":"",
"blocked":"",
"filepath_json_harvard":"",
"arguments":"",
"headmatter":""
},
"webhook":{
"version":1,
"event_type":8,
"date_created":"2024-01-06T14:21:40.855097-07:00",
"deprecation_date":null
}
}

The webhook dummy templates need some syntax corrections to work, and they could also benefit from an example with more fields filled out so they have more value from a testing perspective. However, I believe these updates should wait until we have figured out the final JSON payload for every webhook event. At that point, I can help in updating them while also adding the test suite for the PR.

@albertisfu (Contributor)

Here you can find more details about how the field tracker works, in case we decide to go with this approach:

If you look at the OpinionCluster or Opinion models, you'll notice something like:

es_o_field_tracker = FieldTracker(
    fields=[
        "docket_id",
        "case_name",
        "case_name_short",
        "case_name_full",
        "date_filed",
        "judges",
        "attorneys",
        "nature_of_suit",
        "attorneys",
        "precedential_status",
        "procedural_history",
        "posture",
        "syllabus",
        "scdb_id",
        "citation_count",
        "slug",
        "source",
    ]
)

These fields are tracked for Elasticsearch indexing purposes, meaning we only update the document in ES when one or more of these fields change, ignoring updates that don't affect a field that's indexed.

So, we'll need to add a new field_tracker to Opinion and OpinionCluster that includes the fields we determine should be included in the serializer:

webhook_tracked_fields = FieldTracker(
    fields=[ fields_to_track... ]
)

And within the signal you can do something like:

def handle_opinion_cluster_created_or_updated_webhook(
    sender,
    instance: OpinionCluster,
    created: bool,
    update_fields=None,
    **kwargs,
):
    """
    Send a webhook when an OpinionCluster is created or updated.
    """
    if created:
        send_opinion_cluster_created_webhook.delay(instance.pk)
        return

    # Send webhook for updates, including only the fields that changed.
    changed_fields = instance.webhook_tracked_fields.changed()
    if changed_fields:
        fields_that_changed = list(changed_fields.keys())
        send_opinion_cluster_updated_webhook.delay(instance.pk, fields_that_changed)

And within send_opinion_cluster_updated_webhook (after re-fetching the instance from the ID it receives), doing something like:

serializer = OpinionClusterSerializerOffline(instance)
# Only include fields that changed.
for field_name in list(serializer.fields.keys()):
    if field_name not in changed_fields:
        serializer.fields.pop(field_name)

post_content = {
    "webhook": generate_webhook_key_content(webhook),
    "payload": serializer.data,
}

@mlissner (Member) commented Mar 1, 2024

Sorry to be a bit slow here. Thank you both for all the careful work here. This is going to be a major new feature.

A few replies:

  1. Re batching, yeah, we do need it, to...

    ...avoid triggering an equivalent number of tasks/webhooks.

    This is also true:

    On the other hand, batching could slow down the process of sending webhooks if our ingestion rate is not too high during a given period.

    I think that's OK though. What I'd suggest is that we send batches of 25 items at a time, and that batches go out after 30s, say, if no more content is being added to them.

    I am a bit worried about celery tasks piling up though, and about our ability to cancel celery tasks if an endpoint is down. @albertisfu, do you recall how that would work? If an endpoint goes down and we have 10,000 items in the celery queue for it, do we handle that properly?

    Finally, however we do this, it should only serialize each item once, even if there are multiple webhook recipients subscribed.

  2. Re serialization formats:

    • Should we add the same fields as in the API?

    Generally, that's best, yes.

    • Should we directly include nested fields of related instances like the Docket? If so, which fields from Docket would be worthwhile to add directly?

    Yeah, it's probably best to include some fields from the docket:

    • case_name
    • date_filed
    • court_id
    • id
    • source
    • docket_number
    • date_blocked
    • blocked

    I've gone back and forth about whether including these in the object is better or if we should do a docket webhook too, but I think this is simpler and will have much less noise.

    • Jordan mentioned that it would be nice to include the Person data nested within every Person related field. To do this, we would need to create a new serializer for Person since PersonSerializer also uses HyperlinkedRelatedField for some fields. Therefore, we need to define which fields are worth including in this new serializer

    I think it's better to punt on this one if we're including the IDs for the Person objects. Doing so will allow webhook subscribers to only crawl Person data when they don't already have it, instead of having us generate it for every event.

  3. Re field trackers, yes, we should just send the changed fields to save on performance and bandwidth.

Once again, thank you guys!
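
For reference, a small sketch of how the docket fields listed above could be nested into the cluster payload; the DocketSubsetSerializer name is hypothetical and only the fields Mike listed are included:

from rest_framework import serializers

from cl.search.models import Docket


class DocketSubsetSerializer(serializers.ModelSerializer):
    """Nested docket representation carrying only the fields listed above."""

    court_id = serializers.ReadOnlyField()

    class Meta:
        model = Docket
        fields = (
            "id",
            "case_name",
            "date_filed",
            "court_id",
            "source",
            "docket_number",
            "date_blocked",
            "blocked",
        )


# In the cluster webhook serializer, the nested docket could then replace
# the bare docket ID:
#     docket = DocketSubsetSerializer(read_only=True)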

@albertisfu (Contributor)

I am a bit worried about celery tasks piling up though, and about our ability to cancel celery tasks if an endpoint is down. @albertisfu, do you recall how that would work? If an endpoint goes down and we have 10,000 items in the celery queue for it, do we handle that properly?

Yeah, the webhook celery tasks are designed to fail if the webhook endpoint is down. The HTTP request has a set timeout of 3 seconds. If the endpoint fails to respond within that time, or there is a ConnectionError, the tasks will fail, the error will be logged, and the webhook will be scheduled for retry according to our retry policy: https://www.courtlistener.com/help/api/webhooks/#retries

We use a custom mechanism to retry webhook events instead of relying on celery tasks.

@mlissner (Member) commented Mar 4, 2024

So in other words, you think that should be fine, Alberto? Sorry I forget the details here...

@albertisfu (Contributor)

Correct, I think we'll be fine. Celery tasks won't pile up if an endpoint goes down. The task will be able to handle that gracefully while finishing sending all the events to other endpoints.

jordanparker6 and others added 4 commits March 13, 2024 09:16
Co-authored-by: Alberto Islas <albertisfu@gmail.com>
@jordanparker6 (Author) commented Mar 13, 2024

@mlissner @albertisfu

For the batching, do you want to use something like this:

https://celery-batches.readthedocs.io/en/latest/

It looks pretty straightforward to implement the time- and size-based batching discussed.

I have updated all the tasks to be ready for batching.

@mlissner (Member) commented Mar 13, 2024

Edited 3/21 to choose PostgreSQL as temporary storage and to note how transactions will matter.


I'm a little hesitant to use that batching solution. I took a quick look at it, and I didn't like how it requires a certain worker configuration, nor how it relies on workers staying up to work properly (ours come and go pretty often thanks to k8s).

But somehow we need to create and send batches of events, so we need some sort of temporary place to store the events and another tool to dispatch them.

For storage I previously noted we could use Redis or the DB and said that Celery was a bad choice. I've firmed up my opinion on this: We should use the DB because it has nearly unlimited storage, and it has transactions that work well. The speed penalty should be minimal vs Redis in this context. Let's use the DB for temporary storage.

For dispatching, I think you can do it this way:

  1. Use a transaction to store the item in the DB and check the number of items in the DB.

  2. If it's the first item in storage, schedule a task to send out all items batch_delay seconds later (whatever length of time we thought was best).

  3. When the task runs, have it grab a max_batch_size of items from storage and send them out.

  4. When that's complete, check how many items are remaining. If there are more, enqueue another task to do another round of max_batch_size items until it is done.

I think this would work nicely. We'd only send tasks every batch_delay seconds, except when the queue gets full, which would trigger a series of max_batch_size events to run.

The compromise is that if we get a lot of updates all at once, the first batch won't trigger until batch_delay seconds have passed, but that's pretty OK, right?

This introduces no new dependencies, is simple to do, and should work well?
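
A minimal sketch of this plan, assuming a hypothetical PendingWebhookEvent holding model and constants standing in for batch_delay and max_batch_size (none of this exists in the codebase yet, and delivery details are elided):

from celery import shared_task
from django.db import models, transaction

BATCH_DELAY_SECONDS = 30  # the "batch_delay" above
MAX_BATCH_SIZE = 25


class PendingWebhookEvent(models.Model):
    """Hypothetical holding table for webhook events awaiting dispatch."""

    event_type = models.SmallIntegerField()
    payload = models.JSONField()
    date_created = models.DateTimeField(auto_now_add=True)


def add_webhook_event_to_queue(event_type: int, payload: dict) -> None:
    """Steps 1 and 2: store the event and, if it is the first item in
    storage, schedule a dispatch run batch_delay seconds later."""
    with transaction.atomic():
        PendingWebhookEvent.objects.create(event_type=event_type, payload=payload)
        is_first_item = PendingWebhookEvent.objects.count() == 1
    if is_first_item:
        dispatch_webhook_batch.apply_async(countdown=BATCH_DELAY_SECONDS)


@shared_task
def dispatch_webhook_batch() -> None:
    """Steps 3 and 4: send up to max_batch_size items, then re-enqueue
    itself if anything is left in storage."""
    batch = list(
        PendingWebhookEvent.objects.order_by("date_created")[:MAX_BATCH_SIZE]
    )
    # Actual delivery of each item's payload via the existing webhook
    # machinery is elided in this sketch.
    PendingWebhookEvent.objects.filter(pk__in=[e.pk for e in batch]).delete()
    if PendingWebhookEvent.objects.exists():
        dispatch_webhook_batch.delay()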

@jordanparker6 (Author)

OK, so I have had a stab at this. I don't have the best testing setup locally yet, so I was just playing around with some mock implementations.

I have created a general add_webhook_event_to_queue function, now used by the signals, that simply writes the new webhook payload to a Redis queue.

The meat and potatoes is consume_webhook_event_batch, a recursive task that does the following:

  • loops until the batch size is reached or the timer has expired
  • if the queue is empty, it delays until the timer expires
  • has an on_finished callback to handle batches that are ready for processing
  • manages a ':status' key/value pair to ensure only one loop is running per queue at a time
  • recursively triggers itself to run once a batch has been processed, or on the beat of the timeout frequency we have set

This task keeps running due to the recursion, and it is fired on startup in the ready method of the Search app's config (is this the best spot?).

One thought is whether we want a task per webhook event or just a global task that does batching with a switch statement for the webhooks.

Do you guys have a good way to test this implementation?

@mlissner (Member)

Thanks, Jordan. I'm going to move this to Alberto's queue so he can take a look at your work.

How are you feeling time wise? More eager to be done or more hanging in there for now?

@jordanparker6 (Author)

@mlissner

haha, I am hanging in!

I am 100% keen to see this through. I am allocating a day / half-day a week to this until it's done.

My time commitment has to stay at about one day a week, as I am trying to ship a bunch of stuff with lawme in preparation for v1.0.0 and I'm a bit understaffed.

This is important for the roadmap of lawme, as the real-time syncing answers a lot of "whataboutism".

@mlissner (Member)

Sounds good. Alberto should have time to take another look at it soon.

@mlissner (Member)

Sorry for the delay in review here, Jordan. Alberto reviewed your code and he and I had some conversations via email about your approach, mine, and his.

Here's the summary:

  1. Your Approach reminds me of an async event loop, but it has two problems. One is solvable with some clever timeouts, but one is inherent in your approach.

    The problem with creating an event loop task like you've done is that it will tie up one of our celery workers more or less forever. Tasks in celery are supposed to be fast, but your approach has tasks that run forever. That will prevent celery from doing other things, and burn resources.

  2. My Approach (above) got around that by relying on the first item in an empty queue to kick things off (with a delay), but Alberto pointed out that my approach puts too much trust in Celery tasks completing successfully. In my approach, if a task fails for whatever reason, the queue will never get emptied and it will fill up forever. That's bad.

  3. Alberto's Approach wins the day by using a simple daemon to do the job. To do this, we create a basic Django management command that holds the loop you envisioned. It checks the queue and sends items if the batch is big enough or a timer has run out. Simple, effective, foolproof. The only downside is it's one more daemon doing stuff, but oh well, so be it.

The other note I made (via an edit to my comment above) is that we should really do this in the DB. The downside is that it's a bit harder on the coding side (we'll need a little model to store stuff), and that it's a bit slower than Redis, but the upsides are that it's more reliable (transactions, ACID guarantees, etc), and it won't run out of space. I realized that if we have a LOT of opinions piling up faster than we can send them out, we'll be glad those aren't in redis piling up and threatening to bring down the whole system.
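
A rough sketch of what such a management command could look like, reusing the hypothetical PendingWebhookEvent model from the sketch above; the batch size, poll interval, and delivery step are placeholders:

import time

from django.core.management.base import BaseCommand
from django.utils import timezone

# Hypothetical model from the earlier sketch; its real home would be an
# app's models.py once the DB-backed queue is implemented.
from cl.api.models import PendingWebhookEvent

MAX_BATCH_SIZE = 25
BATCH_DELAY_SECONDS = 30
POLL_INTERVAL_SECONDS = 5


class Command(BaseCommand):
    help = "Dispatch queued webhook events in batches (long-running daemon)."

    def handle(self, *args, **options):
        while True:
            self.flush_if_ready()
            time.sleep(POLL_INTERVAL_SECONDS)

    def flush_if_ready(self):
        qs = PendingWebhookEvent.objects.order_by("date_created")
        oldest = qs.first()
        if oldest is None:
            return
        waited = (timezone.now() - oldest.date_created).total_seconds()
        if qs.count() < MAX_BATCH_SIZE and waited < BATCH_DELAY_SECONDS:
            # Batch not full yet and the oldest event hasn't waited long
            # enough; check again on the next poll.
            return
        batch = list(qs[:MAX_BATCH_SIZE])
        # Deliver each item's payload via the existing webhook machinery
        # (elided), then drop the sent rows.
        PendingWebhookEvent.objects.filter(pk__in=[e.pk for e in batch]).delete()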

Do you think you can take another swing at this, Jordan, with the changes above?

@jordanparker6 (Author)

@mlissner sounds like a solid brainstorm!

@albertisfu is a Django wizard I see.

Love the feedback! Let me have a stab at the daemon this week.

Once we are happy that the daemon is working, I can have a stab at moving the queue to the DB.

The DB stuff sounds pretty straightforward.

@jordanparker6 (Author)

@albertisfu I have the daemon stuff up. Still lacking a good way to test this though.

Is that along the lines of what you had in mind?

@albertisfu (Contributor)

@albertisfu I have the daemon stuff up. Still lacking a good way to test this though.
Is that along the lines of what you had in mind?

@jordanparker6 Great, thanks! I'll be reviewing the new approach.

And yes, the daemon tests are part of the tests I have in mind to add.
