Opinions removed from the database were not removed from search engines #3967

Open
albertisfu opened this issue Apr 12, 2024 · 5 comments

Comments

@albertisfu
Contributor

After completing #3897 and checking the ES Opinions Search, I discovered something unusual: some OpinionClusters appear in the results, but clicking on them triggers a 404 error. I confirmed that these clusters have been removed from the database.

Steps to reproduce:

  • Log in to CourtListener.
  • Do a clean query: https://www.courtlistener.com/?q=
  • Most of the results on pages 1-7 are from OpinionClusters that have been removed from the database.

Here are some examples:

  • These clusters are also still indexed in the Solr version of the Opinions Search: if you access the Opinion Search as an anonymous user and search for these cluster_ids, they are returned there as well.

McGuire v. Third Avenue Railroad (N.Y. App. Div. 1896)

https://www.courtlistener.com/?q=cluster_id%3A5348161&type=o&order_by=score desc&stat_Precedential=on

People v. Jordan (N.Y. App. Div. 2010)

https://www.courtlistener.com/?q=cluster_id%3A5947072&type=o&order_by=score desc&stat_Precedential=on

Goldin v. Kelly (N.Y. App. Div. 2010)

https://www.courtlistener.com/?q=cluster_id%3A5947192&type=o&order_by=score desc&stat_Precedential=on

Harrison v. Bezio (N.Y. App. Div. 2010)

https://www.courtlistener.com/?q=cluster_id%3A5948279&type=o&order_by=score desc&stat_Precedential=on

In re Clor (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A6012898&type=o&order_by=score desc&stat_Precedential=on

People v. Johns (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A6013129&type=o&order_by=score desc&stat_Precedential=on

People v. McCrae (N.Y. App. Div. 2011)

https://www.courtlistener.com/?q=cluster_id%3A5970993&type=o&order_by=score desc&stat_Precedential=on

People v. Russ (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A5990821&type=o&order_by=score desc&stat_Precedential=on

People v. Badman (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A5991871&type=o&order_by=score desc&stat_Precedential=on

In re Foley (N.Y. App. Div. 1998)

https://www.courtlistener.com/?q=cluster_id%3A6161808&type=o&order_by=score desc&stat_Precedential=on

People v. Jones (N.Y. App. Div. 1998)

https://www.courtlistener.com/?q=cluster_id%3A6163163&type=o&order_by=score desc&stat_Precedential=on

People v. Healey (N.Y. App. Div. 2000)

https://www.courtlistener.com/?q=cluster_id%3A6181359&type=o&order_by=score desc&stat_Precedential=on

In re Merante (N.Y. App. Div. 2015)

https://www.courtlistener.com/?q=cluster_id%3A6184542&type=o&order_by=score desc&stat_Precedential=on

  • I checked these clusters in the dev DB, and they're still there.
  • It seems they were removed after January 15, 2024, the date when the initial Opinion index was completed.
  • The method used to remove these clusters could be why the removal didn't trigger signals or a deletion from Solr.

@mlissner or @flooie Would you know what process was used to remove these clusters from the database? Knowing that would let us identify all the IDs to remove from the Opinion Index and adjust the deletion method so it triggers an automatic removal from the index next time.

@mlissner
Member

Oof! Why do these come up first in the search results? Any idea?

I don't remember why we removed content around Jan. 15th, but maybe Bill does, or maybe we can check our Slack/Github/Email logs around then?

Is it possible that a queryset.objects.delete() wouldn't trigger signals?

@albertisfu
Contributor Author

> Oof! Why do these come up first in the search results? Any idea?

Well, that query only filters by the default status, Published, so the results don't have scores. I believe the ordering has more to do with the order in which documents are matched in segments/shards. It could be a weird coincidence that these are shown first in the results, or there could be many deleted clusters spread randomly throughout the index.

> I don't remember why we removed content around Jan. 15th, but maybe Bill does, or maybe we can check our Slack/Github/Email logs around then?

Yeah, it could have been anytime from January 15th until now. I reviewed the code to look for methods that remove clusters from the DB, but I didn't find anything. I'm wondering if that could have been done directly at the DB level?

> Is it possible that a queryset.objects.delete() wouldn't trigger signals?

I just confirmed that using a queryset like:
OpinionCluster.objects.filter(pk__in=[20,19]).delete()

It does trigger signals correctly.

Deleting a single instance like this also triggers signals:

opinion = OpinionCluster.objects.get(pk=18)
opinion.delete()

@mlissner
Member

> could have been done directly at the DB level?

It's...possible, but extremely unlikely. I almost never delete with SQL, because it freaks me out. Too much power and not enough language support.

It sounds like we won't know the cause. Is there a way to fix this? I guess we'll have to check all of the millions of items in the index to see if they're in the DB?

@albertisfu
Contributor Author

> It sounds like we won't know the cause. Is there a way to fix this? I guess we'll have to check all of the millions of items in the index to see if they're in the DB?

Yeah, that's the way to fix it. We can pull IDs from the index in batches of ~1000 items or so to avoid making too many requests. Then, in Django, filter by those IDs, also in batches, check which ones were not found, and remove them from the index.
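
The batched check described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: `batched`, `find_orphaned_ids`, and `fetch_existing` are hypothetical names, and the database lookup is replaced by an in-memory set so the snippet runs on its own. In a real run, `fetch_existing` would presumably wrap something like `OpinionCluster.objects.filter(pk__in=batch).values_list("pk", flat=True)`, and the indexed IDs would be streamed from the search index.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Set


def batched(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` IDs."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def find_orphaned_ids(
    indexed_ids: Iterable[int],
    fetch_existing: Callable[[List[int]], Set[int]],
    batch_size: int = 1000,
) -> Iterator[int]:
    """Yield IDs that are present in the index but missing from the database.

    `fetch_existing` (hypothetical) receives one batch of IDs and returns
    the subset that still exists in the database, so each batch costs a
    single lookup.
    """
    for batch in batched(indexed_ids, batch_size):
        existing = fetch_existing(batch)
        for cluster_id in batch:
            if cluster_id not in existing:
                yield cluster_id


# Stand-in for the database: only these cluster IDs still exist.
db_ids = {18, 19, 20}


def fake_fetch_existing(batch: List[int]) -> Set[int]:
    """Assumed replacement for a Django `pk__in` query."""
    return db_ids.intersection(batch)


# Stand-in for IDs streamed from the index; two of them are stale.
indexed = [18, 19, 20, 5348161, 5947072]
orphans = list(find_orphaned_ids(indexed, fake_fetch_existing, batch_size=2))
print(orphans)  # [5348161, 5947072]
```

The orphaned IDs yielded here would then be passed, again in batches, to whatever removes documents from the index. Keeping the lookup behind a callable makes the comparison logic testable without a database or a search cluster.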

@mlissner
Member

Bleh. That sounds unpleasant, but we'd better do it. Let's set this as down the road though, because I want to get to alerts as soon as possible and this isn't particularly harmful to users.

Labels
None yet
Projects
Status: Later / Optimizations
Development

No branches or pull requests

2 participants