Management command to import citations from a csv file #3915

Merged
merged 17 commits into from
May 15, 2024

Conversation

quevon24

@quevon24 quevon24 commented Mar 26, 2024

As I mentioned in previous meetings, I think it is easier to match the citations against the datasets in a local environment and generate only a CSV of the new citations to load into the clusters. This should be relatively fast and reuses existing code.

This command loads citations from a CSV file in the following format (no header row; column 1: cluster id, column 2: citation to add):

"2155423","2003 WL 22508842"
"7903720","520 A.2d 234"
"7903715","520 A.2d 233"
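
As a rough illustration of the format above, here is a minimal sketch that parses a headerless two-column file with Python's standard `csv` module. The function name `read_citation_rows` is hypothetical and this is not the code in the PR; it only shows what a reader of this format has to handle (quoted fields, no header, exactly two columns).

```python
import csv
import io


def read_citation_rows(handle):
    """Yield (cluster_id, citation) pairs from a headerless two-column CSV.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    # skipinitialspace=True tolerates a space between the comma and the
    # opening quote of the second column.
    reader = csv.reader(handle, skipinitialspace=True)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(row)}"
            )
        cluster_id, citation = row
        yield int(cluster_id), citation.strip()


# The three sample rows from the description, including one with a space
# after the comma.
sample = io.StringIO(
    '"2155423", "2003 WL 22508842"\n'
    '"7903720","520 A.2d 234"\n'
    '"7903715","520 A.2d 233"\n'
)
rows = list(read_citation_rows(sample))
```

Converting the first column to `int` early means a malformed cluster id fails while reading the file rather than later, when the citation is being attached to a cluster.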

Here is a sample file to test the command:
updated_sample.csv

We need to place the file at cl/assets/media/updated_sample.csv, clone the clusters, and then run the command.

How to run the command using the csv file:

docker exec -it cl-django python /opt/courtlistener/manage.py clone_from_cl --type search.OpinionCluster --id 1904175 7903720 7903715 7903719 7903933
docker exec -it cl-django python /opt/courtlistener/manage.py import_citations_csv --csv /opt/courtlistener/cl/assets/media/updated_sample.csv

to be used with westlaw, lexis or any other dataset, it is only necessary to follow the csv format
@quevon24 quevon24 requested a review from flooie March 26, 2024 17:14
quevon24 and others added 4 commits March 26, 2024 11:59
This was referenced Mar 26, 2024
@quevon24

quevon24 commented May 2, 2024

@mlissner

Westlaw

7,206,737 rows with parallel citation
1,063,191 rows without parallel citation

From the 7,206,737 rows:

Citations to add to CL: 2,133,707 (matched using one of the row's citations, same filed date, same court, and case name similarity >= 90%)
Both citations exist: 1,643,083 rows
Can’t match using any of the citations: 981,403 rows
Possible matches: 2,448,544 rows (the dataset data does not allow a match precise enough to reach at least 90% similarity; this number includes incorrect matches)
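
The matching criteria listed above (a shared citation, same filed date, same court, case name similarity >= 90%) can be sketched as a single predicate. This is only an illustration under stated assumptions: the comment does not say which similarity measure was used, so `difflib.SequenceMatcher` stands in for it here, and `CaseRecord`, `is_confident_match` and `citations_to_add` are hypothetical names. The "1987 WL 12345" citation, the date and the court id are made-up example values.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical container for either a dataset row or a CL cluster."""
    case_name: str
    date_filed: date
    court_id: str
    citations: frozenset


def name_similarity(a: str, b: str) -> float:
    # Assumption: SequenceMatcher is a stand-in; the real measure is not
    # stated in the comment.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_confident_match(row, cluster, threshold=0.90):
    """True only when all four criteria hold."""
    shares_citation = bool(row.citations & cluster.citations)
    return (
        shares_citation
        and row.date_filed == cluster.date_filed
        and row.court_id == cluster.court_id
        and name_similarity(row.case_name, cluster.case_name) >= threshold
    )


def citations_to_add(row, cluster):
    """Citations present in the dataset row but missing from the cluster."""
    return sorted(row.citations - cluster.citations)


# Made-up example values for illustration only.
row = CaseRecord(
    "Smith v. Jones", date(1987, 1, 5), "md",
    frozenset({"520 A.2d 234", "1987 WL 12345"}),
)
cluster = CaseRecord(
    "Smith v. Jones", date(1987, 1, 5), "md",
    frozenset({"520 A.2d 234"}),
)
if is_confident_match(row, cluster):
    new_citations = citations_to_add(row, cluster)
```

Rows that share a citation but fall below the name threshold would land in the "possible matches" bucket rather than being loaded.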

Of the 2,133,707 citations to be added, 1,020,864 are from Westlaw and the remaining 1,112,843 are from other reporters.
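
As a quick sanity check on the Westlaw figures, the short sketch below re-adds the numbers reported in this comment. It assumes the four categories are mutually exclusive; the variable names are illustrative only.

```python
# Figures copied from the comment above.
rows_with_parallel_citation = 7_206_737

breakdown = {
    "citations_to_add": 2_133_707,
    "both_citations_exist": 1_643_083,
    "no_match": 981_403,
    "possible_matches": 2_448_544,
}

# The four categories account for every row with a parallel citation.
total = sum(breakdown.values())

# Split of the citations to add by source.
from_westlaw = 1_020_864
from_other_reporters = breakdown["citations_to_add"] - from_westlaw

print(total == rows_with_parallel_citation)
print(f"{from_other_reporters:,}")
```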

For now, the 1,063,191 rows without a parallel citation were ignored. Matching them would have to rely on case data alone (case name, filed date, docket number and court), followed by a check of whether the citation we have is already in the system, so the precision has to be higher to ensure a correct match.

Lexis

In total there are 14,194,271 rows in the dataset, the number of citations in each row may vary.

288,117 rows don’t have a case name
167,463 rows don’t have citations
10,918,806 rows are valid (more than one citation; we need at least one citation to match the row to a cluster, and we add the rest if they are not already there)
~25M citations in the 10,918,806 rows; if each row has at least two citations and we use one of them to find the matching cluster, that leaves approximately 12.5M citations available to add

4,467,182 rows matched using one citation, same filed date, same court id and case name matched >= 90%
6,608,508 citations to add to CL from ~4.4M rows
3,367,005 rows that we couldn’t match
3,084,619 rows with possible matches (not enough precision in the match or bad matches)
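
To put the Lexis figures in proportion, the sketch below computes each outcome's share of the valid rows and the average number of new citations per matched row. It uses only the numbers reported above and assumes the three outcomes do not overlap.

```python
# Figures copied from the comment above.
valid_rows = 10_918_806
outcomes = {
    "matched": 4_467_182,
    "unmatched": 3_367_005,
    "possible_matches": 3_084_619,
}
citations_to_add = 6_608_508

# The three outcomes cover all valid rows.
covered = sum(outcomes.values())

# Share of valid rows per outcome, as fractions of 1.
shares = {name: count / valid_rows for name, count in outcomes.items()}

# Average number of new citations contributed by each matched row.
citations_per_matched_row = citations_to_add / outcomes["matched"]

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"new citations per matched row: {citations_per_matched_row:.2f}")
```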

Update 05/02/2024: ~440K new citations to add using filed date, court id and case name similarity >= 90%. This process is more exhaustive because court id and filed date alone can return more than 100 possible cases for a single dataset row. We could get more citations, but that will take some time; the same approach could also be applied to rows that only have one citation.
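
The citation-free lookup described in this update can be sketched as follows: group clusters by court id and filed date, then score every candidate's case name against the dataset row. This is a sketch under assumptions, not the actual process: `difflib.SequenceMatcher` stands in for the unspecified similarity measure, the dictionary keys and function names are hypothetical, and rejecting rows where more than one candidate clears the threshold is a design choice added here for safety.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_index(clusters):
    """Group clusters by (court_id, date_filed) for fast candidate lookup."""
    index = defaultdict(list)
    for cluster in clusters:
        index[(cluster["court_id"], cluster["date_filed"])].append(cluster)
    return index


def find_unique_match(row, index, threshold=0.90):
    """Return the single candidate whose name clears the threshold, else None."""
    candidates = index.get((row["court_id"], row["date_filed"]), [])
    hits = [
        c for c in candidates
        if SequenceMatcher(
            None, row["case_name"].lower(), c["case_name"].lower()
        ).ratio() >= threshold
    ]
    # Assumption: an ambiguous result (several hits) is treated as no match.
    return hits[0] if len(hits) == 1 else None


# Made-up example data for illustration only.
clusters = [
    {"id": 1, "court_id": "md", "date_filed": "1987-01-05", "case_name": "Smith v. Jones"},
    {"id": 2, "court_id": "md", "date_filed": "1987-01-05", "case_name": "State v. Brown"},
    {"id": 3, "court_id": "pa", "date_filed": "1987-01-05", "case_name": "Smith v. Jones"},
]
index = build_index(clusters)
row = {"court_id": "md", "date_filed": "1987-01-05", "case_name": "SMITH v. JONES"}
match = find_unique_match(row, index)
```

Because a busy court can file many cases on the same day, every candidate in the bucket has to be scored, which is why this pass is slower than matching on a citation first.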

~8M new citations in total from dataset

@quevon24

quevon24 commented May 8, 2024

@grossir could you please take a look at this PR?

@quevon24

@grossir I updated the code to use skiprows and nrows to set the start and end rows and to limit the number of rows to process.

I removed the header in the csv and updated the sample file to test the command
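
To illustrate the row window mentioned above, here is a minimal standard-library sketch of what skipping the first rows and capping the number of rows means for a headerless file. It emulates the effect of the `skiprows` and `nrows` options of pandas' `read_csv` with `itertools.islice`; the function name `read_window` and its parameters are hypothetical and do not reflect the command's real argument names.

```python
import csv
import io
from itertools import islice


def read_window(handle, start_row=0, limit=None):
    """Return up to `limit` rows, skipping the first `start_row` rows.

    Hypothetical helper: start_row plays the role of skiprows and
    limit the role of nrows.
    """
    reader = csv.reader(handle, skipinitialspace=True)
    stop = None if limit is None else start_row + limit
    return list(islice(reader, start_row, stop))


# Made-up five-row file for illustration only.
data = (
    '"1","1 A.2d 1"\n'
    '"2","2 A.2d 2"\n'
    '"3","3 A.2d 3"\n'
    '"4","4 A.2d 4"\n'
    '"5","5 A.2d 5"\n'
)

# Skip the first row, then process at most two rows.
window = read_window(io.StringIO(data), start_row=1, limit=2)
```

A window like this lets a large file be processed in batches, or resumed from a known row after a failure, without splitting the file by hand.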

@quevon24 quevon24 requested a review from grossir May 13, 2024 16:52

@grossir grossir left a comment


I tested this and it works fine. I think it is ready to merge.

I just left some minor comments about the arguments and about a pylint complaint

@quevon24 quevon24 requested a review from grossir May 15, 2024 17:44

@grossir grossir left a comment


Feel free to merge this PR!

@quevon24 quevon24 merged commit 0a33b82 into main May 15, 2024
13 checks passed
@quevon24 quevon24 deleted the import-citations-csv branch May 15, 2024 18:34
@mlissner

Nice one! For the record, a zillion people are asking for this right now. Very nice milestone, Kevin!

@quevon24

About this, @blancoramiro is in charge of running it, right? That way I can send him the files and we can run it when you give the go-ahead.

@mlissner

Yeah. I want to get Bill's eyes on this too before we run it, since it could cause so much damage if there are any wrong assumptions, but do you have instructions for running it that Ramiro will need?

Projects
Status: Done

3 participants