Management command to import citations from a csv file #3915

quevon24 · 2024-03-26T16:58:56Z

As I mentioned in previous meetings, I think it is easier to match the citations against the datasets we have in the local environment and generate only a csv of the new citations to load into the clusters. This should be relatively fast, and it reuses existing code.

This command loads the citations from a csv with the following format (no header row; column 1: cluster id, column 2: citation to add):

"2155423", "2003 WL 22508842"
"7903720","520 A.2d 234"
"7903715","520 A.2d 233"

Here is a sample file to test the command:
updated_sample.csv
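
For reference, the two-column format described above can be read with a few lines of standard-library Python. This is only a minimal sketch, not the code of the command in this PR: `parse_citation_rows` is a hypothetical helper name, and skipping malformed lines is an assumption made for the illustration.

```python
import csv
import io


def parse_citation_rows(text):
    """Yield (cluster_id, citation) pairs from headerless two-column CSV text."""
    # skipinitialspace=True tolerates a space after the comma, as in the first
    # sample row ("2155423", "2003 WL 22508842").
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for row in reader:
        if len(row) != 2:
            continue  # assumption: silently skip blank or malformed lines
        cluster_id, citation = row
        yield int(cluster_id), citation.strip()


sample = '"2155423", "2003 WL 22508842"\n"7903720","520 A.2d 234"\n'
rows = list(parse_citation_rows(sample))
```

Each pair holds the cluster id as an integer and the citation string to add to that cluster.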

We need to place the file in cl/assets/media/updated_sample.csv, clone the clusters, and then run the command.

How to run the command using the csv file:

docker exec -it cl-django python /opt/courtlistener/manage.py clone_from_cl --type search.OpinionCluster --id 1904175 7903720 7903715 7903719 7903933
docker exec -it cl-django python /opt/courtlistener/manage.py import_citations_csv --csv /opt/courtlistener/cl/assets/media/updated_sample.csv

The command can be used with Westlaw, Lexis or any other dataset; it is only necessary to follow the csv format.

quevon24 · 2024-05-02T16:30:20Z

@mlissner

Westlaw

7,206,737 rows with parallel citation
1,063,191 rows without parallel citation

Of the 7,206,737 rows:

Citations to add to CL: 2,133,707 (matched using one of the row's citations, same filed date, same court and case name similarity >= 90%)
Both citations exist: 1,643,083 rows
Can’t match using any of the citations: 981,403 rows
Possible matches: 2,448,544 rows (the dataset data does not give enough precision to ensure at least 90% similarity; this number includes incorrect matches)
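
To make the matching rule concrete, here is a minimal sketch of the criteria listed above: one shared citation, same filed date, same court, and case name similarity of at least 90%. It is an illustration under assumptions, not the matching code that was actually used. The `Case` class and function names are hypothetical, the comment does not say which similarity metric was applied (`difflib.SequenceMatcher` is a stand-in), and the sample data is made up.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class Case:
    """Hypothetical record for a dataset row or an existing cluster."""
    citations: frozenset
    date_filed: date
    court_id: str
    case_name: str


def name_similarity(a, b):
    # Stand-in metric; the metric actually used is not stated in this thread.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_confident_match(row, cluster, threshold=0.90):
    """True when all four criteria from the list above hold."""
    return (
        bool(row.citations & cluster.citations)  # at least one shared citation
        and row.date_filed == cluster.date_filed
        and row.court_id == cluster.court_id
        and name_similarity(row.case_name, cluster.case_name) >= threshold
    )


def citations_to_add(row, cluster):
    """Citations present in the dataset row but missing from the cluster."""
    return sorted(row.citations - cluster.citations)


# Made-up example data
row = Case(frozenset({"1 Foo 1", "2 Bar 2"}), date(2000, 1, 1), "examplect", "Smith v. Jones")
cluster = Case(frozenset({"1 Foo 1"}), date(2000, 1, 1), "examplect", "Smith v. Jones")
other_court = Case(frozenset({"1 Foo 1"}), date(2000, 1, 1), "otherct", "Smith v. Jones")
```

Rows that share a citation but fail one of the other checks would fall into the "possible matches" bucket rather than being loaded.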

Of the 2,133,707 citations to be added, 1,020,864 are from Westlaw and the remaining 1,112,843 are from other reporters.

For now the 1,063,191 rows without a parallel citation were ignored: only the case data (case name, filed date, docket number and court) could be used to find a match, after which we would check whether the citation we have is already in the system. In that case the precision has to be higher to ensure a correct match.

Lexis

In total there are 14,194,271 rows in the dataset; the number of citations in each row may vary.

288,117 rows don’t have a case name
167,463 rows don’t have citations
10,918,806 rows are valid (more than one citation: we need one citation to match the row to a cluster, and we add the rest if they are not already there)
~25M citations in the 10,918,806 rows; if each row has at least two citations and one is used to match the cluster, that leaves approximately 12.5M citations available to add

4,467,182 rows matched using one citation, same filed date, same court id and case name matched >= 90%
6,608,508 citations to add to CL from ~4.4M rows
3,367,005 rows that we couldn’t match
3,084,619 rows with possible matches (not enough precision in the match or bad matches)

Update 05/02/2024: ~440K new citations to add, matched using filed date, court id and case name similarity >= 90%. This process is more exhaustive because the court id and filed date alone can return more than 100 candidate cases for a single dataset row. We could get more citations, but this will take some time; the same approach could also be applied to rows that have only one citation.
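
The citation-free lookup described in this update can be sketched as: group existing clusters by court id and filed date, then score every candidate in the group by case name. This is a rough illustration only. The dictionary keys, function names and the `difflib` metric are assumptions, and the data is made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_index(clusters):
    """Group clusters by (court_id, date_filed) so candidates can be looked up."""
    index = defaultdict(list)
    for cluster in clusters:
        index[(cluster["court_id"], cluster["date_filed"])].append(cluster)
    return index


def best_match(row, index, threshold=0.90):
    """Return the candidate with the most similar case name, or None."""
    best, best_score = None, 0.0
    # One (court, date) group may hold many candidates, which is why this
    # approach is slower than matching on a shared citation.
    for cluster in index.get((row["court_id"], row["date_filed"]), []):
        score = SequenceMatcher(
            None, row["case_name"].lower(), cluster["case_name"].lower()
        ).ratio()
        if score >= threshold and score > best_score:
            best, best_score = cluster, score
    return best


# Made-up example data
clusters = [
    {"id": 1, "court_id": "examplect", "date_filed": "2000-01-01", "case_name": "Smith v. Jones"},
    {"id": 2, "court_id": "examplect", "date_filed": "2000-01-01", "case_name": "Doe v. Roe"},
]
index = build_index(clusters)
hit = best_match(
    {"court_id": "examplect", "date_filed": "2000-01-01", "case_name": "Smith v. Jones"}, index
)
miss = best_match(
    {"court_id": "otherct", "date_filed": "2000-01-01", "case_name": "Smith v. Jones"}, index
)
```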

~8M new citations in total from dataset


quevon24 · 2024-05-08T17:47:26Z

@grossir please could you take a look at this PR

cl/citations/management/commands/import_citations_csv.py


quevon24 · 2024-05-13T16:52:06Z

@grossir I updated the code to use skiprows and nrows to set the start and end rows and to limit the number of rows processed.

I removed the header from the csv and updated the sample file to test the command.
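
As a rough illustration of that row window, the sketch below skips a number of rows and then reads at most a fixed number of rows. It uses the standard library as a stand-in for the `skiprows` and `nrows` arguments of pandas `read_csv`; it is not the code in this PR, and `read_window` and its parameter names are hypothetical.

```python
import csv
import io
from itertools import islice


def read_window(lines, start_row=0, limit=None):
    """Skip `start_row` rows, then return at most `limit` rows (all if None)."""
    reader = csv.reader(lines, skipinitialspace=True)
    stop = None if limit is None else start_row + limit
    return list(islice(reader, start_row, stop))


# Made-up data in the same two-column, headerless layout
data = '"1","a"\n"2","b"\n"3","c"\n"4","d"\n"5","e"\n'
window = read_window(io.StringIO(data), start_row=1, limit=2)
```

Splitting a large file into windows like this lets a long import be resumed from a known row if it is interrupted.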

grossir

I tested this and it works fine. I think it is ready to merge.

I just left some minor comments about the arguments and about a pylint complaint

cl/citations/management/commands/import_citations_csv.py

grossir

Feel free to merge this PR!

mlissner · 2024-05-15T18:46:31Z

Nice one! For the record, a zillion people are asking for this right now. Very nice milestone, Kevin!

quevon24 · 2024-05-15T18:49:31Z

About this: @blancoramiro is in charge of running it, right? If so, I can send him the files and we can run it when you give the go-ahead.

mlissner · 2024-05-15T20:23:49Z

Yeah. I want to get Bill's eyes on this too before we run it, since it could cause so much damage if there are any wrong assumptions, but do you have instructions for running it that Ramiro will need?

feat(citations): management command to import citations from a csv file

8ee2539

to be used with westlaw, lexis or any other dataset, it is only necessary to follow the csv format

quevon24 added data-quality Django labels Mar 26, 2024

quevon24 self-assigned this Mar 26, 2024

Merge branch 'main' into import-citations-csv

14af4fc

quevon24 requested a review from flooie March 26, 2024 17:14

quevon24 and others added 4 commits March 26, 2024 11:59

feat(citations): management command to import citations from a csv file

d837626

to be used with westlaw, lexis or any other dataset, it is only necessary to follow the csv format

[pre-commit.ci] auto fixes from pre-commit.com hooks

de90d00

for more information, see https://pre-commit.ci

feat(citations): management command to import citations from a csv file

b46e3cf

to be used with westlaw, lexis or any other dataset, it is only necessary to follow the csv format

Merge remote-tracking branch 'origin/import-citations-csv' into import-citations-csv

a403f34

This was referenced Mar 26, 2024

Lexis citations merger #2621

Closed

Westlaw citation merger #2622

Closed

Merge branch 'main' into import-citations-csv

251cb62

quevon24 and others added 6 commits May 2, 2024 10:30

Merge branch 'main' into import-citations-csv

83c2a72

Merge branch 'main' into import-citations-csv

99871c7

feat(citations): update management command to import citations from a csv file

fb7b1af

[pre-commit.ci] auto fixes from pre-commit.com hooks

5207a4c

for more information, see https://pre-commit.ci

Merge branch 'main' into import-citations-csv

0b92d58

Merge branch 'main' into import-citations-csv

cf39631

quevon24 requested a review from grossir May 8, 2024 17:47

grossir mentioned this pull request May 8, 2024

Error 403 in development environment when uploading files to S3 #4018

Open

grossir reviewed May 9, 2024

cl/citations/management/commands/import_citations_csv.py (3 comments, outdated)

quevon24 added 3 commits May 9, 2024 10:02

Merge branch 'main' into import-citations-csv

645bac2

Merge branch 'main' into import-citations-csv

be9c1ce

feat(import_citations): tweak code to use pandas read_csv arguments to set start/end row and set limit

3077a6b

quevon24 requested a review from grossir May 13, 2024 16:52

grossir reviewed May 15, 2024

cl/citations/management/commands/import_citations_csv.py (2 comments)

Merge branch 'main' into import-citations-csv

e88e99d

quevon24 requested a review from grossir May 15, 2024 17:44

grossir approved these changes May 15, 2024

quevon24 merged commit 0a33b82 into main May 15, 2024
13 checks passed

quevon24 deleted the import-citations-csv branch May 15, 2024 18:34
