Management command to import citations from a csv file #3915
To be used with Westlaw, Lexis, or any other dataset, it is only necessary to follow the CSV format.
for more information, see https://pre-commit.ci
**Westlaw**

- 7,206,737 rows with a parallel citation.
- Citations to add to CL: 2,133,707 (matched using one of the citations, the same filed date, the same court, and a case name match >= 90%).
- Of the 2,133,707 citations to be added, 1,020,864 are from Westlaw and the remaining 1,112,843 are from other reporters.
- For now, the 1,063,191 rows without a parallel citation were ignored, since only the case data (case name, filed date, docket number, and court) could be used to find a match, and we would then have to verify whether the citation is already in the system; in that case the precision has to be higher to ensure a correct match.

**Lexis**

- 14,194,271 rows in the dataset in total; the number of citations in each row may vary.
- 288,117 rows don't have a case name.
- 4,467,182 rows matched using one citation, the same filed date, the same court id, and a case name match >= 90%.
- Update 05/02/2024: ~440K new citations to add using the filed date, court id, and a case name match >= 90%. This process is more exhaustive because a single dataset row can return more than 100 candidate cases when matching only on court id and filed date; we could get more citations this way, but it will take some time. We could also apply it to rows that have only one citation.
- ~8M new citations in total from the dataset.
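The ">= 90% case name match" used throughout these counts can be sketched with Python's `difflib`. The helper name, normalization, and threshold below are illustrative only, not the PR's actual matching code:

```python
from difflib import SequenceMatcher


def case_names_match(name_a: str, name_b: str, threshold: float = 0.90) -> bool:
    """Illustrative fuzzy comparison of two case names.

    The real matcher in the PR may normalize names differently; this
    only demonstrates the ">= 90% similarity" idea described above.
    """
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold


# A near-identical pair clears the threshold; unrelated names do not.
print(case_names_match("Smith v. Jones", "Smith v Jones"))  # True
print(case_names_match("Smith v. Jones", "Roe v. Wade"))
```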
@grossir, could you please take a look at this PR?
@grossir I updated the code to use `skiprows` and `nrows` to set the start and end rows and to limit how many rows are processed. I removed the header from the CSV and updated the sample file used to test the command.
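The `skiprows`/`nrows` windowing described here can be sketched in pure Python. The parameter names follow pandas' `read_csv` options of the same names, which the command likely uses; this stdlib version just illustrates the slicing semantics on a headerless file:

```python
import csv
from io import StringIO
from itertools import islice


def read_window(fileobj, skiprows=0, nrows=None):
    """Yield CSV rows starting at `skiprows`, stopping after `nrows` rows.

    Mirrors the semantics of skiprows/nrows for a headerless CSV, as
    described in the comment above. Illustrative only.
    """
    reader = csv.reader(fileobj)
    stop = None if nrows is None else skiprows + nrows
    yield from islice(reader, skiprows, stop)


sample = StringIO("1,100 U.S. 1\n2,101 U.S. 2\n3,102 U.S. 3\n")
rows = list(read_window(sample, skiprows=1, nrows=1))
print(rows)  # [['2', '101 U.S. 2']]
```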
I tested this and it works fine; I think it is ready to merge.
I just left some minor comments about the arguments and about a pylint complaint.
Feel free to merge this PR!
Nice one! For the record, a zillion people are asking for this right now. Very nice milestone, Kevin!
About this: @blancoramiro is in charge of running it, right? That way I can send him the files and we can execute it whenever you give the word.
Yeah. I want to get Bill's eyes on this too before we run it, since it could cause so much damage if there are any wrong assumptions. But do you have the instructions Ramiro will need to run it?
As I mentioned in previous meetings, I think it is easier to match the citations against the datasets we have from the local environment and only generate a CSV to load the new citations into the clusters. This should be relatively fast and reuses existing code.
This command is responsible for loading the citations that come in the CSV with the following format (no header row; column 1: cluster id, column 2: citation to add):
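A file in that format looks like the following (the cluster ids and citations below are made up for illustration; see the attached sample file for real test data):

```
1234567,123 F.3d 456
7654321,2020 WL 123456
```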
Here is a sample file to test the command:
updated_sample.csv
We need to place the file in cl/assets/media/sample.csv, clone the clusters and then run the command.
How to run the command using the CSV file:
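A minimal sketch of the command-line interface implied by this thread, using plain `argparse`. The option names follow the `skiprows`/`nrows` discussion earlier; the actual Django management command's name and options may differ:

```python
import argparse


def build_parser():
    """Hypothetical CLI mirroring the options discussed in this PR."""
    parser = argparse.ArgumentParser(
        description="Import citations into clusters from a CSV file."
    )
    parser.add_argument(
        "--csv",
        default="cl/assets/media/sample.csv",
        help="Path to the headerless CSV (cluster id, citation).",
    )
    parser.add_argument(
        "--skiprows",
        type=int,
        default=0,
        help="Row offset at which to start processing.",
    )
    parser.add_argument(
        "--nrows",
        type=int,
        default=None,
        help="Maximum number of rows to process.",
    )
    return parser


args = build_parser().parse_args(["--skiprows", "100", "--nrows", "50"])
print(args.skiprows, args.nrows)  # 100 50
```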