Columbia importer updated #2865

Open: wants to merge 109 commits into base: main

@quevon24 (Member) commented Jul 6, 2023

This PR contains the updated version of the Columbia importer. The main changes are:

  • Update the codebase to match Python 3.11 style
  • Replace deprecated functions
  • Add typing
  • Remove the court regex and use courts-db to find courts (we may need to update courts-db for the tests to pass, see PR 74)
  • Replace etree with Beautiful Soup for parsing XML files
  • Store opinions in the correct order
  • Store opinion footnotes
  • Find duplicates using citation, docket number, case name, and opinion content
  • Add citations when a duplicate is found
  • Store the syllabus
  • Pass a CSV file path as an argument, with absolute paths to the XML files
  • If we find a possible match, only log a message and abort the import of that file instead of adding data to the matched cluster, so that we can review the logs manually
  • Default XML directory: /opt/courtlistener/_columbia
  • Default CSV location: /opt/courtlistener/_columbia/columbia_import.csv
  • Log all messages to a file so that they can be reviewed manually without needing to read the container logs
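
The "log and abort on a possible match" behavior combined with file logging could look roughly like the sketch below. This is only an illustration, not the code in this PR: the logger name, `build_file_logger`, `import_file`, and the `possible_match_id` parameter are all hypothetical.

```python
import logging
import tempfile
from pathlib import Path
from typing import Optional


def build_file_logger(log_path: Path) -> logging.Logger:
    """Return a logger that appends every message to log_path."""
    logger = logging.getLogger("columbia_import_sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep these messages out of the root logger
    return logger


def import_file(
    filepath: str, possible_match_id: Optional[int], logger: logging.Logger
) -> bool:
    """Return True if the file would be imported, False if it was skipped."""
    if possible_match_id is not None:
        # Do not touch the matched cluster; leave a trace for manual review.
        logger.info(
            "Possible match: cluster %s for %s; skipping",
            possible_match_id,
            filepath,
        )
        return False
    logger.info("Importing %s", filepath)
    return True


# Demo: one file without a match, one with a (made-up) matching cluster id.
log_path = Path(tempfile.mkdtemp()) / "columbia_import.log"
logger = build_file_logger(log_path)
imported = import_file(
    "michigan/supreme_court_opinions/documents/d5a484f1bad20ba0.xml", None, logger
)
skipped = import_file(
    "texas/court_opinions/documents/5c8dba31985162bf.xml", 42, logger
)
log_lines = log_path.read_text(encoding="utf-8").splitlines()
```

Keeping the skip decision in one place means the log file contains exactly one line per file, which is what makes a manual review pass practical.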

Based on the data in local_path in the Opinion model, roughly 1.2M files have to be imported. The real number could be lower, because some of the cases in this list of files have already been imported from a different source.

Usage:

Import using a CSV file that lists the XML file paths inside the mounted directory:
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/testfile.csv

CSV example:

filepath
michigan/supreme_court_opinions/documents/d5a484f1bad20ba0.xml
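
As a rough illustration of how a CSV in this layout could be turned into full XML paths, here is a minimal sketch. It assumes a single `filepath` column as in the example above and the default XML directory from the description; `resolve_xml_paths` is a hypothetical helper, not a function from this PR.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import List, TextIO

# Default XML directory taken from the PR description.
DEFAULT_XML_DIR = PurePosixPath("/opt/courtlistener/_columbia")


def resolve_xml_paths(
    csv_file: TextIO, xml_dir: PurePosixPath = DEFAULT_XML_DIR
) -> List[PurePosixPath]:
    """Read the `filepath` column and join each entry with xml_dir."""
    paths = []
    for row in csv.DictReader(csv_file):
        entry = (row.get("filepath") or "").strip()
        if not entry:
            continue  # ignore blank rows
        # Joining with an absolute entry yields the entry itself, so both
        # relative and absolute paths in the CSV resolve sensibly.
        paths.append(xml_dir / entry)
    return paths


sample_csv = io.StringIO(
    "filepath\n"
    "michigan/supreme_court_opinions/documents/d5a484f1bad20ba0.xml\n"
)
paths = resolve_xml_paths(sample_csv)
```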

Import specifying the mounted directory where the XML files are located:
docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/files_to_import.csv --xml-dir /opt/courtlistener/columbia_files

@quevon24 quevon24 requested a review from flooie July 6, 2023 18:37
@quevon24 quevon24 self-assigned this Jul 6, 2023
@quevon24 quevon24 marked this pull request as draft July 15, 2023 01:08
@quevon24 (Member, Author) commented:

The changes are ready @flooie

@quevon24 quevon24 requested a review from grossir May 22, 2024 17:32
@quevon24 (Member, Author) commented:

@grossir when you have time, could you take a look?

Here is a sample file to test the command:

random_sample_1.zip

To run the command, copy the contents of the zip to cl/assets/media:

docker-compose -f docker/courtlistener/docker-compose.yml exec cl-django python /opt/courtlistener/manage.py import_columbia --csv /opt/courtlistener/cl/assets/media/random_sample_1.csv --xml-dir /opt/courtlistener/cl/assets/media/random_sample_1

If you have any questions, let me know.

@grossir (Contributor) left a comment

It ingested ~766 dockets/opinion clusters out of 1000 documents.
I ran the script 4 times and got 6 triplicated ingestions; maybe you can check whether the same happens in your environment. I used this query to detect them:

SELECT
    local_path, html_columbia, count(*)
FROM
    search_docket sd
INNER JOIN search_opinioncluster oc
    ON docket_id = sd.id
INNER JOIN search_opinion
    ON search_opinion.cluster_id = oc.id
WHERE
    sd.date_created::date = '2024-05-23'::date
GROUP BY local_path, html_columbia
HAVING count(*) > 1;

I left some comments, mostly ideas for improvements.

I haven't really tested the matching algorithms beyond the most basic duplication. If you could send another sample file, this time sampled from the most recent opinions (so that it is easier to find them on the court web pages or scrape them), it would help with testing those parts.

}

# Add date data into columbia dict
columbia_data.update(find_dates_in_xml(soup))
Review comment (Contributor):

I think some "FILED_TAGS" strings are missing in columbia_utils.py.
For example, for texas/court_opinions/documents/5c8dba31985162bf.xml in the sample, the following date is parsed: [[('opinion issued', datetime.date(2006, 10, 23))]], but it is not assigned to the "date_filed" key.

Looking at the raw text, I think it should be treated as "date_filed":

<opinion>
<reporter_caption><center>IN RE LARREW, 05-06-01227-CV (Tex.App.-Dallas 10-23-2006)</center></reporter_caption>
<caption><center>IN RE STEPHEN JAMES LARREW, Relator.</center></caption>
<docket><center>No. 05-06-01227-CV</center></docket><court><center>Court of Appeals of Texas, Fifth District, Dallas.</center></court>
<date><center>Opinion issued October 23, 2006.</center>

I see that "opinion issued" is included in ARGUED_TAGS rather than in FILED_TAGS; I'm not sure about the logic for this.
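
To make the suggestion concrete, here is a small sketch of how a labeled date string could be parsed and routed by tag. The tag sets below are hypothetical placeholders (the real lists live in columbia_utils.py and are not reproduced here), and placing "opinion issued" under FILED_TAGS is the change being proposed, not current behavior. `parse_labeled_date` and `assign_date` are illustrative names only.

```python
import re
from datetime import date, datetime
from typing import Dict, Optional, Tuple

# Hypothetical tag sets for illustration only.
FILED_TAGS = {"filed", "opinion filed", "opinion issued"}
ARGUED_TAGS = {"argued", "submitted"}

# A leading label followed by a date such as "October 23, 2006".
DATE_RE = re.compile(r"^(?P<label>.*?)\s*(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})")


def parse_labeled_date(text: str) -> Optional[Tuple[str, date]]:
    """Split 'Opinion issued October 23, 2006.' into (label, date)."""
    match = DATE_RE.search(text.strip())
    if match is None:
        return None
    parsed = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return match.group("label").lower(), parsed


def assign_date(label: str, parsed: date, data: Dict[str, object]) -> None:
    """Store the date under the key implied by its label."""
    if label in FILED_TAGS:
        data["date_filed"] = parsed
    elif label in ARGUED_TAGS:
        data["date_argued"] = parsed
    else:
        # Keep unknown labels visible instead of silently dropping them.
        data.setdefault("unassigned_dates", []).append((label, parsed))


columbia_data: Dict[str, object] = {}
result = parse_labeled_date("Opinion issued October 23, 2006.")
if result is not None:
    assign_date(result[0], result[1], columbia_data)
```

Collecting unrecognized labels in a separate key would also make it easy to audit which strings still need a home in one of the tag lists.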


I collected all the documents that do have dates but no date filed. Some are obviously not OpinionCluster.date_filed, like "case announcements and administrative actions", but I am not so sure about some of the others:

{'texas/court_opinions/documents/5c8dba31985162bf.xml': 'opinion issued',
 'arkansas/court_opinions/documents/ae218d6345f5d320.xml': 'opinion delivered',
 'texas/court_opinions/documents/5f4fe3e1c4e72785.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/96742836d45c4996.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/d218fb45d4055bdd.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/61198adb4b840f4d.xml': 'denied',
 'arkansas/court_opinions/documents/5793152fb3e371a3.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/cda2e7a6c083f661.xml': 'denied',
 'texas/court_opinions/documents/f7f4eb4e0bb7e71a.xml': 'opinion delivered and filed',
 'connecticut/appellate_court_opinions/documents/e3e9aa07cc97f60f.xml': 'officially released',
 'arkansas/court_opinions/documents/0b027f05aa07c2af.xml': 'opinion delivered',
 'texas/court_opinions/documents/248981bf18493e9d.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/a968b68353ffe980.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/ebdc8da5b2ec8fe9.xml': 'opinion delivered',
 'michigan/supreme_court_opinions/documents/63efa26d555875ea.xml': 'leave to appeal denied',
 'texas/court_opinions/documents/d4a6653c3a7c08fe.xml': 'delivered',
 'maryland/court_of_appeals_opinions/documents/69cb6658d5b0324d.xml': 'granted',
 'texas/court_opinions/documents/0904c0a3016f8421.xml': 'delivered',
 'texas/court_opinions/documents/a43217e67bd08858.xml': 'opinion issued',
 'texas/court_opinions/documents/c48edff93471911d.xml': 'opinion issued',
 'ohio/court_opinions/documents/52a07db0c124634f.xml': 'case announcements and administrative actions',
 'arkansas/court_opinions/documents/e628b04ac0dcd6f1.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/9067dae3cd6d312e.xml': 'denied',
 'arkansas/court_opinions/documents/300ebbd01ba38398.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/217ae38fdf9869af.xml': 'denied',
 'arkansas/court_opinions/documents/9e3f71089f9d11dc.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/6600fe895d37d853.xml': 'denied',
 'arkansas/court_opinions/documents/2c71f85af35b9e0f.xml': 'opinion delivered',
 'texas/court_opinions/documents/cecfdd58268e8f07.xml': 'opinion issued',
 'texas/court_opinions/documents/60a231f3da6a421f.xml': 'memorandum opinion delivered and filed',
 'arkansas/court_opinions/documents/d37e7ba255a67a6d.xml': 'opinion delivered',
 'texas/court_opinions/documents/b78271984621969a.xml': 'opinion issued',
 'maryland/court_of_appeals_opinions/documents/4b97a5803331bb29.xml': 'denied',
 'maryland/court_of_appeals_opinions/documents/9eefe6f3e03131f7.xml': 'denied',
 'massachusetts/superior_court_opinions/documents/161739ca6ca6348b.xml': 'memorandum dated',
 'maryland/court_of_appeals_opinions/documents/6ac77c8a8002a723.xml': 'denied',
 'texas/court_opinions/documents/3bcee6268dd18a72.xml': 'opinion issued',
 'texas/court_opinions/documents/4d909c6b7d4de7e4.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/8ddc4fe19662d9fb.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/9d7c8e94e2c2b40f.xml': 'opinion delivered',
 'texas/court_opinions/documents/2db17b19d30d85df.xml': 'opinion issued',
 'arkansas/court_opinions/documents/3dd26fb70896c79b.xml': 'opinion delivered',
 'arkansas/court_opinions/documents/ab59ead0feee789f.xml': 'opinion delivered',
 'maryland/court_of_appeals_opinions/documents/9d65b83825eed85f.xml': 'denied',
 'michigan/supreme_court_opinions/documents/1287940f26660dfa.xml': 'summary dispositions',
 'arkansas/court_opinions/documents/6ae7cc75fc0cd311.xml': 'opinion delivered',
 'connecticut/appellate_court_opinions/documents/96cb6396c50954b6.xml': 'officially released',
 'connecticut/appellate_court_opinions/documents/4df558796ec8e60c.xml': 'decision released'}
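
One way to triage a mapping like this is to tally the labels and group them by state, so that the most frequent candidates for "date_filed" stand out. The sketch below uses only five entries copied from the list above, so its counts describe that subset and not the full sample.

```python
from collections import Counter
from typing import Dict, Set

# Five entries copied from the mapping above, for illustration.
unassigned = {
    "texas/court_opinions/documents/5c8dba31985162bf.xml": "opinion issued",
    "arkansas/court_opinions/documents/ae218d6345f5d320.xml": "opinion delivered",
    "texas/court_opinions/documents/5f4fe3e1c4e72785.xml": "opinion delivered",
    "maryland/court_of_appeals_opinions/documents/61198adb4b840f4d.xml": "denied",
    "ohio/court_opinions/documents/52a07db0c124634f.xml": "case announcements and administrative actions",
}

# How often each label occurs.
label_counts = Counter(unassigned.values())

# Which labels each state uses; the state is the first path segment.
labels_by_state: Dict[str, Set[str]] = {}
for path, label in unassigned.items():
    state = path.split("/", 1)[0]
    labels_by_state.setdefault(state, set()).add(label)
```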

cl/corpus_importer/management/commands/import_columbia.py (outdated, resolved)
Status: 🏗 In progress
3 participants