Bulk Data Loading #4024
Comments
Hi @jasonzoladz. I haven't seen that issue with PostgreSQL indexes before, but we did recently create some new and fancy indexes over in #3543. That was back in December though, so I sort of hope that wouldn't be affecting you. Did you try Googling this?
There might be an old archive; I really don't remember. The reason we stopped generating those dumps is that it took too much horsepower to do on a regular basis. Sorry!
Hm, the script ran, but it looks like it must have crashed. That's a bummer. I can try to run it again manually.
I think we would, yes! We have a plan to make a vector search engine, and I imagine such embeddings would be an important part of that: #3398
That's great! Sounds like we should talk. Want to grab a spot on my calendar and we can go over how that works? https://calendly.com/flp-mike/
The manual dump of the data is underway. I'll try to keep an eye on it. Sorry it didn't work automatically. I'm not sure what's up with that, but I suspect when I run it manually I'll find out.
@mlissner, thanks so much for the prompt reply.
I did (and spent six hours trying to get it to work). However, I'm no postgres wizard.
If there's a public S3 bucket with an old set of JSON files, I'd love to make a replica.
My plan is to (at least initially) store the embeddings in a new table in a local copy of the CourtListener database; see pgvector. Once I do so, I will let you know. (I believe I have selected the embeddings with the best cost-to-retrieval-quality ratio -- gte-large-en-v1.5 -- but I want to evaluate a few more. This article discusses how costs vary substantially based on the choice of embedding model and compute, especially at the scale of CL.) I see that CL uses Elasticsearch, and I know they have an offering.
I won't take up your face time until I'm much closer to something real. I hope (and plan) that will be sooner rather than later.
Please let me know how this goes.
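For anyone following along, here is a rough sketch of what such a pgvector table could look like. It only builds SQL strings and does not touch a database. The table name `opinion_embedding`, the `opinion_id` column, and the dimension of 1024 are assumptions for illustration (check the model card for the real output size), and the HNSW index type needs a reasonably recent pgvector release.

```python
def pgvector_schema_sql(table: str = "opinion_embedding", dim: int = 1024) -> list:
    """Build SQL for a hypothetical embeddings table (names and dim are assumptions)."""
    return [
        # pgvector registers itself under the extension name "vector".
        "CREATE EXTENSION IF NOT EXISTS vector;",
        # One row per opinion; the vector width must match the embedding model.
        f"CREATE TABLE {table} (opinion_id bigint PRIMARY KEY, embedding vector({dim}) NOT NULL);",
        # Approximate nearest neighbor index using cosine distance.
        f"CREATE INDEX ON {table} USING hnsw (embedding vector_cosine_ops);",
    ]


def nearest_sql(table: str = "opinion_embedding", k: int = 10) -> str:
    """Build a top-k query; <=> is pgvector's cosine distance operator."""
    # %s is a driver placeholder for the query vector, passed as text like '[0.1, 0.2, ...]'.
    return f"SELECT opinion_id FROM {table} ORDER BY embedding <=> %s::vector LIMIT {int(k)};"


for statement in pgvector_schema_sql():
    print(statement)
print(nearest_sql(k=5))
```

Keeping the embeddings in their own table, keyed by opinion id, means the bulk data tables stay untouched and can be reloaded without regenerating vectors.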
So, one fix is to just drop the index. That should get you moving forward again, and if you need it again, you can create it once the data is ingested. One thing though: The first link I read about this issue suggests that this might be a disk corruption issue. If that's right, you might have bigger problems (but the internet is often full of bad advice!).
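A minimal sketch of the drop-then-recreate approach, using the two index names from the error messages in this issue. The helper `index_repair_sql` is hypothetical and only builds SQL strings to paste into psql; it does not connect to anything. The idea is to save each index definition from `pg_indexes` before dropping it, so it can be replayed once the data is loaded.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def index_repair_sql(index_name: str) -> dict:
    """Build SQL strings for handling one corrupted index (hypothetical helper)."""
    literal = index_name.replace("'", "''")
    return {
        # Run first and keep the output: it is the CREATE INDEX statement to replay later.
        "save_definition": f"SELECT indexdef FROM pg_indexes WHERE indexname = '{literal}';",
        # Drop the index so COPY no longer has to maintain it during the load.
        "drop": f"DROP INDEX IF EXISTS {quote_ident(index_name)};",
        # Alternative suggested by the HINT line in the error message.
        "reindex": f"REINDEX INDEX {quote_ident(index_name)};",
    }


# Index names taken from the error messages reported in this issue.
for name in ("search_dock_court_i_a043ae_idx", "search_docket_c69e55a4"):
    for step, sql in index_repair_sql(name).items():
        print(f"-- {step}")
        print(sql)
```

If the corruption comes back after a rebuild, that would point toward the disk problem mentioned above rather than anything in the bulk data itself.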
The manual dump is complete!
I am attempting to load bulk data files into postgres using the load-bulk-data-2024-03-12.sh script. (Note: I've ensured that the corresponding schema and csv files for 2024-03-11 are downloaded and available to postgres.)
The load-bulk-data-2024-03-12.sh script runs fine until it gets to dockets-2024-03-11.csv -- the first large csv file. During the copy of dockets-2024-03-11.csv I get errors like:
Loading dockets-2024-03-11.csv to database
ERROR: index "search_dock_court_i_a043ae_idx" contains corrupted page at block 40449
HINT: Please REINDEX it.
CONTEXT: COPY search_docket, line 8109127
or
ERROR: index "search_docket_c69e55a4" contains corrupted page at block 8303
HINT: Please REINDEX it.
CONTEXT: COPY search_docket, line 2110452
Is this a known issue? If so, will this be fixed and when will the next bulk data generation be performed? If not, any ideas on how I might fix this?
In the meantime, is there an (outdated) archive of a previous JSON dump (#1983) sitting in an S3 bucket somewhere? (Frankly, that might be the most useful thing to me because I'm going to need to pull the opinions from the database and process them anyway.)
(Aside: I noticed that the generation schedule contemplates that "bulk data files are regenerated on the last day of every month" yet the bulk data does not reflect that frequency.)
Use case:
I’m a lawyer and I am exploring creating text embeddings for the opinions to improve retrieval of relevant case law -- combining Approximate Nearest Neighbor search with pre- or post-filtering -- for use as part of a RAG pipeline. (I've seen some good initial results using a subset of the Harvard CAP static files.)
Once those embeddings are generated, I am happy to contribute the embeddings (gte-large-en-v1.5) to Free Law if you might find them useful. Further, if my project ever becomes commercialized, I'd love to explore the possibility of contracting with CourtListener to obtain the daily update stream.
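To make the pre- versus post-filtering distinction concrete, here is a toy sketch. It uses exact brute-force cosine similarity as a stand-in for a real ANN index, and the documents, vectors, and court labels are made-up sample data, not anything from the bulk files.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, docs, k):
    """Exact nearest neighbors by cosine similarity (stand-in for an ANN index)."""
    return sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]


def pre_filter_search(query, docs, k, keep):
    """Apply the metadata filter first, then search only the surviving documents."""
    return top_k(query, [d for d in docs if keep(d)], k)


def post_filter_search(query, docs, k, keep):
    """Search everything first, then drop hits that fail the metadata filter."""
    return [d for d in top_k(query, docs, k) if keep(d)]


# Made-up sample data: tiny 2-d vectors with a court label as metadata.
docs = [
    {"id": 1, "court": "scotus", "vec": [1.0, 0.0]},
    {"id": 2, "court": "ca9", "vec": [0.9, 0.1]},
    {"id": 3, "court": "scotus", "vec": [0.0, 1.0]},
    {"id": 4, "court": "ca9", "vec": [0.8, 0.2]},
]
query = [1.0, 0.05]


def is_scotus(doc):
    return doc["court"] == "scotus"


pre = [d["id"] for d in pre_filter_search(query, docs, 2, is_scotus)]
post = [d["id"] for d in post_filter_search(query, docs, 2, is_scotus)]
print("pre-filter: ", pre)
print("post-filter:", post)
```

With this data, pre-filtering returns two matching documents (ids 1 and 3), while post-filtering returns only one (id 1), because the other top hit was from a different court. That shortfall is the usual trade-off: post-filtering is simpler to bolt onto an ANN index but can return fewer than k results when the filter is selective.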