The CourtListener Development Database

Mike Lissner edited this page Oct 26, 2023 · 4 revisions

For FLP staff and the occasional volunteer, we now host a full-ish copy of the FLP database in AWS. This can be really useful when:

  • The issue you're debugging can't be reproduced without a bunch of data
  • You need to test performance against a stupidly large collection of data
  • You want to have approximate counts of live data

The way this works is that we use replication to copy the majority of the CL dataset to a special DB in AWS, then we sever the connection, making it a snapshot of the database.

The database was last synced on 2023-10-24T21:40:00Z

Granting Access

Access to the database requires three things:

  1. You must be FLP staff or be approved by an FLP director.
  2. Your IP address has to be placed on the allowlist.
  3. You need the host, username, and password.

Allowing an IP

FLP staff have access to the AWS security group that controls access to this database. They can open this link and add an IP address:

https://us-west-2.console.aws.amazon.com/vpcconsole/home?region=us-west-2#SecurityGroup:groupId=sg-05747cc20e595120f

To add an IP address you must add it to the "Inbound" rules. In the Actions drop down, select "Edit Inbound Rules", and you'll see a page like this:

[Screenshot: the "Edit Inbound Rules" page in the AWS console]

Click the button towards the bottom to add a new rule, set the type to Postgres, and in the Source column, select "My IP". Use /32 as your netmask if it's not on there already. Name the rule like the others: "Mike at Work" or "Mike at Bali" or whatever.

Save.
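
If you'd rather skip the console, the same rule can probably be added with the AWS CLI. This is only a sketch and is not part of the documented process: it assumes the AWS CLI is installed and authenticated with permission to edit the security group, and that the database listens on the default Postgres port 5432. The `allow_ip` function and the `DRY_RUN` variable are made up for illustration.

```shell
# Sketch: add an inbound Postgres rule for one IP with the AWS CLI.
# Assumptions (not from this page): port 5432, AWS CLI installed and authenticated.
SG_ID="sg-05747cc20e595120f"   # the security group from the console link above
REGION="us-west-2"

# allow_ip IP DESCRIPTION
# With DRY_RUN=1 it only prints the command; otherwise it runs it.
allow_ip() {
    ip="$1"
    desc="$2"
    # Shorthand syntax for --ip-permissions; /32 limits the rule to a single address.
    perms="IpProtocol=tcp,FromPort=5432,ToPort=5432,IpRanges=[{CidrIp=${ip}/32,Description=\"${desc}\"}]"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "aws ec2 authorize-security-group-ingress --region $REGION --group-id $SG_ID --ip-permissions $perms"
    else
        aws ec2 authorize-security-group-ingress \
            --region "$REGION" --group-id "$SG_ID" \
            --ip-permissions "$perms"
    fi
}

# Usage (203.0.113.7 is a placeholder address):
#   DRY_RUN=1; allow_ip 203.0.113.7 "Mike at Work"
```

Running it with `DRY_RUN=1` first lets you check the CIDR and description before anything changes in AWS.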

Test that this worked by doing:

psql -h dev.courtlistener.com

You don't need the password to test that. If that connects and asks for a password, good news, you did it. If that doesn't connect, read on about proxies.
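
As an alternative check, `pg_isready` (it ships with the Postgres client tools) reports whether the server answers at all, without asking for credentials. The sketch below translates its exit status into a hint. The `explain_pg_isready` helper is hypothetical, and the guess that "no response" means the allowlist is the problem is an assumption about how a blocked security group behaves.

```shell
# explain_pg_isready CODE: turn a pg_isready exit status into a plain-language hint.
# Exit statuses per the pg_isready docs: 0 accepting, 1 rejecting, 2 no response, 3 no attempt.
explain_pg_isready() {
    case "$1" in
        0) echo "server is accepting connections" ;;
        1) echo "server was reached but is rejecting connections" ;;
        2) echo "no response; your IP may not be on the allowlist" ;;
        *) echo "no attempt was made; check the host name and options" ;;
    esac
}

# Usage (needs network access; -t sets the timeout in seconds):
#   pg_isready -h dev.courtlistener.com -t 5; explain_pg_isready $?
```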

Proxies

It appears that a lot of browsers use proxies; Safari seems to do this by default. When that happens, your browser has a different IP than your Postgres client and the rest of your laptop, so the "My IP" trick above won't work and you won't be able to connect to the DB. Even asking Google for your IP won't help.

Get around this by getting your IP on the command line:

curl ipquail.com

Take that value, give it a netmask of /32, and put that into the inbound rules for the security group. Save and test as above.
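
A small helper can catch a bad value before it goes into the security group. This is a minimal sketch: `to_cidr` is a hypothetical name, it only accepts dotted-quad IPv4 addresses, and it does not check that each octet is 255 or less.

```shell
# to_cidr IP: check that IP looks like an IPv4 address, then append the /32 netmask.
to_cidr() {
    if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        printf '%s/32\n' "$1"
    else
        echo "not an IPv4 address: $1" >&2
        return 1
    fi
}

# Usage (needs network access):
#   to_cidr "$(curl -s ipquail.com)"
# If the service reports an IPv6 address, curl's -4 flag forces the request over IPv4.
```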

Exchanging passwords

This database does not have user data, but accessing it does get you into the FLP infrastructure. We want to keep this really tight and take this seriously.

To exchange passwords, there are two and only two approved apps: WhatsApp and Signal.

  • As a sender, set the conversation to have expiring messages of one week or less, then send the information.
  • As a recipient, copy the information to a password manager. Bitwarden and 1Password are preferred.

Accessing the database

Once you have your IP allowed and know the password, the rest is easy:

psql -h dev.courtlistener.com --dbname courtlistener --username django

That is:

  • Host: dev.courtlistener.com
  • Database name: courtlistener
  • User: django
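
If you connect often, the three settings above map onto libpq's standard environment variables, which psql reads when the matching flags are omitted. The sketch below sets them and builds an equivalent connection URI for other clients. `DEV_DB_URI` is just an illustrative variable name, and the password is deliberately left out so that psql prompts for it.

```shell
# Standard libpq environment variables; psql falls back to these when flags are omitted.
export PGHOST="dev.courtlistener.com"
export PGDATABASE="courtlistener"
export PGUSER="django"

# Equivalent connection URI for GUI clients or scripts. It contains no password.
DEV_DB_URI="postgresql://${PGUSER}@${PGHOST}/${PGDATABASE}"
echo "$DEV_DB_URI"

# With the variables exported, a bare `psql` prompts for the password and connects.
```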

Getting a fresh sync of the data

We need to figure out how to do this. Probably we'll create a new database in the RDS instance (CREATE DATABASE XYZ), and nuke the old one.

Q&A

Are there rules for how I use this?

Yes, a few:

  • Don't go building things with it outside of what you said you'd do. If you're a dev, this is for developing on behalf of FLP. If you're a volunteer, this is for completing some discrete project.
  • Try not to mess up the data too much. You're not the only one using it, so don't delete a table or something, unless you are prepared to confess that you did so, and spend time fixing it. Sure, tweak here and there, but try to keep it largely intact if you can.
  • You can create users and user-stuff, like alerts and favorites. All the regular tables are in place, they're just empty.

Why do you sever the replication connection?

Severing the connection to the upstream database allows the data in the dev DB to be different than the data in the prod DB. If we didn't do this and somebody created an item in the dev DB (because they're testing something, say), that could create a conflict when the same item tried to sync from upstream.

Making things more annoying, it's not like this would break immediately. It'd only break hours, days, or weeks later when the data in prod changed for some reason. Somehow that always seems to happen on the weekend, so we do a one-time sync, and we cut the connection, making it a snapshot, not a replica.

What's not in the database snapshot?

Two things aren't in here:

  1. As stated above, we stop syncing when the snapshot is taken, so anything created or changed after that point is missing.
  2. No user data (though the tables are in place).