Handling an Outage if Things Are Down
We try to keep CourtListener up, but it's not always easy given the huge spikes in traffic we can sometimes get. What follows is the general process for fixing the site when it's down.
The general idea is to (1) figure out which component of the site is broken, and then (2) fix it.
This part is pretty hard at the moment. First steps:
- Try the health check endpoint. It's easy and fast and often revealing.
- Look at Sentry and see what kinds of errors it's reporting. It'll often give you a good lead about where to begin.
Usually, the problem is one of:
- Solr

  Solr is run from a large EC2 instance that's outside of k8s. You can check its status by adding your IP to the Security Group, SSH'ing into it…

  ```
  ssh -i .ssh/solr.pem ubuntu@ec2-35-91-65-155.us-west-2.compute.amazonaws.com
  ```

  …and then doing the standard checks listed below. To SSH into the server, you'll need its SSH key.
- PostgreSQL

  PostgreSQL is run in AWS RDS. If it's failing, it's almost always because somebody is running a nasty API query of some kind (see the section below on sorting that out). You can check its health via its logs (see below) and the RDS monitoring dashboard in the AWS console.
- Django

  Django is run as a horizontally scaled k8s deployment. Usually when it fails it's because:

  - It needs more memory or servers (check this in k9s).
  - A new pod deployment is broken. This will turn up in Sentry, or if you look at the `cl-python` deployment in k8s, it'll show lots of crashing pods.
- Redis

  Redis is run via AWS ElastiCache. Like RDS, it has a monitoring dashboard and logs in the AWS console.
- AWS SES

  This is what sends our emails. It has a quota. If the quota runs out, we will get lots of errors from Sentry. To check the status of SES, pull it up in the console, where it will tell you everything you need to know.
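If you'd rather check the quota from a script, something like the following sketch should work. The `quota_remaining` helper is hypothetical, not part of our codebase; the real numbers come from boto3's `get_send_quota` call, shown in the comment.

```python
# Sketch: check how much of the SES sending quota is left.
# The helper below is hypothetical, not part of the CourtListener codebase.


def quota_remaining(quota: dict) -> float:
    """Return the number of messages we can still send in this 24h window."""
    return quota["Max24HourSend"] - quota["SentLast24Hours"]


# Against real AWS you'd fetch the quota from SES, e.g.:
#   import boto3
#   quota = boto3.client("ses").get_send_quota()
# This is the shape of the response, with sample numbers:
quota = {"Max24HourSend": 50000.0, "MaxSendRate": 14.0, "SentLast24Hours": 49990.0}
print(quota_remaining(quota))  # 10.0 -- nearly out of quota
```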
Solr logs are shipped to AWS CloudWatch. You can find and filter them there. Note that the `caller` GET parameter in the logs can be a useful clue to figure out what part of CourtListener made a specific call to Solr.
If you're on the server, you can also view the logs with:

```
sudo docker logs solr -f --since 1m
```
Elastic is run as a k8s cluster. It uses fluent-bit to send its logs to AWS CloudWatch. To search the logs there, use Log Insights, with a query like:

```
filter log_processed.log.level != "WARN"
and log not like /some-regex/
| fields @timestamp, @message
```
To query Elasticsearch directly, set up a few environment variables:

```
export ES_USER=elastic
export ES_PASS=
export ES_ENDPOINT=localhost:9200
```
Search stats:

```
curl -X GET "https://${ES_ENDPOINT}/_stats/search" -k -u ${ES_USER}:${ES_PASS}
```

Target a single index:

```
curl -X GET "https://${ES_ENDPOINT}/recap_vectors/_stats/search" -k -u ${ES_USER}:${ES_PASS}
```

Indexing stats:

```
curl -X GET "https://${ES_ENDPOINT}/recap_vectors/_stats/indexing" -k -u ${ES_USER}:${ES_PASS}
```

Retrieve all the queued tasks related to search:

```
curl -X GET "https://${ES_ENDPOINT}/_tasks?detailed=true&actions=*search" -k -u ${ES_USER}:${ES_PASS}
```
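The `_tasks` response is verbose JSON. As a rough sketch (the function name and sample data are illustrative, not from our codebase), you can tally the queued search tasks per node like this:

```python
from collections import Counter


def count_search_tasks(tasks_response: dict) -> Counter:
    """Tally queued tasks per node from an ES /_tasks?actions=*search response."""
    counts = Counter()
    for node_id, node in tasks_response.get("nodes", {}).items():
        counts[node_id] += len(node.get("tasks", {}))
    return counts


# Sample of the response shape (heavily trimmed):
sample = {
    "nodes": {
        "node-1": {"tasks": {"node-1:100": {"action": "indices:data/read/search"},
                             "node-1:101": {"action": "indices:data/read/search"}}},
        "node-2": {"tasks": {"node-2:200": {"action": "indices:data/read/search"}}},
    }
}
print(count_search_tasks(sample))  # Counter({'node-1': 2, 'node-2': 1})
```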
Web traffic can be logged at three levels:
- At the CDN, CloudFront
- At the Application Load Balancer (ALB)
- At the docker pods or deployment, in k8s.
What we do is have CloudFront ship logs to S3. You can go into S3 and look at the logs directly, but it's horrible: they're split up and zipped, and it's a mega pain.
The better way to look at our web logs is AWS Athena, which has some saved queries for looking at the logs. Athena is sort of magic: it gives you a SQL API for files stored in S3, so you can literally query S3.
Go here:
That will get you a feel for the traffic. If you want to map IP addresses to users, use the recipe here.
These logs don't tend to say much, but you can get them in k9s by selecting the deployment object type (`:deployment`), pressing `l` for logs, and then `0` to see the latest logs.
Celery logs can be viewed from k9s by selecting the logs for the `celery-prefork` deployment. To do that:

- Open k9s
- Go to deployments (`:deployment <enter>`)
- Use arrows to select "celery"
- Press `l` for the logs.
Redis logs are available in the Elasticache console.
Postgresql logs are available in the RDS console.
This doesn't seem to have logs.
Possible solutions include:

- Restart Solr:

  ```
  sudo docker container ls
  sudo docker restart 11c681511841
  ```

- Scale up the Solr instance via EC2.
- Add more swap space to handle memory exhaustion.
- Figure out what queries are running it down and block the offending IPs or users (see below).
- Scale the k8s cluster by adding more nodes to the group. This should fix memory or CPU exhaustion.
- Revert to a known-good deployment by re-running the last good GitHub deployment.
Often, the problem is that Celery has a really long queue that's blowing up Redis. This is a bummer, and you can fix it by deleting the queue (`del celery` or `del etl_tasks`, say), but that indiscriminately blows up everything. A better way is to delete only the bad tasks.
You can do that with the script below, which lets you fiddle with the task name and parameters:
```python
import json

from cl.lib.redis_utils import make_redis_interface

r = make_redis_interface("CELERY")
queue_name = "etl_tasks"
tasks_to_remove = [
    "cl.search.tasks.update_children_docs_by_query",
    "cl.search.tasks.es_save_document",
]
related_instance = "ESRECAPDocument"

# Count all tasks in the queue
total_tasks = r.llen(queue_name)
chunk_size = 500
removed_tasks = 0
checked_tasks = 0

# Calculate the number of chunks based on total_tasks and chunk_size.
# Add an extra chunk to process remaining tasks.
chunks = (total_tasks // chunk_size) + 1

# Remove tasks from the queue in chunks.
for chunk in range(chunks):
    # Adjust the start index based on removed tasks.
    start_index = (chunk * chunk_size) - removed_tasks
    end_index = start_index + chunk_size - 1
    tasks = r.lrange(queue_name, start_index, end_index)
    for task_data in tasks:
        task_json = json.loads(task_data)
        if (
            task_json["headers"]["task"] in tasks_to_remove
            and related_instance in task_json["headers"]["argsrepr"]
        ):
            # Remove the task from the queue.
            r.lrem(queue_name, 1, task_data)
            removed_tasks += 1
        checked_tasks += 1
    print(f"Checked {checked_tasks} and removed {removed_tasks} tasks so far.")

print(f"Successfully removed {removed_tasks} tasks.")
```
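The start-index adjustment in that loop is the subtle part: every `lrem` shifts the remaining items left, so each chunk has to start earlier by the number of tasks already removed. Here's that logic exercised against a plain Python list standing in for the Redis queue (the `FakeQueue` class and task names are purely illustrative):

```python
import json


class FakeQueue:
    """A plain list standing in for a Redis list, just to test the chunk logic."""

    def __init__(self, items):
        self.items = list(items)

    def llen(self):
        return len(self.items)

    def lrange(self, start, end):
        return self.items[start : end + 1]  # Redis LRANGE is end-inclusive

    def lrem(self, value):
        self.items.remove(value)  # remove one matching occurrence


bad = json.dumps({"headers": {"task": "bad_task"}})
good = json.dumps({"headers": {"task": "good_task"}})
q = FakeQueue([bad, good] * 600)  # 1200 tasks, half of them bad

chunk_size = 500
removed = 0
for chunk in range((q.llen() // chunk_size) + 1):
    start = (chunk * chunk_size) - removed  # shift left by what we've removed
    for task_data in q.lrange(start, start + chunk_size - 1):
        if json.loads(task_data)["headers"]["task"] == "bad_task":
            q.lrem(task_data)
            removed += 1

print(removed, q.llen())  # 600 removed, 600 left
```

Without the `- removed` adjustment, each chunk would skip over items that slid into the previous chunk's range, leaving bad tasks behind.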
Celery also keeps a couple of other keys in Redis for tasks that are scheduled, notably the `unacked` queue. You can read a lot more about this here:
This script can be used to remove things from the `unacked` queue. We've used it once before, but by the time we did, the queue was clear, so count this script as untested:
```python
import json

from cl.lib.redis_utils import make_redis_interface

r = make_redis_interface("CELERY")
tasks_to_remove = ["cl.search.tasks.es_save_document"]
removed_tasks = 0
checked_tasks = 0
cursor = 0
while True:
    # Iterate over unacked_index
    cursor, items = r.zscan("unacked_index", cursor=cursor)
    for unack_key, score in items:
        task_value = r.hget("unacked", unack_key)
        task_json = json.loads(task_value)
        if task_json[0]["headers"]["task"] in tasks_to_remove:
            # Remove the task from "unacked_index" and the "unacked" queue.
            r.hdel("unacked", unack_key)
            r.zrem("unacked_index", unack_key)
            removed_tasks += 1
        checked_tasks += 1
    print(f"Checked {checked_tasks} and removed {removed_tasks} unacked tasks so far.")
    if cursor == 0:
        break

print(f"Successfully removed {removed_tasks} unacked tasks.")
```
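As a sanity check on the logic (not a substitute for running it against real Redis), here's the same hash/sorted-set dance against dict-backed stubs. `FakeRedis` and the sample payloads are illustrative only; the payload shape mirrors the `task_json[0]["headers"]["task"]` access in the script above.

```python
import json


class FakeRedis:
    """Dict-backed stand-in for the hash/zset pair Celery uses, for illustration."""

    def __init__(self, unacked, unacked_index):
        self.unacked = dict(unacked)              # hash: delivery tag -> payload
        self.unacked_index = dict(unacked_index)  # zset: delivery tag -> score

    def zscan(self, cursor=0):
        # A small real zset would also come back in one pass like this.
        return 0, list(self.unacked_index.items())

    def hget(self, key):
        return self.unacked[key]

    def hdel(self, key):
        self.unacked.pop(key, None)

    def zrem(self, key):
        self.unacked_index.pop(key, None)


def payload(task_name):
    # The unacked payload is a JSON array; headers live in the first element.
    return json.dumps([{"headers": {"task": task_name}}, {}, {}])


r = FakeRedis(
    unacked={"tag1": payload("cl.search.tasks.es_save_document"),
             "tag2": payload("some.other.task")},
    unacked_index={"tag1": 1.0, "tag2": 2.0},
)

cursor, items = r.zscan()
for unack_key, score in items:
    task_json = json.loads(r.hget(unack_key))
    if task_json[0]["headers"]["task"] == "cl.search.tasks.es_save_document":
        r.hdel(unack_key)
        r.zrem(unack_key)

print(sorted(r.unacked))  # ['tag2']
```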
Usually, Redis's problem is that it runs out of memory. You can see this by loading its memory chart in AWS, and you can analyze it with a few commands:

```
redis-cli -h $REDIS_HOST --bigkeys
```

This will show you some meaningless progress info followed by a summary. Things to look for:

- Are there any particularly huge keys in the summary? Once, we accidentally cached the sitemaps to Redis instead of the DB, and it was bad.
- How big is the `celery` list? It should only have a few items, but sometimes it has a LOT. If it has a lot, Celery has fallen behind and needs more resources either directly (scale it up) or indirectly (the DB can't keep up with it, say).

Sometimes, like in an emergency, you can just delete the large keys with `DEL iauploads`. That worked in #1460.
Another useful command is:

```
redis-cli -h $REDIS_HOST INFO
```

That'll give you an overview of which Redis DBs are using the memory. A clue!
Finally, this is a handy way to delete a lot of keys:

```
redis-cli -h $REDIS_HOST -n 1 --scan --pattern ':1:mlt-cluster*' | xargs redis-cli -n 1 del
```
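If you'd rather do that from Python, the glob matching works the same way. This sketch uses a plain list as a stand-in for the keyspace so the matching is visible; the rough redis-py equivalent (assuming the redis-py client is available) is in the comment.

```python
from fnmatch import fnmatch

# Stand-in for the keys in Redis DB 1; the pattern mirrors the
# redis-cli example above.
keys = [":1:mlt-cluster-123", ":1:mlt-cluster-456", ":1:something-else"]
pattern = ":1:mlt-cluster*"

to_delete = [k for k in keys if fnmatch(k, pattern)]
print(to_delete)  # [':1:mlt-cluster-123', ':1:mlt-cluster-456']

# Against a real server with redis-py, roughly:
#   import redis
#   r = redis.Redis(host=REDIS_HOST, db=1)
#   for key in r.scan_iter(match=":1:mlt-cluster*"):
#       r.delete(key)
```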
- Figure out the cause of the problem.
- Scale the disk so it's faster.
- Scale the instance so it's more powerful.
I'm not good at this, but one place to start is with the `whois` command:

```
whois xxx.xxx.xxx.xxx
```

Add it to the `blocklist` IP Set in the Web Application Firewall.
You can get the username associated with an IP address that's making API requests:

```python
from cl.lib.redis_utils import make_redis_interface


def get_user_by_ip(r, date_str, ip_address):
    # Get the key for the specific day
    key = f"api:v3.d:{date_str}.ip_map"
    # Get the user_id associated with the IP address
    user_id = r.hget(key, ip_address)
    return user_id


r = make_redis_interface("STATS")
get_user_by_ip(r, "2023-06-01", "x.x.x.x")
```
I think this has a bug: it only returns one user ID even if more than one user shares the IP.
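If one day's map comes up empty, it can help to check a few days at once. This sketch loops over a date range using the same key format; the `stats` dict and `users_for_ip` helper are illustrative stand-ins for the STATS Redis, not part of our codebase.

```python
from datetime import date, timedelta

# Dict stub standing in for the STATS Redis hashes; the key format
# mirrors get_user_by_ip above.
stats = {
    "api:v3.d:2023-06-01.ip_map": {"x.x.x.x": "42"},
    "api:v3.d:2023-06-02.ip_map": {"x.x.x.x": "42"},
}


def users_for_ip(start, days, ip):
    """Collect the user IDs seen for an IP over a range of days."""
    found = {}
    for n in range(days):
        day = (start + timedelta(days=n)).isoformat()
        user_id = stats.get(f"api:v3.d:{day}.ip_map", {}).get(ip)
        if user_id:
            found[day] = user_id
    return found


print(users_for_ip(date(2023, 6, 1), 3, "x.x.x.x"))
# {'2023-06-01': '42', '2023-06-02': '42'}
```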
Two approaches:

- If they're abusing the RECAP APIs, simply yank their permission for those APIs.
- If they're abusing a different API, add their name to the throttle override section of the settings, push that to `main`, and deploy it via CI.
We use a CDN, but without caching. Put the viral page behind the CDN with a custom behavior that matches just that page and has caching enabled. If the viral load is very high, you may also see a halo effect of people loading the homepage. If so, put it behind the CDN in the same way. Use the short cache policy that factors in the `sessionid` cookie value.
- Check the general health of the server with `htop`. Are the CPUs pegged? Is there a lot of IO wait?
- Check memory with `free -h`.
- Check IO usage with `sudo iotop`.
- Check logs.