Add ability to stop containers asynchronously #579

yoelcabo · 2023-11-12T07:48:25Z

Motivation

We've used Kamal to move some long-running jobs (ffmpeg transcoding) from AWS to Hetzner. The savings in $$-to-big-tech are amazing, thanks a lot for this project 🙌

The only problem we have is that we cannot find a way to prevent our jobs from being stopped midway while also having quick deployments as part of our Monolith's CI/CD system.

The simplest solution I could think of is:

1. Kamal fires a `docker stop` command with a generous grace period and forgets about my workers.
2. The worker stops picking up new jobs upon receiving the SIGTERM, and exits after finishing the current job.
3. Just in case something goes wrong, Kamal persists the containers that should be dying and makes sure they are dead once the grace period is over.
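
To illustrate the worker side of this flow, here is a minimal Ruby sketch of a loop that stops picking up new jobs once SIGTERM arrives but lets the current job finish. `GracefulWorker`, `request_stop`, and the in-memory queue are hypothetical names for illustration; our real worker is a separate program and is not part of this PR.

```ruby
# Hypothetical worker: stops taking new jobs after a stop request,
# but never interrupts the job that is already running.
class GracefulWorker
  attr_reader :processed

  def initialize(queue)
    @queue = queue        # stand-in for a real job queue (array of callables)
    @stopping = false
    @processed = []
  end

  # Safe to call from a signal handler: it only flips a flag.
  def request_stop
    @stopping = true
  end

  def run
    until @stopping
      job = @queue.shift
      break if job.nil?          # queue drained
      @processed << job.call     # the current job always runs to completion
    end
    @processed
  end
end

if __FILE__ == $PROGRAM_NAME
  queue = [-> { :job_one }, -> { :job_two }]
  worker = GracefulWorker.new(queue)
  # docker stop sends SIGTERM first, then SIGKILL once the grace period expires.
  Signal.trap("TERM") { worker.request_stop }
  worker.run
end
```

The grace period passed to `docker stop` has to be longer than the longest job, otherwise the SIGKILL still lands mid-job.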

I gave a bit more context in this discussion: #491

Implementation

`stop_asynchronously`

A new configuration flag that can be applied per role: `stop_asynchronously`. This is more or less how we are using it:

  long_running_workers:
    hosts:
      - MY IPS
    cmd: long_running_job_worker --do-not-pick-jobs-after-sigterm --grace-period-until-sigkill 600
    stop_asynchronously: true

stop_wait_time: 601

Running docker stop in the background

The command that is run is something like:

nohup sh -c 'echo "CONTAINER_IDS" | xargs docker stop -t 3600' > /dev/null 2>&1 & disown
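
As a rough illustration of how such a command string could be assembled, here is a small Ruby sketch. `background_stop_command` is a hypothetical helper, not the method used in this PR, and it makes two simplifying assumptions: it passes the container IDs directly to `docker stop` instead of piping them through `xargs`, and it leaves out `disown`.

```ruby
require "shellwords"

# Hypothetical helper: builds a shell command that stops the given containers
# in the background, so the SSH session can return immediately.
def background_stop_command(container_ids, grace_seconds)
  raise ArgumentError, "no containers to stop" if container_ids.empty?

  ids = container_ids.map { |id| Shellwords.escape(id) }.join(" ")
  inner = "docker stop -t #{Integer(grace_seconds)} #{ids}"

  # nohup keeps the stop running after the session ends;
  # output is discarded and the trailing & puts it in the background.
  "nohup sh -c #{Shellwords.escape(inner)} > /dev/null 2>&1 &"
end

puts background_stop_command(["4bf1c771bcec"], 3600)
```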

Keeping track of stopped containers

We persist the records in a simple plain text file that looks like this:

# CONTAINER_ID,STOP_TIME
root@web-staging:~# cat .kamal/happyscribe-media-worker-staging-async_stop_records 
4bf1c771bcec,2023-11-12 08:18:20 UTC

If the container is still up in a subsequent deploy that happens after the recorded stop time, then Kamal stops it synchronously.
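
The check described above can be sketched in Ruby roughly as follows. This is an illustration under assumptions, not the code in this PR: `AsyncStopRecord`, `parse_async_stop_records`, and `overdue_containers` are hypothetical names, and the sketch treats STOP_TIME as the deadline after which the container should already be gone.

```ruby
require "time"

# Hypothetical record type for one line of the async stop records file.
AsyncStopRecord = Struct.new(:container_id, :stop_time)

# Parses "CONTAINER_ID,STOP_TIME" lines, skipping blanks and "#" comments.
def parse_async_stop_records(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#") }
      .map do |line|
        id, stamp = line.split(",", 2)
        AsyncStopRecord.new(id, Time.parse(stamp))
      end
end

# Returns IDs of containers whose recorded stop time has passed
# but which are still running, so they can be stopped synchronously.
def overdue_containers(records, running_ids, now: Time.now.utc)
  records
    .select { |r| r.stop_time <= now && running_ids.include?(r.container_id) }
    .map(&:container_id)
end
```

In a deploy, `running_ids` would come from something like `docker ps -q`, and any ID returned by `overdue_containers` would get a regular, blocking stop.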

Closing thoughts

I understand long-running jobs may not be the focus of this tool, and therefore the extra complexity in this PR might not be justified. The reason I built this is that I'd rather maintain a fork of Kamal than deploy our own Kubernetes cluster. Guidance on reducing this complexity would be greatly appreciated if you think this can be a good fit for Kamal.

Also: I am not sure that the way I encapsulated this behavior is in line with Kamal's architecture. I am more than open to feedback and happy to make any modifications that would align it better with the project.

yoelcabo force-pushed the feat/stop_async branch 2 times, most recently from b78e228 to 7240955 on November 14, 2023 11:18

feat: add ability to stop containers asynchronously

535b5e1

yoelcabo force-pushed the feat/stop_async branch from 7240955 to 535b5e1 on November 14, 2023 11:19
