Add ability to stop containers asynchronously #579

Open: wants to merge 1 commit into main

@yoelcabo (Contributor) commented Nov 12, 2023

Motivation

We've used Kamal to move some long-running jobs (ffmpeg transcoding) from AWS to Hetzner. The savings in $$-to-big-tech are amazing, thanks a lot for this project 🙌

The only problem we have is that we cannot find a way to keep our jobs from being stopped midway while still having quick deployments as part of our monolith's CI/CD system.

The simplest solution I could think of is:

  • Kamal fires a docker stop command with a generous grace period and forgets about my workers.
  • The worker stops picking up new jobs upon receiving the SIGTERM, and exits after finishing the current job.
  • Just in case something goes wrong, Kamal persists the containers that should be dying and makes sure they are dead once the grace period is over.
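To make the second bullet concrete, here is a minimal sketch of a worker loop that stops picking up new jobs after SIGTERM and exits once the current job is done. This is not the worker we run in production; `GracefulWorker`, `fetch_job` and `run_job` are hypothetical names used only for illustration.

```python
import signal
import threading


class GracefulWorker:
    """Runs jobs until a stop is requested, then finishes the current job and returns."""

    def __init__(self, fetch_job, run_job):
        # fetch_job and run_job are hypothetical callables supplied by the application:
        # fetch_job() returns the next job, or None when the queue is empty;
        # run_job(job) runs one job to completion.
        self._fetch_job = fetch_job
        self._run_job = run_job
        self._stop_requested = threading.Event()

    def request_stop(self, signum=None, frame=None):
        # Only sets a flag, so the job in progress is never interrupted.
        self._stop_requested.set()

    def install_signal_handler(self):
        # docker stop sends SIGTERM first and SIGKILL once the grace period is over.
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self, idle_sleep=1.0):
        completed = 0
        # The flag is checked only between jobs, never in the middle of one.
        while not self._stop_requested.is_set():
            job = self._fetch_job()
            if job is None:
                # Nothing to do: wait a bit, but wake up early if a stop arrives.
                self._stop_requested.wait(idle_sleep)
                continue
            self._run_job(job)
            completed += 1
        return completed
```

A real worker would call `install_signal_handler()` once at startup and then `run()`. The grace period passed to docker stop has to be longer than the longest job, otherwise the SIGKILL still lands mid-job.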

I gave a bit more context in this discussion: #491

Implementation

stop_asynchronously

A new configuration flag that can be applied per role: stop_asynchronously. This is more or less how we are using it:

  long_running_workers:
    hosts:
      - MY IPS
    cmd: long_running_job_worker --do-not-pick-jobs-after-sigterm --grace-period-until-sigkill 600
    stop_asynchronously: true

stop_wait_time: 601

Running docker stop in the background

The command that is run is something like:

nohup sh -c 'echo "CONTAINER_IDS" | xargs docker stop -t 3600' > /dev/null 2>&1 & disown
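For readers who want to see how the pieces of that command fit together, here is a rough sketch that assembles the same string from a list of container IDs. It is written in Python purely for illustration and is not the code in this PR; `background_stop_command` is a hypothetical helper. It assumes the remote shell provides `disown` (bash does).

```python
import shlex


def background_stop_command(container_ids, grace_seconds):
    """Build a shell command that runs docker stop in the background and returns at once."""
    if not container_ids:
        raise ValueError("at least one container ID is required")
    # Inner pipeline: feed the IDs to docker stop with the given grace period.
    inner = "echo {} | xargs docker stop -t {}".format(
        shlex.quote(" ".join(container_ids)), int(grace_seconds)
    )
    # nohup plus disown detach the process from the SSH session, so the deploy
    # does not wait for the containers to exit; output is discarded.
    return "nohup sh -c {} > /dev/null 2>&1 & disown".format(shlex.quote(inner))
```

`shlex.quote` takes care of the nested quoting when more than one ID is passed.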

Keeping track of stopped containers

We persist the records in a simple plain text file that looks like this:

# CONTAINER_ID,STOP_TIME
root@web-staging:~# cat .kamal/happyscribe-media-worker-staging-async_stop_records 
4bf1c771bcec,2023-11-12 08:18:20 UTC

If the container is still up in a subsequent deploy that happens after the recorded stop time, then Kamal stops it synchronously.
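The check can be sketched roughly as follows. This is an illustration in Python rather than the code in this PR; `parse_records` and `overdue_containers` are hypothetical names. It assumes STOP_TIME is the moment after which the container should already be gone, and that lines starting with `#` can be skipped.

```python
from datetime import datetime, timezone


def parse_records(text):
    """Parse 'CONTAINER_ID,STOP_TIME' lines into a dict of container ID -> UTC datetime."""
    records = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no record.
        if not line or line.startswith("#"):
            continue
        container_id, stop_time = line.split(",", 1)
        parsed = datetime.strptime(stop_time.strip(), "%Y-%m-%d %H:%M:%S UTC")
        records[container_id] = parsed.replace(tzinfo=timezone.utc)
    return records


def overdue_containers(records, running_ids, now):
    """Return the IDs that are still running although their recorded stop time has passed."""
    return sorted(
        container_id
        for container_id, stop_time in records.items()
        if container_id in running_ids and now > stop_time
    )
```

Anything returned by `overdue_containers` would then be stopped synchronously during the deploy. `running_ids` would come from something like `docker ps -q`.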

Closing thoughts

I understand long-running jobs may not be the focus of this tool, and therefore the extra complexity in this PR might not be justified. The reason I built this is that I'd rather maintain a fork of Kamal than deploy our own Kubernetes cluster. Guidance on reducing this complexity would be greatly appreciated if you think this can be a good fit for Kamal.

Also: I am not sure that the way I encapsulated this behavior is in line with Kamal's architecture. I am more than open to feedback and happy to make any changes that would align it better with the project.

@yoelcabo yoelcabo force-pushed the feat/stop_async branch 2 times, most recently from b78e228 to 7240955 Compare November 14, 2023 11:18