restart_policy max_attempts seems backwards. #45039
chrisbecke
started this conversation in
General
Replies: 0 comments
The guidance in Docker Bench is that restart_policy.max_attempts should be set to a small number - such as 5.
This just causes problems in production swarms, where long-running tasks might be restarted because they fail health checks or, even worse, are migrated because their node is restarted as part of routine maintenance.
This means each task now carries a hidden counter that, after weeks or even months, will prevent it from being restarted.
It is difficult to imagine where this behaviour is desired. The only way to avoid it is to set max_attempts to 0, which allows infinite restarts. That is not desirable either, because it does not detect and stop a service that is being restarted in a fail loop due to configuration drift or some other infrastructure error.
There is a window parameter, but rather than counting errors within a window to determine whether a service is stuck in a fail loop, the window allows restarts that fall inside it to go uncounted.
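To make the difference concrete, here is a rough Python sketch of the two counting strategies. It only models the behaviour as I describe it above and is not Docker's actual implementation; the function names, the 30-day failure spacing and the one-hour window are all made up for illustration.

```python
from collections import deque

DAY = 24 * 60 * 60  # seconds


def lifetime_give_up_time(failure_times, max_attempts):
    """Hypothetical model: every failure counts forever.

    Returns the time of the first failure that exceeds max_attempts,
    or None if the limit is never exceeded.
    """
    for count, t in enumerate(failure_times, start=1):
        if count > max_attempts:
            return t
    return None


def windowed_give_up_time(failure_times, max_attempts, window):
    """Hypothetical model: only failures in the last `window` seconds count.

    Returns the time at which more than max_attempts failures fall
    inside one window, or None if that never happens.
    """
    recent = deque()
    for t in failure_times:
        recent.append(t)
        # Forget failures that are older than the window.
        while recent and t - recent[0] > window:
            recent.popleft()
        if len(recent) > max_attempts:
            return t
    return None


# One failure a month for seven months (health check blip, node maintenance).
routine = [i * 30 * DAY for i in range(1, 8)]

# A genuine fail loop: one failure every 10 seconds.
crash_loop = [i * 10 for i in range(10)]
```

With max_attempts set to 5, the lifetime counter gives up on the `routine` task at its sixth failure, six months in, while the windowed counter never does. Both stop the `crash_loop` task at its sixth failure, 50 seconds in. The windowed version is the behaviour I would expect from something meant to catch fail loops.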
Clearly someone thought this makes sense, but as someone who runs swarm in production I don't get it. As it stands, restart_policy.max_attempts does NOT so much catch and stop services that have entered an error state as stop running tasks after months of operation.
What gives?