storage: make suspend and restart delay configurable, drop to 5s from 30s #26995
#26139 + https://github.com/MaterializeInc/cloud/issues/8965 have the full context, but the tl;dr is that sources can take a long time to reconnect after momentary network blips. This means <1s of network downtime can turn into >60s of actual source/sink unavailability. One of the primary sources of downtime is the fixed 30s SUSPEND_AND_RESTART_DELAY. This PR makes two small changes to this knob:
- Makes the delay configurable.
- Drops the delay from 30s to 5s.
A more advanced solution would be using exponential backoff, but it's not quite clear to me how to make that work within the healthcheck operator. Tossing this up as a proposal, and will let a member of @MaterializeInc/storage determine whether it's sufficient 😄
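For reference, the exponential backoff idea could look roughly like the sketch below. It is not part of this PR and does not touch the healthcheck operator: `backoff_delay` is a hypothetical helper, and the 5s base and 30s cap are only illustrative values borrowed from the numbers above.

```rust
use std::time::Duration;

/// Hypothetical helper (an assumption, not code from this PR): the delay to
/// wait before the `attempt`-th restart (0-based). The delay starts at `base`,
/// doubles on every attempt, and never exceeds `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating at u32::MAX so large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    // If the multiplication itself overflows, fall back to the cap.
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}
```

With a 5s base and a 30s cap, the first restart would wait 5s, then 10s, then 20s, and every later restart would wait 30s. A caller would also need to reset the attempt counter once the source is healthy again, which is the part that is unclear within the healthcheck operator.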
Motivation
In part addresses #26139, though it'd be nice to move to exponential backoff eventually.
Tips for reviewer
Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.