Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: make suspend and restart delay configurable, drop to 5s from 30s #26995

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pH14
Copy link
Contributor

@pH14 pH14 commented May 8, 2024

#26139 + https://github.com/MaterializeInc/cloud/issues/8965 have the full context, but the tl;dr is that sources can take a long time reconnect after momentary network blips. This means <1s of network downtime can turn into >60s of actual source/sink unavailability. One of the primary sources of downtime is the fixed 30s SUSPEND_AND_RESTART_DELAY.

This PR makes two small changes to this knob:

  1. Makes it configurable
  2. Drops the default to 5s, down from 30s

A more advanced solution would be using exponential backoff, but it's not quite clear to me how to make that work within the healthcheck operator. Tossing this up as a proposal, and will let a member of @MaterializeInc/storage determine whether it's sufficient 😄

Motivation

In part addresses #26139, though it'd be nice to move to exponential backoff eventually.

Tips for reviewer

Checklist

@pH14 pH14 requested a review from a team as a code owner May 8, 2024 20:29
Copy link
Contributor

@guswynn guswynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing backoff IS quite a bit more annoying, you are right! I am personally fine with this, will get sign off with rest of the team

@pH14 pH14 force-pushed the make-suspend-and-restart-configurable branch from ce2233a to 2dd0157 Compare May 13, 2024 18:16
Copy link
Contributor

@guswynn guswynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pH14 looks good!

The ci failure is because that test is checking against mz_sink_statuses, which cycles through running and stalled; instead it should check that mz_sink_status_history has stalled in its history

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants