Increase initialDelaySeconds for MOFED POD liveness probe #166

Closed

ykulazhenkov
Collaborator

Before version 1.21, Kubernetes ran the startupProbe on both "fresh" POD
starts and restarts (for example, a restart caused by a failed liveness check).
Starting from v1.21, the startupProbe is applied to "fresh" starts only.
To prevent a crash loop after the MOFED POD restarts, we should give the
POD enough time to fully boot before liveness checking starts.

Signed-off-by: Yury Kulazhenkov <ykulazhenkov@nvidia.com>
# starting from v1.21, Kubernetes doesn't use startupProbe during POD restarts
# to prevent crash loop after POD restarts, we should grant enough time to POD to
# fully boot before we start liveness checking
initialDelaySeconds: 570
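
For context, a minimal sketch of how a startupProbe and a delayed livenessProbe could sit side by side in a container spec. Only `initialDelaySeconds: 570` comes from this PR; the POD and container names, the image, the probe command, and the other numeric values are hypothetical placeholders, not the actual network-operator manifest. The startupProbe numbers are chosen only so that the boot budget works out to the 10 minutes mentioned in the discussion.

```yaml
# Illustrative sketch only: names, image, probe command and all values other
# than initialDelaySeconds: 570 are assumptions, not taken from the repo.
apiVersion: v1
kind: Pod
metadata:
  name: mofed-example                          # hypothetical name
spec:
  containers:
    - name: mofed-container                    # hypothetical name
      image: example.com/mofed:placeholder     # placeholder image
      startupProbe:
        exec:
          command: ["sh", "-c", "test -e /tmp/driver-ready"]   # hypothetical check
        periodSeconds: 10
        failureThreshold: 60                   # 60 x 10 s = up to 600 s to boot
      livenessProbe:
        exec:
          command: ["sh", "-c", "test -e /tmp/driver-ready"]   # hypothetical check
        initialDelaySeconds: 570               # value proposed in this PR
        periodSeconds: 30
        failureThreshold: 1
```

If the startupProbe is skipped on a container restart, `initialDelaySeconds` on the livenessProbe is the only thing that postpones the first liveness check, which is the gap this change tries to cover.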
Collaborator

How did you come up with this number?

Collaborator Author

We discussed this with @AbdYsn. We use the same delay as for the startupProbe: 10 minutes. If I remember correctly, @AbdYsn mentioned that the startupProbe delays were chosen based on experiments in multiple different environments.

Collaborator

@ykulazhenkov can you point me to the Kubernetes commit that changed the behaviour?

Collaborator Author

I think this PR changed the behavior of startupProbe: kubernetes/kubernetes#98376

Collaborator Author

I will ask the community. Maybe this is a regression in Kubernetes, because I can't find any mention that the behavior of startupProbe was supposed to change.

@ykulazhenkov ykulazhenkov marked this pull request as draft April 13, 2021 08:14
Collaborator Author

Need to check with the community whether the current behavior of startupProbe is expected.

Collaborator Author

ykulazhenkov commented Apr 13, 2021

Issue in the Kubernetes repo: kubernetes/kubernetes#101064

Collaborator Author

Issue in Kubernetes confirmed. Fix in progress. No need to change timeouts in network-operator repo. Closed.

@ykulazhenkov ykulazhenkov deleted the fix-mofed-liveness-delay branch November 3, 2021 08:28
3 participants