Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DaemonSet controller respects MaxSurge during update #96441

Merged
merged 5 commits into from Mar 6, 2021

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Nov 11, 2020

If MaxSurge is set, the controller will attempt to launch updated pods on up to MaxSurge nodes and wait for them to be ready before deleting the old pods. If any old pod goes unready during update, a new pod is added to the node (regardless of MaxSurge) and the old pod is deleted.

This implements the logic behind #96375

/kind feature

DaemonSets accept a MaxSurge integer or percent on their rolling update strategy that will launch the updated pod on nodes and wait for those pods to go ready before marking the old out-of-date pods as deleted. This allows workloads to avoid downtime during upgrades when deployed using DaemonSets. This feature is alpha and is behind the DaemonSetUpdateSurge feature gate.
- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/1591-daemonset-surge

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 11, 2020
@smarterclayton smarterclayton changed the title DaemonSet controller respects MaxSurge during update WIP: DaemonSet controller respects MaxSurge during update Nov 11, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2020
@smarterclayton
Copy link
Contributor Author

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 11, 2020
@smarterclayton smarterclayton force-pushed the daemonset_surge_impl branch 2 times, most recently from c6422a6 to 6c69080 Compare November 11, 2020 02:43
@smarterclayton
Copy link
Contributor Author

Added DaemonSetUpdateSurge to the e2e alpha suite in kubernetes/test-infra#19914

@smarterclayton
Copy link
Contributor Author

/assign @kubernetes/sig-apps-pr-reviews

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 12, 2020
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 17, 2020
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jan 19, 2021
@smarterclayton
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-alpha-features

@smarterclayton
Copy link
Contributor Author

Ok, this is ready for review (all known issues are addressed, the test is stable, logging should be accurate and at the right levels). I'm fairly confident that this is in alpha shape at least.

@jpbetz
Copy link
Contributor

jpbetz commented Feb 9, 2021

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 9, 2021
@soltysh
Copy link
Contributor

soltysh commented Feb 12, 2021

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 12, 2021
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lgtm, leaving final tag to @kow3ns

It is too easy to omit checking the return value for the
syncAndValidateDaemonSet test in large suites. Switch the method
type to be a test helper and fatal/error directly. Also rename
a method that referenced the old name 'Rollback' instead of
'RollingUpdate'.
While this is correct in order of operations, it is harder to read
and masks the intent of the user without the parenthesis.
If MaxSurge is set, the controller will attempt to double up nodes
up to the allowed limit with a new pod, and then when the most recent
(by hash) pod is ready, trigger deletion on the old pod. If the old
pod goes unready before the new pod is ready, the old pod is immediately
deleted. If an old pod goes unready before a new pod is placed on that
node, a new pod is immediately added for that node even past the MaxSurge
limit.

The backoff clock is used consistently throughout the daemonset controller
as an injectable clock for the purposes of testing.
In order to maintain the correct invariants, the existing maxUnavailable
logic calculated the same data several times in different ways. Leverage
the simpler structure from maxSurge and calculate pod availability only
once, as well as perform only a single pass over all the pods in the
daemonset. This changed no behavior of the current controller, and
has a structure that is almost identical to maxSurge.
The nodeShouldRunDaemonPod method does not need to return an error
because there are no scenarios under which it fails. Remove the
error return path for its direct calls as well.
@smarterclayton
Copy link
Contributor Author

/refresh

@smarterclayton
Copy link
Contributor Author

/refresh
/test pull-kubernetes-e2e-gce-alpha-features

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having heard no objections for the past weeks, I'm tagging as is.
/lgtm
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

3 similar comments
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 377ed3c into kubernetes:master Mar 6, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants