
fix(cilium): cilium-operator must be able to patch K8S nodes resources #15470

Open
wants to merge 2 commits into master

Conversation


@maximumG maximumG commented Jun 5, 2023

Fixes #15464.

As per the Cilium recommended installation, the cilium-operator pod is in charge of removing the `node.cilium.io/agent-not-ready` taint that can be put on nodes upon startup. As such, the operator needs the PATCH permission on the node resource.
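
For context, a minimal sketch of the kind of RBAC rules under discussion, assuming the usual cilium-operator ClusterRole layout (illustrative, not the exact kOps addon template):

```yaml
# Illustrative ClusterRole fragment for cilium-operator; not the exact kOps addon template.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-operator
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch", "patch"]   # "patch" is what allows removing the taint
  - apiGroups: [""]
    resources: ["nodes/status"]
    verbs: ["patch"]                           # for the NodeNetworkUnavailable condition, questioned later in this thread
```
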
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 5, 2023
@k8s-ci-robot
Contributor

Hi @maximumG. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johngmyers for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hakman
Member

hakman commented Jun 5, 2023

/ok-to-test
/cc @olemarkus

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 5, 2023
@maximumG
Author

maximumG commented Jun 5, 2023

/retest-required

@hakman
Member

hakman commented Jun 5, 2023

/test pull-kops-e2e-aws-upgrade-126-ko126-to-klatest-kolatest-many-addons

@hakman
Member

hakman commented Jun 5, 2023

@maximumG To fix the failing tests, you need to run hack/update-expected.sh and commit the changes.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 6, 2023
@maximumG
Author

maximumG commented Jun 6, 2023

/retest

@olemarkus
Member

But that taint is not used in our setup. Does it then need this permission?

@maximumG
Author

maximumG commented Jun 7, 2023

But that taint is not used in our setup. Does it then need this permission?

I must agree that by default the Cilium taint is not configured. However, I would argue that users are able to customize taints per kOps instance group, leaving them the possibility to add this specific taint.

Actually, this is exactly what we tried out 😃: we put this taint on our nodes to ensure no workload got scheduled before the cilium-agent was ready, and we were surprised that it was not removed after all.
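
For illustration, such a customization would look roughly like this on a kOps instance group (the name, sizes, and machine type are placeholders):

```yaml
# Hypothetical kOps InstanceGroup carrying the Cilium startup taint; names and sizes are placeholders.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-us-east-1a
  labels:
    kops.k8s.io/cluster: example.k8s.local
spec:
  role: Node
  machineType: t3.medium
  minSize: 1
  maxSize: 3
  taints:
    - node.cilium.io/agent-not-ready=true:NoExecute   # expected to be removed by cilium-operator once the agent is ready
```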

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 21, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@maximumG
Author

maximumG commented Jul 7, 2023

@olemarkus Do you need additional info to validate this PR on your side?

@h3poteto
Contributor

I also have the same issue, and I can understand what @olemarkus says.

The node.cilium.io/agent-not-ready taint is only used when we use ENI as IPAM in Cilium, and for now we need to add the taint manually. Nevertheless, is it correct for kOps to prepare only the permissions?
I'm thinking of 3 options:

  1. Just write a document about the taint and permissions.
    • This is an optional usage.
    • So users need to add the taint and create ClusterRole for the cilium-operator manually.
  2. The permission and taint are automatically added by kOps when users specify ENI as IPAM (see the sketch below).
    • The taint is automatically added by kOps only when users specify networking.cilium.ipam = eni.
    • At that time, permissions are also added by kOps.
  3. Same as this PR.
    • Users add the taint manually.
    • But the permission required at that time is prepared by kOps.

What do you think?
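
As a sketch of option 2, the relevant kOps cluster spec fragment would look roughly like this (the cluster name is a placeholder):

```yaml
# Hypothetical kOps Cluster spec fragment selecting Cilium ENI IPAM; metadata is a placeholder.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: example.k8s.local
spec:
  networking:
    cilium:
      ipam: eni   # under option 2, kOps would add the taint and permissions when this is set
```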

@hakman
Member

hakman commented Jul 30, 2023

@h3poteto @maximumG Added to the topics list for kOps Office Hours next week.

@hakman
Member

hakman commented Jul 30, 2023

@johngmyers Any thoughts?

@johngmyers
Member

Why is nodes/status needed?

The referenced documentation describes a situation where there is a second CNI that Cilium has to wrest control from, but that shouldn't apply to kOps.

If an extra taint were useful in some common configuration, I would prefer that kOps automatically apply it, in addition to giving cilium-operator the permission to remove it.

@zadjadr
Contributor

zadjadr commented Aug 4, 2023

I believe this is needed long term anyway, because this change was introduced in v1.14:

The Cilium operator now taints nodes where Cilium is scheduled to run but is not running.
This prevents pods from being scheduled on nodes without Cilium.
The CNI configuration file is no longer removed on agent shutdown.
This means that pod deletion will always succeed; previously it would fail if Cilium was down for an upgrade.
This should help prevent nodes accidentally entering an unmanageable state.
It also means that nodes are not removed from cloud LoadBalancer backends during Cilium upgrades.

cilium/cilium#23486

@johngmyers
Member

When Cilium requires it, we will need to add the permission.

@maximumG
Author

maximumG commented Aug 7, 2023

Why is nodes/status needed?

On our production K8s clusters, we had some weird scheduling on newly joining nodes. Basically, business pods were sometimes scheduled before the Cilium DaemonSet, exhausting the node resources (CPU/memory). As a result, the cilium-agent was not able to get scheduled on the node, triggering another node scale-up.

With this taint added and removed by the cilium-operator, we would avoid such a situation.

@olemarkus
Member

If your business pods get scheduled before the Cilium DS, you must have specific tolerations for that. If you, e.g., tolerate any taint, adding more taints won't work.

@johngmyers
Member

With this taint added and removed by the cilium-operator, we would avoid such situation.

Taints are not under nodes/status. So I ask again: why is nodes/status needed?
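
For orientation, on a Node object the taint lives under spec.taints (patched via the nodes resource), while conditions such as NetworkUnavailable live under status.conditions (patched via nodes/status). An abbreviated, illustrative Node:

```yaml
# Abbreviated Node object, for orientation only.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-0-1.example.internal   # placeholder
spec:
  taints:                              # removing an entry here is a patch on the "nodes" resource
    - key: node.cilium.io/agent-not-ready
      value: "true"
      effect: NoExecute
status:
  conditions:                          # updating an entry here is a patch on "nodes/status"
    - type: NetworkUnavailable
      status: "False"
      reason: CiliumIsUp               # illustrative reason string
```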

@zadjadr
Contributor

zadjadr commented Aug 8, 2023

@johngmyers node/status is used to annotate the node. But I don't know if this is needed 100%.

@maximumG
Author

maximumG commented Aug 8, 2023

node/status is used to annotate the node. But I don't know if this is needed 100%.

Looking at the latest Cilium Helm documentation for v1.12 here, I would say that this permission is indeed useless. The annotateK8sNode value is set to false by default, and AFAIK there is no way to modify it using the kOps Cilium integration.

@h3poteto
Contributor

h3poteto commented Aug 9, 2023

I think the nodes/status permission is requested here: https://github.com/cilium/cilium/blob/5b92dce1b003c64fa5ba36cc2f0fc355632f0cf5/operator/watchers/node_taint.go#L298-L327

And the removal of agent-not-ready is here: https://github.com/cilium/cilium/blob/5b92dce1b003c64fa5ba36cc2f0fc355632f0cf5/operator/watchers/node_taint.go#L345-L398

So maybe nodes/status is not necessary to manage the agent-not-ready taint. But they wrote the reason why NodeNetworkUnavailable is needed:

// setNodeNetworkUnavailableFalse sets Kubernetes NodeNetworkUnavailable to
// false as Cilium is managing the network connectivity.
// https://kubernetes.io/docs/concepts/architecture/nodes/#condition
// This is because some clusters (notably GCP) come up with a NodeNetworkUnavailable condition set
// and the network provider is expected to remove this manually.

@maximumG
Author

Based on @h3poteto's comment, I would still say that this permission is actually useless (and sorry for pushing this PR).

Basically, if we want to avoid scheduling pods on nodes where Cilium is not yet ready, we just have to put the taint node.kubernetes.io/network-unavailable:NoSchedule on the kOps instance group. As soon as Cilium is ready on the node, the cilium-operator will notice it and update the NodeNetworkUnavailable condition to false. Kubernetes will then remove the node.kubernetes.io/network-unavailable:NoSchedule taint by itself and pods can be scheduled properly.

As already stated by some people here, the node.cilium.io/agent-not-ready taint is only useful in case of CNI chaining, where the first CNI (handling the IPAM) updates the NodeNetworkUnavailable condition to false without waiting for Cilium to be up and running.

Does it make sense?
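
A minimal sketch of the instance group configuration described above, with placeholder names:

```yaml
# Hypothetical InstanceGroup relying on the condition-backed taint instead of the Cilium-specific one.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
  labels:
    kops.k8s.io/cluster: example.k8s.local
spec:
  role: Node
  taints:
    # Expected to be cleared once cilium-operator sets the NodeNetworkUnavailable condition to false.
    - node.kubernetes.io/network-unavailable:NoSchedule
```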

@k8s-ci-robot
Contributor

@maximumG: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                   Commit    Required  Rerun command
pull-kops-kubernetes-e2e-ubuntu-gce-build   00e6914   true      /test pull-kops-kubernetes-e2e-ubuntu-gce-build
pull-kops-verify-hashes                     00e6914   true      /test pull-kops-verify-hashes

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 17, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 17, 2024
Labels

  • area/addons
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/office-hours
  • lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

cilium-operator is missing RBAC permission to remove node.cilium.io/agent-not-ready taint
8 participants