Resources are sometimes manipulated with the wrong API group #6220

pjestin-sym · 2022-12-14T15:01:19Z

Bug Report

I have a Helm operator that installs releases in multiple namespaces in my K8s cluster. It is working mostly fine, however sometimes, seemingly at random, the release fails. I can see that the operator logged the error below.

It seems that the operator is trying to get the correct resource, but from the wrong API group. I don't know how it could happen, but it seems it is sometimes confusing API groups between resources.

In the example below, the Helm chart that is getting installed has only 2 resources:

A Deployment in API group apps
A ConfigMap in API group ""

Sometimes, at random, the operator will try to manipulate either a Deployment in API group "" or a ConfigMap in API group apps. This fails the release, as Helm tries to manipulate resources that do not exist. When the release is tried again, it might fail again (a different resource might be the problem) or it might succeed.

Eventually, all resources are properly reconciled. The impact of this is that the reconciliation takes significantly more time.

What did you do?

Define a Helm chart with 2 resources
Use the operator SDK Helm operator to reconcile Helm releases in multiple namespaces
Check the operator pod logs

What did you expect to see?

The Helm releases are reconciled successfully with no errors.

What did you see instead? Under which circumstances?

The following errors appear:

could not get object: configmaps.apps "tenant-50139-xpodbridge" is forbidden: User "system:serviceaccount:xpod-op:manager" cannot get resource "configmaps" in API group "apps" in the namespace "tenant-50139"

could not get object: deployments "xpbridge" is forbidden: User "system:serviceaccount:xpod-op:manager" cannot get resource "deployments" in API group "" in the namespace "tenant-50262"
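
To make the mismatch in these two messages explicit, here is a minimal sketch (not part of the operator) that extracts the resource and API group from a "forbidden" message and compares them with the groups the chart actually uses. The `expectedGroup` table and the `checkGroup` helper are hypothetical names introduced for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

// expectedGroup lists the API group each resource in the example chart
// should be served from. The core group is the empty string.
// (Hypothetical table, built from the two resources described in this report.)
var expectedGroup = map[string]string{
	"deployments": "apps",
	"configmaps":  "",
}

// forbiddenRe pulls the resource and API group out of the error text.
var forbiddenRe = regexp.MustCompile(`cannot get resource "([^"]+)" in API group "([^"]*)"`)

// checkGroup returns the resource and group named in msg and whether that
// group differs from the expected one. ok is false when msg does not match
// the pattern or names a resource that is not in expectedGroup.
func checkGroup(msg string) (resource, group string, mismatch, ok bool) {
	m := forbiddenRe.FindStringSubmatch(msg)
	if m == nil {
		return "", "", false, false
	}
	want, known := expectedGroup[m[1]]
	if !known {
		return m[1], m[2], false, false
	}
	return m[1], m[2], m[2] != want, true
}

func main() {
	msgs := []string{
		`could not get object: configmaps.apps "tenant-50139-xpodbridge" is forbidden: User "system:serviceaccount:xpod-op:manager" cannot get resource "configmaps" in API group "apps" in the namespace "tenant-50139"`,
		`could not get object: deployments "xpbridge" is forbidden: User "system:serviceaccount:xpod-op:manager" cannot get resource "deployments" in API group "" in the namespace "tenant-50262"`,
	}
	for _, msg := range msgs {
		resource, group, mismatch, ok := checkGroup(msg)
		fmt.Printf("resource=%q group=%q mismatch=%v recognized=%v\n", resource, group, mismatch, ok)
	}
}
```

Both messages above are flagged as mismatches by this check, which is consistent with the operator asking for the right resource name under the wrong group rather than with a genuine RBAC gap.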

Environment

Operator type:

/language helm

Kubernetes cluster type:

Google Kubernetes Engine

$ operator-sdk version

"v1.26.0", commit: "cbeec475e4612e19f1047ff7014342afe93f60d2", kubernetes version: "1.25.0", go version: "go1.19.3", GOOS: "linux", GOARCH: "amd64"

Docker image: quay.io/operator-framework/helm-operator:v1.26.0

(Note that this also happens with operator-sdk 1.19.1.)

$ kubectl version

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.5-gke.600", GitCommit:"fb4964ee848bc4d25d42d60386c731836059d1d8", GitTreeState:"clean", BuildDate:"2022-09-22T09:24:55Z", GoVersion:"go1.18.6b7", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

The randomness seems to point to a race condition
The issue could be related to Helm, or also to the K8s go client, I'm not sure.


everettraven · 2022-12-21T14:50:30Z

@pjestin-sym Thanks for raising this issue!

I'm not sure what may be causing this. Would you be able to share either your operator or an operator that reliably reproduces this issue?

pjestin-sym · 2023-01-02T10:11:18Z

Hello @everettraven thanks for your reply and happy new year!

I was able to reproduce this issue on a fresh Helm operator with its chart. Here is the repo with instructions: https://github.com/pjestin-sym/api-group-issue-helm-op

The issue seems to be directly linked to the number of resources in the chart, and possibly also to the number of CRs in the cluster (and hence the number of Helm installations).

I hope you can have a look.

pjestin-sym · 2023-01-02T11:28:11Z

Note that the value of max-concurrent-reconciles passed as an argument to the operator pod has a significant effect:

The default value is 2
Setting max-concurrent-reconciles to 1 completely solves the issue
Setting it to 4 multiplies the number of errors.

This seems to point towards conflicts between the operator workers.
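
As a purely illustrative sketch of the kind of conflict suspected here, the following program replays two orderings of writes to a piece of shared state. Nothing in it is taken from Helm, controller-runtime, or operator-sdk: the `mapping` type, the `step` helpers, and the idea that group and resource are written separately are all assumptions made for the example, and the thread does not confirm that this is the actual mechanism.

```go
package main

import "fmt"

// mapping is a hypothetical stand-in for lookup state shared between
// reconcile workers. It is NOT a type from Helm or operator-sdk.
type mapping struct {
	Group    string
	Resource string
}

// step is a single write that a worker performs on the shared mapping.
type step func(m *mapping)

func setGroup(g string) step    { return func(m *mapping) { m.Group = g } }
func setResource(r string) step { return func(m *mapping) { m.Resource = r } }

// run applies the steps, in the order given, to one shared mapping and
// returns the state a worker would read afterwards.
func run(steps ...step) mapping {
	var m mapping
	for _, s := range steps {
		s(&m)
	}
	return m
}

func main() {
	// One worker at a time: the ConfigMap lookup finishes before the
	// Deployment lookup starts, so the final state is consistent.
	serial := run(
		setGroup(""), setResource("configmaps"),
		setGroup("apps"), setResource("deployments"),
	)

	// Two workers interleaved: worker A (Deployment) writes its group,
	// worker B (ConfigMap) overwrites the group, then worker A writes
	// its resource and reads the mapping back.
	interleaved := run(
		setGroup("apps"),           // worker A
		setGroup(""),               // worker B
		setResource("deployments"), // worker A
	)

	fmt.Printf("serial:      resource=%q group=%q\n", serial.Resource, serial.Group)
	fmt.Printf("interleaved: resource=%q group=%q\n", interleaved.Resource, interleaved.Group)
}
```

The interleaved ordering ends with `deployments` paired with the core group, the same shape as the second error in the report. With a single worker, only the serial ordering is possible, which would be consistent with `max-concurrent-reconciles=1` making the errors disappear.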

pjestin-sym · 2023-01-11T10:09:28Z

I was playing around with operator-sdk versions and realized that with version 1.16.0 the problem is there, but to a much lesser extent. Comparing versions 1.16.0 and 1.17.0 with otherwise identical settings, version 1.17.0 produces about 20 times as many errors as version 1.16.0. As a result, the rate of reconciliation is doubled by switching to version 1.16.0.

By locally building the Docker image for helm-operator, I was able to determine that this worsening of the problem is caused by this PR: #5505

This PR updates many dependencies, so I can conclude that one of those dependency bumps is responsible for this issue worsening.

My intuition is pointing towards this change in controller-runtime, even if I was not able to test that hypothesis: kubernetes-sigs/controller-runtime#1695

everettraven · 2023-01-11T14:01:38Z

@pjestin-sym thanks for your analysis! I apologize for my delay in getting around to investigating this further, I just haven't had the time to take a deeper look. I am planning to carve out some time over the next couple days to take a deeper dive into this and some other open issues. I appreciate your patience with this!

everettraven · 2023-01-13T19:35:44Z

So I spent some time digging and was able to trace the error down to the line that reports it:

operator-sdk/internal/helm/release/manager.go

Line 262 in a5d933b

return fmt.Errorf("could not get object: %w", err)

Doing some looking at the surrounding context it seems the problem has something to do with the "helper" that is being used and configured for retrieving resources during release reconciliation:

operator-sdk/internal/helm/release/manager.go

Lines 254 to 255 in a5d933b

    
helper := resource.NewHelper(expected.Client, expected.Mapping)
existing, err := helper.Get(expected.Namespace, expected.Name)

This helper comes from https://pkg.go.dev/k8s.io/cli-runtime/pkg/resource and for some reason seems to be mucking up the GVK occasionally when processing the requests. I'm not too familiar with the internals of the Helm controller and am not really sure where the processing of the GVK goes wrong such that it causes this error (I tried digging into this a bit, but just didn't see anything that would mess up the GVK).
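
One way to narrow this down would be a sanity check just before the lookup that compares the group declared in the manifest's apiVersion with the group the mapping is about to query. The sketch below uses only plain strings so it stays self-contained. `groupOf` and `verifyGroup` are hypothetical helpers, and wiring them to the real `expected` object (for example, reading the group from `expected.Mapping`) is an assumption that would need checking against the actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// groupOf returns the API group encoded in an apiVersion string:
// "apps/v1" yields "apps", and "v1" yields "" (the core group).
func groupOf(apiVersion string) string {
	if i := strings.Index(apiVersion, "/"); i >= 0 {
		return apiVersion[:i]
	}
	return ""
}

// verifyGroup returns an error when the group the mapping would query
// differs from the group implied by the manifest's apiVersion.
// (Hypothetical diagnostic helper, not part of operator-sdk.)
func verifyGroup(kind, apiVersion, mappingGroup string) error {
	if want := groupOf(apiVersion); want != mappingGroup {
		return fmt.Errorf("%s: apiVersion %q implies group %q, but mapping uses group %q",
			kind, apiVersion, want, mappingGroup)
	}
	return nil
}

func main() {
	// A consistent pair: Deployment in apps/v1 with an "apps" mapping.
	fmt.Println(verifyGroup("Deployment", "apps/v1", "apps"))

	// The two inconsistent pairs seen in the reported errors.
	fmt.Println(verifyGroup("ConfigMap", "v1", "apps"))
	fmt.Println(verifyGroup("Deployment", "apps/v1", ""))
}
```

Logging the result of such a check for every object would show whether the mapping is already wrong when it reaches `resource.NewHelper`, or whether it only goes wrong later inside the request.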

@varshaprasad96 Since you have more knowledge on the Helm controller itself, do you happen to have some ideas as to why this helper may be setting the wrong GVK when trying to get a resource?

pjestin-sym · 2023-02-16T10:48:21Z

Hi @varshaprasad96 @everettraven any news on this topic? We have reverted to version 1.16.0 for now, but this issue prevents us from using newer versions.

dweebo · 2023-02-17T16:25:36Z

@everettraven I have spent all morning on this as well, and also narrowed it down to that call to helper.Get but beyond that I got a bit lost in what is happening.

everettraven · 2023-02-20T16:18:25Z

Digging a tiny bit deeper I was able to find that the kube client is a helm one and not a client-go one like I was originally thinking and is defined here: https://pkg.go.dev/helm.sh/helm/v3/pkg/kube#Interface

The Build() function that is used returns https://pkg.go.dev/helm.sh/helm/v3/pkg/kube#ResourceList

The resource.NewHelper takes in a RESTMapping that is being set by the resource.Info that is retrieved by the Build() function and then sets the resource based on that: https://github.com/kubernetes/cli-runtime/blob/bfd3c43351c9870acafbfd30a6ed6f1a52b25bad/pkg/resource/helper.go#L64

It seems like maybe this could be a helm issue? That being said, I'm not super familiar with this low-level of the helm operator interactions so I think I am going to bring this back up during our community issue triage meeting and see if anyone else may have some additional insight.

jberkhahn · 2023-02-21T22:07:13Z

So, I was poking around into this by adding log statements to the Helm Operator to try and figure out if the mapping was wrong or what, and I can't get it to exhibit this behavior on master. Is it possible this got fixed by a dependency bump or something?

hnajib-sym · 2023-05-09T07:13:51Z

After testing operator-sdk versions, it turned out that the issue was fixed by #6026, but at the same time that PR introduced a performance degradation.

I'm still not sure which changes introduced the issue in the first place; the changes in #6026 are unclear.

openshift-bot · 2023-08-07T09:00:38Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2023-09-07T00:30:29Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2023-10-07T08:00:18Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2023-10-07T08:00:21Z

@openshift-bot: Closing this issue.


openshift-ci bot added the language/helm Issue is related to a Helm operator project label Dec 14, 2022

varshaprasad96 assigned everettraven Jan 9, 2023

varshaprasad96 added this to the Backlog milestone Jan 9, 2023

everettraven removed this from the Backlog milestone Feb 20, 2023

jberkhahn self-assigned this Feb 20, 2023

jberkhahn added this to the v1.29.0 milestone Feb 20, 2023

varshaprasad96 modified the milestones: v1.29.0, Backlog Mar 29, 2023

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 7, 2023

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 7, 2023

openshift-ci bot closed this as completed Oct 7, 2023
