The latest VPA app v5.2.1 is broken #3421

Closed
7 tasks done
nprokopic opened this issue Apr 23, 2024 · 4 comments
Labels
kind/bug, provider/capvcd, provider/cloud-director, provider/cluster-api-aws (Cluster API based running on AWS), provider/cluster-api-azure (Cluster API based running on Azure), provider/vsphere (Related to a VMware vSphere based on-premises solution), team/turtles (Team Turtles)

Comments


nprokopic commented Apr 23, 2024

Summary

⚠️ I believe that the latest release of vertical-pod-autoscaler-app is broken (see giantswarm/vertical-pod-autoscaler-app#281).

It pulls in upstream VPA v1.1.0, which contains this change, which I believe is not working properly (or which uncovered existing issues on our side).

I have tested this on the CAPA MC golem, where the VPA updater was crashlooping in clusters that use vertical-pod-autoscaler-app v5.2.1; the error can be tracked down to the previously mentioned upstream VPA change. Test clusters were deployed with this cluster-aws PR, where default apps are in cluster+cluster-aws and the VPA app is on the latest (I think broken) version. With VPA app v5.1.0 the same setup worked without issues.

The VPA app has already been updated in default-apps-aws in giantswarm/default-apps-aws#455, but luckily that is not yet released (so not yet used in e2e tests, which is why we have not seen the effects of the issue there yet). I believe that this e2e test failure was a genuine one: the e2e tests eventually passed because the VPA updater only crashloops periodically, and after each restart it is ready and running for some time.

Logs

These are the vertical-pod-autoscaler-updater logs after creating the cluster (confirmed multiple times in different clusters):

kubectl logs -n kube-system vertical-pod-autoscaler-updater-54b7fc465b-sm84d
I0422 02:27:21.221867       1 flags.go:57] FLAG: --add-dir-header="false"
I0422 02:27:21.221972       1 flags.go:57] FLAG: --address=":8943"
I0422 02:27:21.221978       1 flags.go:57] FLAG: --alsologtostderr="false"
I0422 02:27:21.221983       1 flags.go:57] FLAG: --evict-after-oom-threshold="10m0s"
I0422 02:27:21.221987       1 flags.go:57] FLAG: --eviction-rate-burst="1"
I0422 02:27:21.221991       1 flags.go:57] FLAG: --eviction-rate-limit="-1"
I0422 02:27:21.221995       1 flags.go:57] FLAG: --eviction-tolerance="0.5"
I0422 02:27:21.222001       1 flags.go:57] FLAG: --in-recommendation-bounds-eviction-lifetime-threshold="12h0m0s"
I0422 02:27:21.222005       1 flags.go:57] FLAG: --kube-api-burst="75"
I0422 02:27:21.222010       1 flags.go:57] FLAG: --kube-api-qps="50"
I0422 02:27:21.222014       1 flags.go:57] FLAG: --kubeconfig=""
I0422 02:27:21.222018       1 flags.go:57] FLAG: --log-backtrace-at=":0"
I0422 02:27:21.222030       1 flags.go:57] FLAG: --log-dir=""
I0422 02:27:21.222035       1 flags.go:57] FLAG: --log-file=""
I0422 02:27:21.222038       1 flags.go:57] FLAG: --log-file-max-size="1800"
I0422 02:27:21.222043       1 flags.go:57] FLAG: --logtostderr="true"
I0422 02:27:21.222047       1 flags.go:57] FLAG: --min-replicas="1"
I0422 02:27:21.222050       1 flags.go:57] FLAG: --one-output="false"
I0422 02:27:21.222054       1 flags.go:57] FLAG: --pod-update-threshold="0.1"
I0422 02:27:21.222059       1 flags.go:57] FLAG: --skip-headers="false"
I0422 02:27:21.222072       1 flags.go:57] FLAG: --skip-log-headers="false"
I0422 02:27:21.222076       1 flags.go:57] FLAG: --stderrthreshold="2"
I0422 02:27:21.222079       1 flags.go:57] FLAG: --updater-interval="1m0s"
I0422 02:27:21.222083       1 flags.go:57] FLAG: --use-admission-controller-status="true"
I0422 02:27:21.222087       1 flags.go:57] FLAG: --v="2"
I0422 02:27:21.222091       1 flags.go:57] FLAG: --vmodule=""
I0422 02:27:21.222094       1 flags.go:57] FLAG: --vpa-object-namespace=""
I0422 02:27:21.222105       1 main.go:82] Vertical Pod Autoscaler 1.1.0 Updater
I0422 02:27:21.323231       1 fetcher.go:99] Initial sync of ReplicaSet completed
I0422 02:27:21.423941       1 fetcher.go:99] Initial sync of StatefulSet completed
I0422 02:27:21.524585       1 fetcher.go:99] Initial sync of ReplicationController completed
I0422 02:27:21.624882       1 fetcher.go:99] Initial sync of Job completed
I0422 02:27:21.724973       1 fetcher.go:99] Initial sync of CronJob completed
I0422 02:27:21.825969       1 fetcher.go:99] Initial sync of DaemonSet completed
I0422 02:27:21.926159       1 fetcher.go:99] Initial sync of Deployment completed
I0422 02:27:21.926307       1 controller_fetcher.go:141] Initial sync of ReplicaSet completed
I0422 02:27:21.926338       1 controller_fetcher.go:141] Initial sync of StatefulSet completed
I0422 02:27:21.926344       1 controller_fetcher.go:141] Initial sync of ReplicationController completed
I0422 02:27:21.926350       1 controller_fetcher.go:141] Initial sync of Job completed
I0422 02:27:21.926355       1 controller_fetcher.go:141] Initial sync of CronJob completed
I0422 02:27:21.926362       1 controller_fetcher.go:141] Initial sync of DaemonSet completed
I0422 02:27:21.926368       1 controller_fetcher.go:141] Initial sync of Deployment completed
W0422 02:27:21.926406       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926420       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926447       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926418       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926450       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926469       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926522       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
I0422 02:27:22.026880       1 updater.go:246] Rate limit disabled
I0422 02:27:22.529602       1 api.go:94] Initial VPA synced successfully
E0422 02:28:22.542486       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.543285       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.543415       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547464       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547525       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547567       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547603       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547646       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547690       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547745       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547780       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547846       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x159129f]

goroutine 1 [running]:
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority.(*scalingDirectionPodEvictionAdmission).LoopInit(0xc000356a80, {0x1a1dda3?, 0xa?, 0x27?}, 0xc00087ee40)
	/gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority/scaling_direction_pod_eviction_admission.go:111 +0x11f
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic.(*updater).RunOnce(0xc000316a50, {0x1c97290, 0xc00023c000})
	/gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic/updater.go:183 +0xb44
main.main()
	/gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/main.go:127 +0x7ef
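
For reference, below is a minimal, self-contained Go sketch of the suspected failure mode. It is not the upstream autoscaler code: the types and the loopInit function are illustrative stand-ins, under the assumption (based on the trace above) that LoopInit dereferences a nested status field, such as a VPA recommendation, which can legitimately be nil for a VPA that has not been reconciled yet.

package main

import "fmt"

// Illustrative, simplified stand-ins for the VPA API types (hypothetical, not
// the real k8s.io/autoscaler/vertical-pod-autoscaler types).
type containerRecommendation struct {
	ContainerName string
}

type podRecommendation struct {
	ContainerRecommendations []containerRecommendation
}

type vpaStatus struct {
	// Recommendation stays nil until the recommender has produced one,
	// e.g. right after cluster creation.
	Recommendation *podRecommendation
}

type vpa struct {
	Name   string
	Status vpaStatus
}

// loopInit mimics the shape of the updater's per-iteration setup. With
// guard=false it reproduces the nil pointer dereference seen in the logs;
// with guard=true it shows the kind of nil check that avoids it.
func loopInit(vpas []vpa, guard bool) {
	for _, v := range vpas {
		if guard && v.Status.Recommendation == nil {
			continue // skip VPAs that have no recommendation yet
		}
		for _, rec := range v.Status.Recommendation.ContainerRecommendations {
			fmt.Printf("vpa=%s container=%s\n", v.Name, rec.ContainerName)
		}
	}
}

func main() {
	vpas := []vpa{{Name: "coredns-vpa"}} // freshly created VPA, no recommendation yet

	loopInit(vpas, true) // fine: the nil recommendation is skipped

	// Panics with "invalid memory address or nil pointer dereference",
	// the same class of crash as in the updater trace above.
	loopInit(vpas, false)
}

If the root cause matches this pattern, a nil check like the guarded path above is the kind of fix to expect upstream. It would also explain the behaviour seen in e2e tests: the updater comes up, reports Ready, and only panics roughly one update interval later (--updater-interval=1m0s in the flags above), so it looks healthy between restarts.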

Mitigation

Tasks

Fixing the issue

Tasks


weseven commented May 2, 2024

I think this has been fixed upstream (kubernetes/autoscaler#6763), but we need to test the update.


weseven commented May 9, 2024

Unfortunately VPA 1.1.1 does not fix this issue for us; we still see the same behaviour, with the vpa-updater pod crashing:

vertical-pod-autoscaler-updater-64874f5854-5r92m updater panic: runtime error: invalid memory address or nil pointer dereference
vertical-pod-autoscaler-updater-64874f5854-5r92m updater [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x159129f]
vertical-pod-autoscaler-updater-64874f5854-5r92m updater
vertical-pod-autoscaler-updater-64874f5854-5r92m updater goroutine 1 [running]:
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority.(*scalingDirectionPodEvictionAdmission).LoopInit(0xc000432528, {0x1a1dda3?, 0xa?, 0x4f646165723a6622?}, 0xc000aa6000)
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority/scaling_direction_pod_eviction_admission.go:111 +0x11f
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic.(*updater).RunOnce(0xc000139130, {0x1c97290, 0xc00023cd20})
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic/updater.go:183 +0xb44
vertical-pod-autoscaler-updater-64874f5854-5r92m updater main.main()
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/main.go:127 +0x7ef


weseven commented May 13, 2024

The issue is still there; it is this one: kubernetes/autoscaler#6808.
There is already a PR fixing it. Once that PR is merged, VPA cuts a new release, and the upstream chart gets updated (or we open a PR to the upstream chart for the new version), we will test again.


weseven commented May 28, 2024

The issue was fixed with upstream VPA 1.1.2, which was released with our VPA app v5.2.2.

It is safe to upgrade the VPA app and the VPA CRDs to their latest versions as of this date (v5.2.2 and v3.1.0, respectively).

weseven closed this as completed May 28, 2024