The latest VPA app v5.2.1 is broken #3421

Closed
7 tasks done
nprokopic opened this issue Apr 23, 2024 · 4 comments
Labels
kind/bug, provider/capvcd, provider/cloud-director, provider/cluster-api-aws (Cluster API based running on AWS), provider/cluster-api-azure (Cluster API based running on Azure), provider/vsphere (Related to a VMware vSphere based on-premises solution), team/turtles (Team Turtles)

Comments


nprokopic commented Apr 23, 2024

Summary

⚠️ I believe that the latest release of vertical-pod-autoscaler-app is broken (see giantswarm/vertical-pod-autoscaler-app#281).

It pulls in upstream VPA v1.1.0, which contains this change, which I believe is not working properly (or which uncovered existing issues on our side).

I have tested this on the CAPA MC golem, where the VPA updater was crashlooping in clusters that use vertical-pod-autoscaler-app v5.2.1; the error can be tracked down to the previously mentioned upstream VPA change. Test clusters were deployed with this cluster-aws PR, where default apps are in cluster+cluster-aws and the VPA app is on the latest (I think broken) version. With VPA app v5.1.0 the same setup worked without issues.

The VPA app has already been updated in default-apps-aws in giantswarm/default-apps-aws#455, but luckily that is not yet released (so not yet used in e2e tests, which is why we have not seen the effects of the issue there yet). I believe that this e2e test failure was a genuine one: the e2e tests eventually passed because the VPA updater only crashloops periodically, and after each restart it is ready and running for some time.

Logs

These are the vertical-pod-autoscaler-updater logs after creating the cluster (confirmed multiple times in different clusters):

kubectl logs -n kube-system vertical-pod-autoscaler-updater-54b7fc465b-sm84d
I0422 02:27:21.221867       1 flags.go:57] FLAG: --add-dir-header="false"
I0422 02:27:21.221972       1 flags.go:57] FLAG: --address=":8943"
I0422 02:27:21.221978       1 flags.go:57] FLAG: --alsologtostderr="false"
I0422 02:27:21.221983       1 flags.go:57] FLAG: --evict-after-oom-threshold="10m0s"
I0422 02:27:21.221987       1 flags.go:57] FLAG: --eviction-rate-burst="1"
I0422 02:27:21.221991       1 flags.go:57] FLAG: --eviction-rate-limit="-1"
I0422 02:27:21.221995       1 flags.go:57] FLAG: --eviction-tolerance="0.5"
I0422 02:27:21.222001       1 flags.go:57] FLAG: --in-recommendation-bounds-eviction-lifetime-threshold="12h0m0s"
I0422 02:27:21.222005       1 flags.go:57] FLAG: --kube-api-burst="75"
I0422 02:27:21.222010       1 flags.go:57] FLAG: --kube-api-qps="50"
I0422 02:27:21.222014       1 flags.go:57] FLAG: --kubeconfig=""
I0422 02:27:21.222018       1 flags.go:57] FLAG: --log-backtrace-at=":0"
I0422 02:27:21.222030       1 flags.go:57] FLAG: --log-dir=""
I0422 02:27:21.222035       1 flags.go:57] FLAG: --log-file=""
I0422 02:27:21.222038       1 flags.go:57] FLAG: --log-file-max-size="1800"
I0422 02:27:21.222043       1 flags.go:57] FLAG: --logtostderr="true"
I0422 02:27:21.222047       1 flags.go:57] FLAG: --min-replicas="1"
I0422 02:27:21.222050       1 flags.go:57] FLAG: --one-output="false"
I0422 02:27:21.222054       1 flags.go:57] FLAG: --pod-update-threshold="0.1"
I0422 02:27:21.222059       1 flags.go:57] FLAG: --skip-headers="false"
I0422 02:27:21.222072       1 flags.go:57] FLAG: --skip-log-headers="false"
I0422 02:27:21.222076       1 flags.go:57] FLAG: --stderrthreshold="2"
I0422 02:27:21.222079       1 flags.go:57] FLAG: --updater-interval="1m0s"
I0422 02:27:21.222083       1 flags.go:57] FLAG: --use-admission-controller-status="true"
I0422 02:27:21.222087       1 flags.go:57] FLAG: --v="2"
I0422 02:27:21.222091       1 flags.go:57] FLAG: --vmodule=""
I0422 02:27:21.222094       1 flags.go:57] FLAG: --vpa-object-namespace=""
I0422 02:27:21.222105       1 main.go:82] Vertical Pod Autoscaler 1.1.0 Updater
I0422 02:27:21.323231       1 fetcher.go:99] Initial sync of ReplicaSet completed
I0422 02:27:21.423941       1 fetcher.go:99] Initial sync of StatefulSet completed
I0422 02:27:21.524585       1 fetcher.go:99] Initial sync of ReplicationController completed
I0422 02:27:21.624882       1 fetcher.go:99] Initial sync of Job completed
I0422 02:27:21.724973       1 fetcher.go:99] Initial sync of CronJob completed
I0422 02:27:21.825969       1 fetcher.go:99] Initial sync of DaemonSet completed
I0422 02:27:21.926159       1 fetcher.go:99] Initial sync of Deployment completed
I0422 02:27:21.926307       1 controller_fetcher.go:141] Initial sync of ReplicaSet completed
I0422 02:27:21.926338       1 controller_fetcher.go:141] Initial sync of StatefulSet completed
I0422 02:27:21.926344       1 controller_fetcher.go:141] Initial sync of ReplicationController completed
I0422 02:27:21.926350       1 controller_fetcher.go:141] Initial sync of Job completed
I0422 02:27:21.926355       1 controller_fetcher.go:141] Initial sync of CronJob completed
I0422 02:27:21.926362       1 controller_fetcher.go:141] Initial sync of DaemonSet completed
I0422 02:27:21.926368       1 controller_fetcher.go:141] Initial sync of Deployment completed
W0422 02:27:21.926406       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926420       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926447       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926418       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926450       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926469       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
W0422 02:27:21.926522       1 shared_informer.go:459] The sharedIndexInformer has started, run more than once is not allowed
I0422 02:27:22.026880       1 updater.go:246] Rate limit disabled
I0422 02:27:22.529602       1 api.go:94] Initial VPA synced successfully
E0422 02:28:22.542486       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.543285       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.543415       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547464       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547525       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547567       1 api.go:153] fail to get pod controller: pod=etcd-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547603       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547646       1 api.go:153] fail to get pod controller: pod=kube-scheduler-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547690       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547745       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-229-221.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-229-221.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547780       1 api.go:153] fail to get pod controller: pod=kube-apiserver-ip-10-0-171-129.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-171-129.eu-west-2.compute.internal, last error node is not a valid owner
E0422 02:28:22.547846       1 api.go:153] fail to get pod controller: pod=kube-controller-manager-ip-10-0-82-3.eu-west-2.compute.internal err=Unhandled targetRef v1 / Node / ip-10-0-82-3.eu-west-2.compute.internal, last error node is not a valid owner
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x159129f]

goroutine 1 [running]:
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority.(*scalingDirectionPodEvictionAdmission).LoopInit(0xc000356a80, {0x1a1dda3?, 0xa?, 0x27?}, 0xc00087ee40)
	/gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority/scaling_direction_pod_eviction_admission.go:111 +0x11f
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic.(*updater).RunOnce(0xc000316a50, {0x1c97290, 0xc00023c000})
	/gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic/updater.go:183 +0xb44
main.main()
	/gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/main.go:127 +0x7ef
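
For reference, below is a minimal, self-contained Go sketch of the suspected failure mode. It is not the upstream autoscaler code: the types and the loopInit function are illustrative stand-ins, under the assumption (based on the trace above) that LoopInit dereferences a nested status field, such as a VPA recommendation, which can legitimately be nil for a VPA that has not been reconciled yet.

package main

import "fmt"

// Illustrative, simplified stand-ins for the VPA API types (hypothetical, not
// the real k8s.io/autoscaler/vertical-pod-autoscaler types).
type containerRecommendation struct {
	ContainerName string
}

type podRecommendation struct {
	ContainerRecommendations []containerRecommendation
}

type vpaStatus struct {
	// Recommendation stays nil until the recommender has produced one,
	// e.g. right after cluster creation.
	Recommendation *podRecommendation
}

type vpa struct {
	Name   string
	Status vpaStatus
}

// loopInit mimics the shape of the updater's per-iteration setup. With
// guard=false it reproduces the nil pointer dereference seen in the logs;
// with guard=true it shows the kind of nil check that avoids it.
func loopInit(vpas []vpa, guard bool) {
	for _, v := range vpas {
		if guard && v.Status.Recommendation == nil {
			continue // skip VPAs that have no recommendation yet
		}
		for _, rec := range v.Status.Recommendation.ContainerRecommendations {
			fmt.Printf("vpa=%s container=%s\n", v.Name, rec.ContainerName)
		}
	}
}

func main() {
	vpas := []vpa{{Name: "coredns-vpa"}} // freshly created VPA, no recommendation yet

	loopInit(vpas, true) // fine: the nil recommendation is skipped

	// Panics with "invalid memory address or nil pointer dereference",
	// the same class of crash as in the updater trace above.
	loopInit(vpas, false)
}

If the root cause matches this pattern, a nil check like the guarded path above is the kind of fix to expect upstream. It would also explain the behaviour seen in e2e tests: the updater comes up, reports Ready, and only panics roughly one update interval later (--updater-interval=1m0s in the flags above), so it looks healthy between restarts.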

Mitigation

Tasks

Fixing the issue

Tasks


weseven commented May 2, 2024

I think this has been fixed upstream (kubernetes/autoscaler#6763), but we need to test the update.


weseven commented May 9, 2024

Unfortunately VPA 1.1.1 does not fix this issue for us; we still see the same behaviour, with the vpa-updater pod crashing:

vertical-pod-autoscaler-updater-64874f5854-5r92m updater panic: runtime error: invalid memory address or nil pointer dereference
vertical-pod-autoscaler-updater-64874f5854-5r92m updater [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x159129f]
vertical-pod-autoscaler-updater-64874f5854-5r92m updater
vertical-pod-autoscaler-updater-64874f5854-5r92m updater goroutine 1 [running]:
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority.(*scalingDirectionPodEvictionAdmission).LoopInit(0xc000432528, {0x1a1dda3?, 0xa?, 0x4f646165723a6622?}, 0xc000aa6000)
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/priority/scaling_direction_pod_eviction_admission.go:111 +0x11f
vertical-pod-autoscaler-updater-64874f5854-5r92m updater k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic.(*updater).RunOnce(0xc000139130, {0x1c97290, 0xc00023cd20})
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/logic/updater.go:183 +0xb44
vertical-pod-autoscaler-updater-64874f5854-5r92m updater main.main()
vertical-pod-autoscaler-updater-64874f5854-5r92m updater        /gopath/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/updater/main.go:127 +0x7ef


weseven commented May 13, 2024

The issue is still there; it is this one: kubernetes/autoscaler#6808.
There is already a PR fixing it. Once that PR is merged, VPA cuts a new release, and the upstream chart gets updated (or we open a PR to the upstream chart for the new version), we will test again.


weseven commented May 28, 2024

The issue was fixed with upstream VPA 1.1.2, which was released with our VPA app v5.2.2.

It is safe to upgrade the VPA app and the VPA CRDs to their latest versions as of this date (v5.2.2 and v3.1.0, respectively).

weseven closed this as completed May 28, 2024