[etcd] Bump etcd client to 3.5.1 #106589
/assign |
we are in code freeze for 1.23 so this must happen for 1.24. /milestone v1.24 |
unless this is considered a critical bug without workarounds? |
right, we updated the server to 3.5.1 recently (i forgot). tagging for release triage. /triage accepted |
/retitle [etcd] Bump etcd client to 3.5.1 |
@ahrtr if we have a 3.5.0 client and a 3.5.1 server is this still a problem? |
@neolit123 There is an important fix (see below) being included in etcd 3.5.1 on v3 client, The related issue is etcd/issues/13192. I think @serathius is the best person to answer this question. cc @uthark |
v3.5.1 client includes fix for authority header in HA cluster. Without the fix, client will send invalid authority header when configured with multiple endpoints. This is not a problem when client communicates directly to etcd server, however will not work at all if there is any proxy before etcd. If there is a proxy before etcd there is a high chance that, invalid authority header will result in requests being dropped. This was deemed a critical bug for v3.5.0 as it totally broke some multi node etcd configurations. |
Thanks for the explanation. I guess this means we need to backport the client bump to 1.22. |
The PR pull/105706 is not included in 1.22, so we need to backport it to 1.22, and also bump the etcd client to 3.5.1 for both 1.23 and 1.22. |
@neolit123 should we wait for #106591 to land in master, or do we need to backport #105706 to v1.22 immediately? |
This seems like a change that should be part of 1.23 before release and backported to older releases. Although our HA upgrade e2e tests are not exhibiting the bug for some reason. |
@neolit123 Could you please confirm whether #106591 is planned for after the v1.23 release? |
Hi @Kartik494 , this is bug triage shadow here 👋 |
Looks like the release team did not want this last-minute change in 1.23.0. Backporting it might need discussion.
Seems fine for 1.24.
|
Can this be backported to k8s 1.22.X please? My cluster will not upgrade from 1.21.7 to 1.22.4 due to this issue. It gets stuck waiting for etcd to be ready. Checking the kube-system/etcd logs shows it is trying to contact the zombie etcd node. This is a single-node etcd Kubernetes cluster created using kubeadm a long time ago. I migrated the cluster so that the API server is accessed using a DNS name rather than an IP address, and have since changed the IP address. This is what caused the phantom etcd member. |
I think it should. But it needs confirmation from Jordan and Marek. See discussions in #106591 (comment). |
Thanks for the clarification ! |
Should we wait for this 3.5 backport as well?
etcd-io/etcd#13706
Xref kubernetes/kubeadm#2567
|
If this won't be backported to 1.22 or 1.23, then what is the official upgrade path? We are running 1.21 in an environment where we access the etcd cluster through a proxy. Due to the broken etcd client in those versions we can use neither 1.22 nor 1.23; they can't connect to etcd at all. As far as I know, upgrading directly to 1.24 is not supported. How can we escape from this trap? |
we have kubeadm HA cluster upgrade tests from 1.21 -> 1.22 -> 1.23 -> latest and these are all green. so oddly we are not catching any of these reported problems. it's also not clear if the reporting users are:
kubeadm embeds an etcd client, so this means we have to backport a fix for kubeadm. |
We don't use kubeadm. We have an ALB in front of the etcd cluster, as DNS-based discovery wasn't - or isn't - working properly in an AWS environment. The apiserver container fails to connect to the etcd cluster. We would like to update that "tool". |
ok, forgot the apiserver has the same client too (duh). cc @kubernetes/sig-api-machinery-bugs |
@neolit123 We suffer from the etcd-io/etcd#13192 issue. I think those two had the same root cause, though. (I'm not an etcd developer, so that's just my hunch.) |
@pacoxu @serathius current state of the pending changes is a bit messy here. i don't like upgrading only the client separately from the server. i think we should keep them in sync. |
looks like the 3.5.1 client bump is blocked here. but from discussion on the 3.5.2 server PR, people already want a 3.5.2. |
@neolit123 Honestly, anything above 3.5.0 would make me smile. So 3.5.2 is even better. |
I have already submitted a PR etcd/pull/13737 for etcd/issues/13707. cc @serathius @ptabor |
Hi @neolit123, as #106591 has been merged, can we close this issue? |
Yes, we can close this but we would need a tracking issue for the inbound 3.5.3 bump and we may have to backport it (at least the server bump). Ideally we should have separate tracking for client / server.
/close
|
@neolit123: Closing this issue. |
Is the upgrade to a newer etcd version tracked somewhere? According to this, 3.5.1 is not production-grade due to data corruption. |
In v1.25, #110033 already uses 3.5.4, and cherry-picks to v1.24-v1.22 are open for review. |
What would you like to be added?
When upgrading etcd from an old version to 3.5.0, some zombie members may be displayed. Users can't even remove the zombie members using the command `etcdctl member remove <id>`. Please see the discussion in etcd/issues/13196. A fix for this issue has already been included in etcd 3.5.1, so it'd be better to bump to etcd 3.5.1 and cherry-pick to 1.22.
Why is this needed?
Once etcd is upgraded to 3.5.1, the zombie members can be removed, either automatically or manually.
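(For context, a rough Go sketch of the manual path via the v3 client's Cluster API, the programmatic equivalent of `etcdctl member list` / `member remove`; the endpoint and member ID below are placeholders.)

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// A zombie member shows up in the member list with stale URLs.
	list, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range list.Members {
		fmt.Printf("id=%x name=%q peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
	}

	// Remove it by ID -- the operation `etcdctl member remove <id>`
	// performs, which fails against 3.5.0 for zombie members.
	const zombieID uint64 = 0x1234 // placeholder; use an ID from the list above
	if _, err := cli.MemberRemove(ctx, zombieID); err != nil {
		log.Fatal(err)
	}
}
```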
I see that PR pull/105706 fixed this, but all the go.mod files are still referencing etcd 3.5.0.
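(For illustration only: the bump amounts to updating the etcd client module pins in each affected go.mod, roughly like the following; the exact set of modules per file may differ.)

```
require (
	go.etcd.io/etcd/api/v3 v3.5.1
	go.etcd.io/etcd/client/pkg/v3 v3.5.1
	go.etcd.io/etcd/client/v3 v3.5.1
)
```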