[etcd] Bump etcd client to 3.5.1 #106589

ahrtr · 2021-11-22T06:12:56Z

What would you like to be added?

When etcd is upgraded from an older version to 3.5.0, some zombie members may show up in the member list. Users can't remove these zombie members, even with the command etcdctl member remove <id>. Please see the discussion in etcd/issues/13196.

A fix for this issue is already included in etcd 3.5.1, so it would be better to bump etcd to 3.5.1 and cherry-pick the change to 1.22.

Why is this needed?

Once etcd is upgraded to 3.5.1, the zombie members can be removed, either automatically or manually.

I see that PR pull/105706 fixed this, but all the go.mod files still reference etcd 3.5.0.
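One way to check a go.mod file for stale etcd requirements is sketched below. This is a minimal illustration, not part of the Kubernetes tooling: the helper names (`parseVersion`, `outdatedEtcdModules`) and the sample go.mod content are hypothetical, and the scan only looks at plain `module version` requirement lines (it skips `replace` directives).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v3.5.0" into [3 5 0]. Pre-release and build
// suffixes (after "-" or "+") are ignored in this sketch.
func parseVersion(v string) ([3]int, bool) {
	var out [3]int
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, false
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, false
		}
		out[i] = n
	}
	return out, true
}

// less reports whether version a is strictly older than version b.
func less(a, b [3]int) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// outdatedEtcdModules returns "module version" for every
// go.etcd.io/etcd/... requirement older than minVersion.
func outdatedEtcdModules(goMod, minVersion string) []string {
	minV, ok := parseVersion(minVersion)
	if !ok {
		return nil
	}
	var stale []string
	sc := bufio.NewScanner(strings.NewReader(goMod))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Accept both "require mod ver" and lines inside a require block.
		if len(fields) > 0 && fields[0] == "require" {
			fields = fields[1:]
		}
		if len(fields) < 2 || !strings.HasPrefix(fields[0], "go.etcd.io/etcd/") {
			continue
		}
		// Lines such as "mod => mod ver" fail to parse here and are skipped.
		if v, ok := parseVersion(fields[1]); ok && less(v, minV) {
			stale = append(stale, fields[0]+" "+fields[1])
		}
	}
	return stale
}

func main() {
	// Hypothetical go.mod excerpt, for illustration only.
	sample := `require (
	go.etcd.io/etcd/api/v3 v3.5.0
	go.etcd.io/etcd/client/pkg/v3 v3.5.0
	go.etcd.io/etcd/client/v3 v3.5.1
	google.golang.org/grpc v1.40.0
)`
	for _, m := range outdatedEtcdModules(sample, "v3.5.1") {
		fmt.Println("outdated:", m)
	}
}
```

Running something like this over each go.mod in the repository would list the etcd modules that still need the bump.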

k8s-ci-robot · 2021-11-22T06:14:39Z

@ahrtr: The label(s) sig/etcd cannot be applied, because the repository doesn't have them.

In response to this:

/sig etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Kartik494 · 2021-11-22T09:35:34Z

Hi @ahrtr, I am looking into this. Please let me know if #105706 also needs to be backported to v1.22.

Kartik494 · 2021-11-22T09:56:44Z

/assign

neolit123 · 2021-11-22T16:34:02Z

we are in code freeze for 1.23 so this must happen for 1.24.

/milestone v1.24
/sig cluster-lifecycle release
/area etcd

neolit123 · 2021-11-22T16:37:26Z

we are in code freeze for 1.23 so this must happen for 1.24.

unless this is considered a critical bug without workarounds?

EDIT

neolit123 · 2021-11-22T17:20:27Z

I see that PR pull/105706 fixed this, but all the go.mod files still reference etcd 3.5.0.

right, we updated the server to 3.5.1 recently (i forgot).
but the go.mod files are indeed out of date, which would mean we are still using the 3.5.0 client.
matching server/client seems reasonable for 1.23.

tagging for release triage.

/triage accepted
/milestone v1.23
/priority important-soon

neolit123 · 2021-11-22T17:23:05Z

/retitle [etcd] Bump etcd client to 3.5.1

neolit123 · 2021-11-22T17:24:56Z

@ahrtr if we have a 3.5.0 client and a 3.5.1 server is this still a problem?

ahrtr · 2021-11-22T22:43:11Z

@neolit123 etcd 3.5.1 includes an important fix for the v3 client (see below):

- client: Use first endpoint as http2 authority header
- Cherry pick "Fix http2 authority header in single endpoint scenario" to release-3.5

The related issue is etcd/issues/13192.

I think @serathius is the best person to answer this question.

cc @uthark

serathius · 2021-11-23T10:30:23Z

The v3.5.1 client includes a fix for the authority header in HA clusters. Without the fix, the client sends an invalid authority header when it is configured with multiple endpoints. This is not a problem when the client talks directly to the etcd server, but it will not work at all if there is any proxy in front of etcd: with a proxy in the path, there is a high chance that the invalid authority header will result in requests being dropped. This was deemed a critical bug for v3.5.0, as it totally broke some multi-node etcd configurations.
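To make the idea behind the fix title ("Use first endpoint as http2 authority header") concrete, here is a rough sketch of deriving an authority value from the first configured endpoint. It is not the etcd client's actual code; the function name `authorityFor` and the example hostnames are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// authorityFor returns the host[:port] part of the first endpoint,
// which is the kind of value a proxy expects in the HTTP/2
// :authority header. Hypothetical helper, not an etcd API.
func authorityFor(endpoints []string) (string, error) {
	if len(endpoints) == 0 {
		return "", fmt.Errorf("no endpoints configured")
	}
	ep := endpoints[0]
	// Endpoints without a scheme are treated as bare host:port.
	if !strings.Contains(ep, "://") {
		return ep, nil
	}
	u, err := url.Parse(ep)
	if err != nil {
		return "", err
	}
	return u.Host, nil
}

func main() {
	// Example endpoints; the hostnames are made up.
	endpoints := []string{
		"https://etcd-0.example.internal:2379",
		"https://etcd-1.example.internal:2379",
	}
	authority, err := authorityFor(endpoints)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("authority:", authority)
}
```

The point is only that the header carries a valid host name a proxy can route on, regardless of how many endpoints are configured.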

neolit123 · 2021-11-23T13:40:55Z

Thanks for the explanation.
/priority critical-urgent

I guess this means we need to backport the client bump to 1.22.

ahrtr · 2021-11-24T01:15:08Z

Thanks for the explanation. /priority critical-urgent

I guess this means we need to backport the client bump to 1.22.

The PR pull/105706 is not included in 1.22, so we need to backport it to 1.22, and also bump etcd client to 3.5.1 for both 1.23 and 1.22.

Kartik494 · 2021-11-24T05:19:49Z

@neolit123 Should we wait for #106591 to land in master, or do we need to backport #105706 to v1.22 right away?

neolit123 · 2021-11-24T13:45:44Z

This seems like a change that should be part of 1.23 before release and backported to older releases. Although our HA upgrade e2e tests are not exhibiting the bug for some reason.

Kartik494 · 2021-11-29T08:24:30Z

@neolit123 Could you please confirm whether #106591 is planned for after the v1.23 release?

ritpanjw · 2021-12-01T17:30:15Z

Hi @Kartik494, bug triage shadow here 👋
I'd like to check the status of this issue, since release 1.23 is happening this week.
Thank you

neolit123 · 2021-12-06T14:05:22Z

Looks like the release team did not want this last minute change in 1.23.0. Backporting it might need discussion. Seems fine for 1.24.

peterska · 2021-12-13T02:31:42Z

Can this be backported to k8s 1.22.X please? My cluster will not upgrade from 1.21.7 to 1.22.4 due to this issue. It gets stuck waiting for etcd to be ready. Checking the kube-system/etcd logs shows it is trying to contact the zombie etcd node. This is a single-node etcd Kubernetes cluster created using kubeadm a long time ago. I migrated the cluster so that the API server is accessed using a DNS name rather than an IP address, and have since changed the IP address. This is what caused the phantom etcd member.

pacoxu · 2022-02-17T03:24:44Z

Opened cherry-pick PR #108176 ("Automated cherry pick of #105706: Upgrade etcd to 3.5.1") for 1.23. It only backports the server fix for etcd-io/etcd#13196 ("etcd 3.5.0 resurrects ancient (unremovable) members").
To backport the client change in 3.5.1 as well, we need to cherry-pick #106591 ("Updated Etcd Version to 3.5.1 in go.mod") to 1.23 and 1.22 after it is merged.

Kartik494 · 2022-02-17T06:19:00Z

@pacoxu Could you please let me know if #106591 is targeted for the 1.24 release?

pacoxu · 2022-02-17T06:38:28Z

@pacoxu Could you please let me know if #106591 is targeted for the 1.24 release?

I think it should be. But it needs confirmation from Jordan and Marek. See the discussion in #106591 (comment).

Kartik494 · 2022-02-17T07:02:07Z

Thanks for the clarification !

neolit123 · 2022-02-17T13:16:52Z

Should we wait for this 3.5 backport as well? etcd-io/etcd#13706 Xref kubernetes/kubeadm#2567

akunszt · 2022-02-24T16:19:46Z

If this won't be backported to 1.22 or 1.23, then what is the official upgrade path? We are running 1.21 in an environment where we access the etcd cluster through a proxy. Due to the broken etcd client in those versions, we can use neither 1.22 nor 1.23; they can't connect to etcd at all. As far as I know, upgrading directly to 1.24 is not supported. How can we escape from this trap?

neolit123 · 2022-02-24T16:25:37Z

we have kubeadm HA cluster upgrade tests from 1.21 -> 1.22 -> 1.23 -> latest and these are all green.
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm

so oddly we are not catching any of these reported problems. it's also not clear if the reporting users are:

- reporting separate problems in etcd (client/server?)
- using kubeadm or something else

kubeadm embeds an etcd client, so this means we have to backport a fix for kubeadm.
if you are not using kubeadm and e.g. directly using etcdctl or something else you'd have to update that separate tool with a patched client that works.

akunszt · 2022-02-24T16:32:37Z

We don't use kubeadm. We have an ALB in front of the etcd cluster, as DNS-based discovery wasn't (or isn't) working properly in the AWS environment.

The apiserver container fails to connect to the etcd cluster. We would like to update that "tool".

neolit123 · 2022-02-24T16:46:30Z

ok, forgot the apiserver has the same client too (duh).
instead of 3.5.1 i think we should get the latest 3.5.x (.3 once it's out?) and backport it to the supported k8s versions.
is this all about etcd-io/etcd#13196 or are there other etcd client bugs in question here?

cc @kubernetes/sig-api-machinery-bugs

akunszt · 2022-02-24T16:49:58Z

@neolit123 We suffer from the etcd-io/etcd#13192 issue. I think those two have the same root cause though. (I'm not an etcd developer, so that's just my hunch.)

neolit123 · 2022-02-24T16:54:41Z

@pacoxu @serathius current state of the pending changes is a bit messy here.
i personally think we ~~should drop the 3.5.1 PRs and backport PRs and~~ should wait for 3.5.3.
...but given there are a number of users here that want to upgrade ASAP, we can upgrade client/server to 3.5.2 first.
EDIT: looks like #105706 merged and the ~~client /~~ server at master are supposedly at .1. so we now need > 3.5.1. the client is still at 3.5.0?

i don't like upgrading only client separately from server. i think we should keep them in sync.
so we can rename / repurpose this issue.

neolit123 · 2022-02-24T17:06:27Z

looks like the 3.5.1 client bump is blocked here:
#106591 (review)
due to etcd-io/etcd#13707

but from discussion on this 3.5.2 server PR, people already want a 3.5.2:
#107917

akunszt · 2022-02-24T17:13:03Z

@neolit123 Honestly, anything above 3.5.0 would make me smile. So 3.5.2 is even better.

ahrtr · 2022-02-24T22:57:21Z

looks like the 3.5.1 client bump is blocked here: #106591 (review) due to etcd-io/etcd#13707

but from discussion on this 3.5.2 server PR, people already want a 3.5.2: #107917

I have already submitted a PR etcd/pull/13737 for etcd/issues/13707. cc @serathius @ptabor

Kartik494 · 2022-03-31T06:37:12Z

Hi @neolit123 as #106591 has been merged, so can we close this issue?
Thanks!!

neolit123 · 2022-03-31T12:30:55Z

Yes, we can close this but we would need a tracking issue for the inbound 3.5.3 bump and we may have to backport it (at least the server bump). Ideally we should have separate tracking for client / server. /close

k8s-ci-robot · 2022-03-31T12:31:17Z

@neolit123: Closing this issue.

In response to this:

Yes, we can close this but we would need a tracking issue for the inbound
3.5.3 bump and we may have to backport it (at least the server bump).

Ideally we should have separate tracking for client / server.

/close

kopiczko · 2022-08-10T13:57:13Z

Is upgrade to newer etcd version tracked somewhere? According to this 3.5.1 is not production-grade due to data corruption.

pacoxu · 2022-08-10T13:59:10Z

In v1.25, #110033 already uses 3.5.4. Cherry-picks to v1.24 through v1.22 are open for review.

ahrtr added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 22, 2021

k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 22, 2021

k8s-ci-robot assigned Kartik494 Nov 22, 2021

Kartik494 mentioned this issue Nov 22, 2021

Updated Etcd Version to 3.5.1 in go.mod #106591

Merged

k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/release Categorizes an issue or PR as relevant to SIG Release. area/etcd and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 22, 2021

k8s-ci-robot added this to the v1.24 milestone Nov 22, 2021

k8s-ci-robot modified the milestones: v1.24, v1.23 Nov 22, 2021

k8s-ci-robot changed the title ~~[etcd] Bump etcd 3.5.1~~ [etcd] Bump etcd client to 3.5.1 Nov 22, 2021

k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 23, 2021

ptabor mentioned this issue Jan 20, 2022

etcd 3.5.0 resurrects ancient (unremovable) members etcd-io/etcd#13196

Closed

serathius mentioned this issue Feb 17, 2022

Add serathius to etcd image owners #108179

Merged

neolit123 mentioned this issue Feb 17, 2022

Automated cherry pick of #105706: Upgrade etcd to 3.5.1 #108176

Closed

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. labels Feb 24, 2022

k8s-ci-robot closed this as completed Mar 31, 2022