[etcd] Bump etcd client to 3.5.1 #106589

Closed
ahrtr opened this issue Nov 22, 2021 · 37 comments
Assignees
Labels
area/etcd kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/release Categorizes an issue or PR as relevant to SIG Release. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ahrtr
Member

ahrtr commented Nov 22, 2021

What would you like to be added?

When upgrading etcd from an older version to 3.5.0, some zombie members may be displayed. Users can't even remove the zombie members with the command etcdctl member remove <id>. Please see the discussion in etcd/issues/13196.

A fix for this issue is already included in etcd 3.5.1, so it would be better to bump etcd to 3.5.1 and cherry-pick the change to 1.22.

Why is this needed?

Once etcd is upgraded to 3.5.1, the zombie members can be removed either automatically or manually.

I see that PR pull/105706 fixed this, but all the go.mod files still reference etcd 3.5.0.
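
As a rough illustration of the go.mod mismatch described above, the following sketch scans go.mod text for etcd modules and reports those not at the wanted version. The sample go.mod content, the module `example.com/demo`, and the helper name `stale_etcd_modules` are hypothetical and exist only for this example; the sample is not the actual Kubernetes go.mod.

```python
import re

# Hypothetical go.mod excerpt, made up to show a version mismatch.
SAMPLE_GO_MOD = """\
module example.com/demo

go 1.17

require (
    go.etcd.io/etcd/api/v3 v3.5.0
    go.etcd.io/etcd/client/pkg/v3 v3.5.0
    go.etcd.io/etcd/client/v3 v3.5.1
    google.golang.org/grpc v1.40.0
)
"""

# Matches requirement lines such as "go.etcd.io/etcd/client/v3 v3.5.0",
# with an optional leading "require" keyword.
ETCD_REQUIRE = re.compile(
    r"^[ \t]*(?:require[ \t]+)?(go\.etcd\.io/etcd/\S+)[ \t]+(v\S+)",
    re.MULTILINE,
)


def stale_etcd_modules(go_mod_text, wanted):
    """Return {module: version} for etcd modules whose version differs from `wanted`."""
    return {
        module: version
        for module, version in ETCD_REQUIRE.findall(go_mod_text)
        if version != wanted
    }


if __name__ == "__main__":
    stale = stale_etcd_modules(SAMPLE_GO_MOD, "v3.5.1")
    for module, version in sorted(stale.items()):
        print(f"{module} is at {version}, expected v3.5.1")
```

A check like this only reads requirement lines; it does not resolve `replace` directives, so it is a quick sanity check rather than a substitute for the Go toolchain.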

@ahrtr ahrtr added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 22, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 22, 2021
@k8s-ci-robot
Contributor

@ahrtr: The label(s) sig/etcd cannot be applied, because the repository doesn't have them.

In response to this:

/sig etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Kartik494
Contributor

Hi @ahrtr, I am looking into this. Please let me know if #105706 also needs to be backported to v1.22.

@Kartik494
Contributor

/assign

@neolit123
Member

we are in code freeze for 1.23 so this must happen for 1.24.

/milestone v1.24
/sig cluster-lifecycle release
/area etcd

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/release Categorizes an issue or PR as relevant to SIG Release. area/etcd and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 22, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Nov 22, 2021
@neolit123
Member

neolit123 commented Nov 22, 2021

we are in code freeze for 1.23 so this must happen for 1.24.

unless this is considered a critical bug without workarounds?

EDIT

@neolit123
Member

I see that PR pull/105706 fixed this, but all the go.mod files still reference etcd 3.5.0.

right, we updated the server to 3.5.1 recently (i forgot).
but the go.mod files are indeed out of date, which would mean we are still using the 3.5.0 client.
matching server/client seems reasonable for 1.23.

tagging for release triage.

/triage accepted
/milestone v1.23
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 22, 2021
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.24, v1.23 Nov 22, 2021
@neolit123
Member

/retitle [etcd] Bump etcd client to 3.5.1

@k8s-ci-robot k8s-ci-robot changed the title [etcd] Bump etcd 3.5.1 [etcd] Bump etcd client to 3.5.1 Nov 22, 2021
@neolit123
Member

@ahrtr if we have a 3.5.0 client and a 3.5.1 server is this still a problem?

@ahrtr
Member Author

ahrtr commented Nov 22, 2021

@neolit123 An important fix for the v3 client (see below) is included in etcd 3.5.1:
client: Use first endpoint as http2 authority header
Cherry pick "Fix http2 authority header in single endpoint scenario" to release-3.5

The related issue is etcd/issues/13192.

I think @serathius is the best person to answer this question.

cc @uthark

@serathius
Contributor

The v3.5.1 client includes a fix for the authority header in HA clusters. Without the fix, the client sends an invalid authority header when configured with multiple endpoints. This is not a problem when the client talks directly to the etcd server, but it will not work at all if there is any proxy in front of etcd: there is a high chance that the invalid authority header will cause requests to be dropped. This was deemed a critical bug for v3.5.0, as it totally broke some multi-node etcd configurations.
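
To make the proxy failure mode above concrete, here is a simplified sketch of the idea behind "use first endpoint as http2 authority header". It is not etcd's actual client code: the helper names `authority_from_endpoints` and `proxy_accepts`, the example hostnames, and the placeholder invalid value are all assumptions made for illustration.

```python
def authority_from_endpoints(endpoints):
    """Pick the HTTP/2 :authority value from the first endpoint (host:port, no scheme)."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    first = endpoints[0]
    # Drop an optional "scheme://" prefix, then any path after host:port.
    if "://" in first:
        first = first.split("://", 1)[1]
    return first.split("/", 1)[0]


def proxy_accepts(authority, known_hosts):
    """Toy proxy rule: forward only requests whose authority names a known backend."""
    return authority in known_hosts


if __name__ == "__main__":
    # Hypothetical endpoints; any real cluster would use its own names.
    endpoints = [
        "https://etcd-0.example.internal:2379",
        "https://etcd-1.example.internal:2379",
    ]
    known = {"etcd-0.example.internal:2379", "etcd-1.example.internal:2379"}

    good = authority_from_endpoints(endpoints)
    # Stand-in for an invalid authority; not the literal value sent by 3.5.0.
    bad = "not-a-valid-authority"

    print(good, proxy_accepts(good, known))
    print(bad, proxy_accepts(bad, known))
```

A direct connection to etcd can ignore the authority value, which is consistent with the comment above that only setups with a proxy in front of etcd were affected.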

@neolit123
Member

Thanks for the explanation.
/priority critical-urgent

I guess this means we need to backport the client bump to 1.22.

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 23, 2021
@ahrtr
Member Author

ahrtr commented Nov 24, 2021

Thanks for the explanation. /priority critical-urgent

I guess this means we need to backport the client bump to 1.22.

The PR pull/105706 is not included in 1.22, so we need to backport it to 1.22, and also bump etcd client to 3.5.1 for both 1.23 and 1.22.

@Kartik494
Contributor

Kartik494 commented Nov 24, 2021

@neolit123 Should we wait for #106591 to land in master, or do we need to backport #105706 to v1.22 right away?

@neolit123
Member

neolit123 commented Nov 24, 2021 via email

@Kartik494
Contributor

@neolit123 Could you please confirm whether #106591 is planned for after the v1.23 release?

@ritpanjw

ritpanjw commented Dec 1, 2021

Hi @Kartik494, bug triage shadow here 👋
I'd like to check the status of this issue, since release 1.23 is happening this week.
Thank you

@neolit123
Member

neolit123 commented Dec 6, 2021 via email

@peterska

Can this be backported to k8s 1.22.X, please? My cluster will not upgrade from 1.21.7 to 1.22.4 due to this issue. It gets stuck waiting for etcd to be ready. The kube-system/etcd logs show it is trying to contact the zombie etcd node. This is a single-node etcd Kubernetes cluster created with kubeadm a long time ago. I migrated the cluster so that the API server is accessed using a DNS name rather than an IP address, and have since changed the IP address. This is what caused the phantom etcd member.

@pacoxu
Member

pacoxu commented Feb 17, 2022

@Kartik494
Contributor

@pacoxu Could you please let me know if #106591 is targeted for the 1.24 release?

@pacoxu
Member

pacoxu commented Feb 17, 2022

@pacoxu Could you please let me know if #106591 is targeted for the 1.24 release?

I think it should be, but it needs confirmation from Jordon and Marek. See the discussion in #106591 (comment).

@Kartik494
Contributor

Thanks for the clarification !

@neolit123
Member

neolit123 commented Feb 17, 2022 via email

@akunszt

akunszt commented Feb 24, 2022

If this won't be backported to 1.22 or 1.23, then what is the official upgrade path? We are running 1.21 in an environment where we access the etcd cluster through a proxy. Due to the broken etcd client in those versions, we can use neither 1.22 nor 1.23; they can't connect to etcd at all. As far as I know, upgrading directly to 1.24 is not supported. How can we escape from this trap?
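
The trap described above follows from stepping through one minor version at a time, which can be sketched as below. This assumes the one-minor-version-per-upgrade rule the comment refers to; the helper name `upgrade_path` is hypothetical and the function only models version numbers, not any actual upgrade tooling.

```python
def upgrade_path(current, target):
    """List the minor versions to step through, one at a time, from current to target."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are modeled")
    # Every intermediate minor release must be visited; none can be skipped.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    # Going from 1.21 to 1.24 passes through 1.22 and 1.23, the two releases
    # reported in this thread as unable to reach etcd through a proxy.
    print(upgrade_path("1.21", "1.24"))
```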

@neolit123
Member

we have kubeadm HA cluster upgrade tests from 1.21 -> 1.22 -> 1.23 -> latest and these are all green.
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm

so oddly we are not catching any of these reported problems. it's also not clear if the reporting users are:

  • reporting separate problems in etcd (client/server?)
  • using kubeadm or something else

kubeadm embeds an etcd client, so this means we have to backport a fix for kubeadm.
if you are not using kubeadm and e.g. directly using etcdctl or something else you'd have to update that separate tool with a patched client that works.

@akunszt

akunszt commented Feb 24, 2022

We don't use kubeadm. We have an ALB in front of the etcd cluster because DNS-based discovery wasn't (or isn't) working properly in the AWS environment.

The apiserver container fails to connect to the etcd cluster. We would like to update that "tool".

@neolit123
Member

neolit123 commented Feb 24, 2022

ok, forgot the apiserver has the same client too (duh).
instead of 3.5.1 i think we should get the latest 3.5.x (.3 once it's out?) and backport it to the supported k8s versions.
is this all about etcd-io/etcd#13196 or are there other etcd client bugs in question here?

cc @kubernetes/sig-api-machinery-bugs

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. labels Feb 24, 2022
@akunszt

akunszt commented Feb 24, 2022

@neolit123 We suffer from the etcd-io/etcd#13192 issue. I think those two have the same root cause, though. (I'm not an etcd developer, so that's just my hunch.)

@neolit123
Member

neolit123 commented Feb 24, 2022

@pacoxu @serathius current state of the pending changes is a bit messy here.
i personally think we should drop the 3.5.1 PRs and backport PRs and should wait for 3.5.3.
...but given there are a number of users here that want to upgrade ASAP, we can upgrade client/server to 3.5.2 first.
EDIT: looks like #105706 merged and the client / server at master are supposedly at .1. so we now need > 3.5.1. the client is still at 3.5.0?

i don't like upgrading only client separately from server. i think we should keep them in sync.
so we can rename / repurpose this issue.

@neolit123
Member

looks like the 3.5.1 client bump is blocked here:
#106591 (review)
due to etcd-io/etcd#13707

but from discussion on this 3.5.2 server PR, people already want a 3.5.2:
#107917

@akunszt

akunszt commented Feb 24, 2022

@neolit123 Honestly, anything above 3.5.0 would make me smile. So 3.5.2 is even better.

@ahrtr
Member Author

ahrtr commented Feb 24, 2022

looks like the 3.5.1 client bump is blocked here: #106591 (review) due to etcd-io/etcd#13707

but from discussion on this 3.5.2 server PR, people already want a 3.5.2: #107917

I have already submitted a PR etcd/pull/13737 for etcd/issues/13707. cc @serathius @ptabor

@Kartik494
Contributor

Kartik494 commented Mar 31, 2022

Hi @neolit123 as #106591 has been merged, so can we close this issue?
Thanks!!

@neolit123
Member

neolit123 commented Mar 31, 2022 via email

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

In response to this:

Yes, we can close this but we would need a tracking issue for the inbound
3.5.3 bump and we may have to backport it (at least the server bump).

Ideally we should have separate tracking for client / server.

/close


@kopiczko

Is the upgrade to a newer etcd version tracked somewhere? According to this, 3.5.1 is not production-grade due to data corruption.

@pacoxu
Member

pacoxu commented Aug 10, 2022

In v1.25, #110033 already uses 3.5.4. Cherry-picks to v1.22 through v1.24 are open for review.

10 participants