
Leader Election taking between 3 to 15 minutes in openshift deployment when any single node - even non leader - is evicted starting from version 1.13.x #15231

Closed
TheDevOps opened this issue Nov 2, 2022 · 8 comments

Comments


TheDevOps commented Nov 2, 2022

Overview of the Issue

We are running a 3-node Consul server deployment in an on-prem OpenShift cluster using the official Helm chart from https://github.com/hashicorp/consul-k8s/tree/main/charts/consul
We noticed that after updating Consul to version 1.13.1, the loss of any single node, even a non-leader node (e.g. because of a node eviction or a rolling update of the StatefulSet), makes the whole Consul cluster unstable: it keeps losing the leader and re-electing a new one for anything between 3 and 15 minutes before it becomes stable again, at which point it remains stable until another node is lost.
Up to version 1.12.5 this took around 2 to 10 seconds at most, which our clients are prepared for with short-term fallback caches, so it had zero impact for us.
Because of this it is currently impossible for us to update beyond 1.12.5, since a single replica could be evicted at any time and leave the whole cluster unstable for an unreasonably long time, impacting our Java clients.

Reproduction Steps

To be clear up front: this happens in an on-prem OpenShift cluster, and I cannot entirely rule out that it is not immediately reproducible everywhere.

To reproduce the following tools are used:

  • openshift cluster running version 4.10.34 with kubernetes version v1.23.5+8471591 (happens also for openshift 4.9 and kubernetes v1.22.8+9e95cb9)
  • oc client version 4.11.0-0.okd-2022-08-20-022919
  • helm version v3.10.0
  • consul version 1.13.3 (happens also with 1.13.1 or 1.14.0-beta1, but does NOT happen with 1.12.2 or 1.12.5)
  • consul helm chart from https://github.com/hashicorp/consul-k8s/tree/main/charts/consul with version 0.49.0
  • Make sure there are no previous persistent volumes left in OpenShift and the installation happens from a completely blank state

First, provide a Helm values file adjusted for an OpenShift deployment with some security rules and so on; see values-consul-helm.yaml.txt (it had to be renamed to .yaml.txt to upload it...).
Note: there are two placeholders, <url> and <secretname>, inside for obvious reasons; adjust them as needed. The rest is exactly as used for our tests.
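
As an illustration only, here is a minimal sketch of the kind of OpenShift-related overrides such a values file contains - this is NOT the attached file, the key names are simply taken from the chart's documented values, and the real file additionally contains the security settings and the placeholders mentioned above:

cat > values-consul-helm.yaml <<'EOF'
# Hypothetical minimal example - the attached values-consul-helm.yaml.txt is the authoritative version.
global:
  name: consul-helm        # matches the pod names consul-helm-server-0/1/2 seen below
  openshift:
    enabled: true          # enables the chart's OpenShift-specific security context handling
server:
  replicas: 3              # 3-node server cluster as described above
ui:
  enabled: true            # the Consul UI referenced in the screenshots
EOF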

Next run

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm upgrade --install -f values-consul-helm.yaml consul-helm hashicorp/consul -n <namespace>

to deploy the cluster, which then shows up in OpenShift as

[screenshot: the three consul-helm-server pods running in the OpenShift console]

and in the Consul UI a healthy cluster state where, in this example, "consul-helm-server-1" became the leader

[screenshot: Consul UI showing consul-helm-server-1 as the current leader]

(Note: I forgot to take the screenshot before deleting one node, namely "consul-helm-server-2", so its IP no longer matches, since OpenShift assigned a new one when recreating it after the delete. Immediately after the initial deployment it obviously did match.)
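
For reference, roughly the same healthy state can also be verified from the CLI instead of the screenshots (the label selector and the container name "consul" are assumptions based on the chart defaults):

# All three server pods should be Running and Ready after the deploy.
oc get pods -n <namespace> -l app=consul
# The raft peer list should show three voters with exactly one leader.
oc exec -n <namespace> consul-helm-server-1 -c consul -- consul operator raft list-peers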

We originally noticed the issue when, due to a disk problem, OpenShift automatically evicted one replica from a node and we suddenly encountered issues in our Java clients that persisted for multiple minutes. So, to reproduce this, "force" an eviction of one Consul server by simply deleting any pod, even a non-leader. In the case of the logs attached in the "Log Fragments" section this was done for "consul-helm-server-2", which was a follower, but it works for any node, including the current leader.
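
A minimal sketch of forcing such an eviction, using the pod and namespace placeholders from this deployment:

# Delete one follower pod; the StatefulSet controller recreates it right away.
oc delete pod consul-helm-server-2 -n <namespace>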

Now the deleted pod gets recreated by OpenShift, starts, and tries to trigger a leader election, which initially does not work because it is not yet a voter. Eventually it becomes a voter, and from then on the cluster keeps electing a leader, losing it immediately again, and starting over. This continues for anything between 3 and 15 minutes, for a seemingly random number of rounds, until eventually the cluster elects a leader and stays stable again.

[screenshot: Consul UI during the repeated leader elections after the pod was recreated]

Deleting another node keeps repeating this and there is never any improvement.

In further tests we also found the same issue when updating from 1.12.5 to 1.13.3 via the OpenShift rolling update mechanism, but as explained above it can be reproduced even with a completely new cluster without any clients connected.

In the "log fragments" section I've attached logs from all 3 consul server nodes as well as the output of "consul operator raft list-peers -stale" on one node several times during the recovery process

Consul info for both Client and Server

Client info
We are not using actual Consul clients, only Java applications using https://github.com/Ecwid/consul-api as the client library.
Since the issue can be reproduced without a single client connecting, I don't think this is relevant to the problem anyway.
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = 
        revision = b29e5894
        version = 1.13.3
        version_metadata = 
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 10.128.13.223:8300
        server = true
raft:
        applied_index = 218
        commit_index = 218
        fsm_pending = 0
        last_contact = 57.449617ms
        last_log_index = 218
        last_log_term = 38
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:2e3fe218-6337-8425-a4a4-32ca569dacc2 Address:10.129.38.31:8300} {Suffrage:Voter ID:a784393f-e56f-55d7-23d4-38d90764eaea Address:10.128.13.223:8300} {Suffrage:Voter ID:545cf64c-1a8f-7997-d354-cb381c8feacc Address:10.131.11.181:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 38
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 129
        max_procs = 4
        os = linux
        version = go1.18.1
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 31
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 11
        members = 3
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 7
        members = 3
        query_queue = 0
        query_time = 1

Operating system and Environment details

  • Openshift 4.10.34 with kubernetes version v1.23.5+8471591
  • CRI-O (1.23.3) container runtime
  • Red Hat Enterprise Linux CoreOS 410.84.202209161756-0 (kernel 4.18.0-305.62.1.el8_4.x86_64) compute nodes
  • VMware ESXi as host system
  • Consul server 1.13.3
  • Consul helm chart 0.49.0
  • helm 3.10.0

Log Fragments

The following logs are debug level logs from all 3 consul server nodes during the failover process

consul-helm-server-0.log
consul-helm-server-1.log
consul-helm-server-2.log

The following log is the output of the "consul operator raft list-peers -stale" command multiple times before, during and after the whole failover

consul-list-peers.log

If you need anything else please just let me know!

@TheDevOps (Author)

After digging around a bit more, this is very likely caused by hashicorp/raft#524.
Can someone confirm or deny this, and maybe also provide a timeline for when and in which Consul versions the raft fix will be included?


nvx commented Nov 7, 2022

It looks like PR #15175 merged bd3451f into the release/1.13.x branch already, so I'd expect the next point release (1.13.4) to contain the fix. No idea when that's scheduled to come out mind you - hopefully sooner rather than later due to the severity of this bug.

@jkirschner-hashicorp (Contributor)

My understanding is that we currently intend to release 1.13.4 (which includes PR #15175) in the window of Nov 30 - Dec 2. If I become aware of that changing substantially, I'll post here. Feel free to reply here if you haven't heard anything by the end of Dec 2 and 1.13.4 hasn't been released yet.

I'll leave this open until one of the posters on this issue has confirmed that 1.13.4 improves their situation.


nvx commented Nov 21, 2022

Just to confirm, this fix is also already in 1.14.0 right? The changelog didn't specifically mention it, but it looks like the version was bumped in the go.mod for that version.

@jkirschner-hashicorp (Contributor)

Yes - the fix is already in 1.14.0! Here's how I checked:

That fix went into main with this PR: #14897

Looking at the changelog entries from that PR, like "raft: Fix a race condition where the snapshot file is closed without being opened", I see them in the Consul 1.14.0 release.
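
For reference, one way to reproduce that check from the command line (a rough sketch, assuming a local clone of the hashicorp/consul repository):

# Print the hashicorp/raft dependency pinned by the v1.14.0 tag.
git -C consul show v1.14.0:go.mod | grep "hashicorp/raft "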

@TheDevOps (Author)

From a quick test this issue indeed looks to be resolved with 1.14.0.
For now we would still prefer to wait for 1.13.4 rather than move to the new major release this close to the end of the year, which is a pretty critical time for us; Dec 2 at the latest also works well for us.
I'll update once 1.13.4 is released and tested - probably on Dec 5 - but as said, since the issue no longer happens with 1.14.0 in the test setup, I'm confident it will be fixed in 1.13.4 as well.

@TheDevOps (Author)

I finally got around to deploying and testing 1.13.4 today, sorry for the wait, and can confirm everything is stable again now. On loss of a node a new leader is elected once, usually within <2 seconds, and stays leader as long as no other eviction happens. The issue can be closed!


david-yu commented Dec 5, 2022

Thanks @TheDevOps for confirming!

@david-yu david-yu closed this as completed Dec 5, 2022