
Single Vault follower restart causes election even with established quorum #14153

Closed
justingood opened this issue Feb 18, 2022 · 4 comments
Labels: core/ha, k8s, storage/raft


@justingood

Describe the bug
We're encountering an odd case where, in a 3-node cluster, restarting a single follower causes it to replace the Vault leader. We've deployed Vault to Kubernetes using the Helm chart, with the integrated Raft storage backend. The restart was performed with kubectl delete pod.

Details

Debug logs from the leader

2022-02-18T10:04:17.403Z [INFO]  storage.raft: aborting pipeline replication: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}"
2022-02-18T10:04:17.491Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error=EOF
2022-02-18T10:04:17.505Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:17.508Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp 172.16.32.53:8201: connect: connection refused"
2022-02-18T10:04:17.583Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:17.629Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:17.903Z [WARN]  storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=500.514487ms
2022-02-18T10:04:18.383Z [WARN]  storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=980.221753ms
2022-02-18T10:04:18.852Z [WARN]  storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=1.448886321s
2022-02-18T10:04:19.351Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=1.948375865s
2022-02-18T10:04:19.785Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=2.381793723s
2022-02-18T10:04:20.274Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=2.871663404s
2022-02-18T10:04:20.662Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:20.662Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:20.742Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:20.768Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=3.364927462s
2022-02-18T10:04:20.778Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:21.246Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=3.843671429s
2022-02-18T10:04:21.723Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=4.319956192s
2022-02-18T10:04:22.210Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=4.807359124s
2022-02-18T10:04:22.675Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=5.272142026s
2022-02-18T10:04:23.152Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=5.749080432s
2022-02-18T10:04:23.640Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=6.237388002s
2022-02-18T10:04:23.734Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:23.734Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:23.814Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:23.858Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:24.083Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=6.680564421s
2022-02-18T10:04:24.551Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=7.148301984s
2022-02-18T10:04:24.992Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=7.588839969s
2022-02-18T10:04:25.465Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=8.062711818s
2022-02-18T10:04:25.963Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=8.560429476s
2022-02-18T10:04:26.431Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=9.027937456s
2022-02-18T10:04:26.806Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:26.918Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=9.515057505s
2022-02-18T10:04:26.939Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:27.383Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=9.979825297s
2022-02-18T10:04:27.852Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=10.449708301s
2022-02-18T10:04:28.307Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=10.903817242s
2022-02-18T10:04:28.777Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=11.374681422s
2022-02-18T10:04:29.255Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=11.852191499s
2022-02-18T10:04:29.693Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=12.28998255s
2022-02-18T10:04:29.878Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:29.878Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:30.010Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:30.058Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:30.128Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=12.725059148s
2022-02-18T10:04:30.593Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=13.190626414s
2022-02-18T10:04:31.066Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=13.663332836s
2022-02-18T10:04:31.554Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=14.151433352s
2022-02-18T10:04:31.992Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=14.589504492s
2022-02-18T10:04:32.474Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=15.071567854s
2022-02-18T10:04:32.950Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:32.962Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=15.558984804s
2022-02-18T10:04:33.181Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:33.440Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=16.037516496s
2022-02-18T10:04:33.882Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=16.479150749s
2022-02-18T10:04:34.375Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=16.972016773s
2022-02-18T10:04:34.846Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=17.443499502s
2022-02-18T10:04:35.341Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=17.938211882s
2022-02-18T10:04:35.789Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=18.386742135s
2022-02-18T10:04:36.022Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:36.022Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp 172.16.32.53:8201: connect: no route to host"
2022-02-18T10:04:36.226Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:36.237Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=18.834601049s
2022-02-18T10:04:36.278Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:36.438Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:36.440Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:36.595Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:36.597Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:36.701Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=19.29823513s
2022-02-18T10:04:37.082Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:37.085Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:37.142Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:37.144Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:37.162Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=19.758994095s
2022-02-18T10:04:37.647Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=20.244676534s
2022-02-18T10:04:37.893Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:37.895Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:38.146Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=20.743018444s
2022-02-18T10:04:38.517Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:38.519Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:38.604Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=21.201239805s
2022-02-18T10:04:39.100Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=21.697686171s
2022-02-18T10:04:39.283Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:39.285Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:39.591Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=22.188032397s
2022-02-18T10:04:40.053Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=22.650398418s
2022-02-18T10:04:40.524Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=23.121134016s
2022-02-18T10:04:40.999Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=23.596327017s
2022-02-18T10:04:41.146Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:41.148Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}" error="dial tcp: lookup vault-2.vault-internal on 10.100.10.100:53: no such host"
2022-02-18T10:04:41.476Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=24.073632945s
2022-02-18T10:04:41.937Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=24.534144542s
2022-02-18T10:04:42.011Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:42.014Z [ERROR] storage.raft: failed to heartbeat to: peer=vault-2.vault-internal:8201 error="dial tcp 172.16.38.163:8201: connect: connection refused"
2022-02-18T10:04:42.397Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=24.994504709s
2022-02-18T10:04:42.833Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=25.430536679s
2022-02-18T10:04:43.325Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=25.921880386s
2022-02-18T10:04:43.814Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=26.411563481s
2022-02-18T10:04:44.278Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=26.874840614s
2022-02-18T10:04:44.766Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=27.363277322s
2022-02-18T10:04:45.238Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=27.835264267s
2022-02-18T10:04:45.717Z [DEBUG] core.cluster-listener: performing server cert lookup
2022-02-18T10:04:45.736Z [DEBUG] storage.raft.raft-net: accepted connection: local-address=vault-0.vault-internal:8201 remote-address=172.16.38.163:51474
2022-02-18T10:04:45.737Z [WARN]  storage.raft: rejecting vote request since we have a leader: from=vault-2.vault-internal:8201 leader=vault-0.vault-internal:8201
2022-02-18T10:04:45.738Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=28.335421268s
2022-02-18T10:04:46.201Z [DEBUG] storage.raft: failed to contact: server-id=aed4746c-6e41-ba17-dd97-848b8b58d11b time=28.798085525s
2022-02-18T10:04:46.327Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:46.336Z [DEBUG] core.cluster-listener: performing client cert lookup
2022-02-18T10:04:46.344Z [ERROR] storage.raft: peer has newer term, stopping replication: peer="{Voter aed4746c-6e41-ba17-dd97-848b8b58d11b vault-2.vault-internal:8201}"
2022-02-18T10:04:46.344Z [INFO]  storage.raft: entering follower state: follower="Node at vault-0.vault-internal:8201 [Follower]" leader=
2022-02-18T10:04:46.344Z [WARN]  core: leadership lost, stopping active operation
2022-02-18T10:04:46.344Z [INFO]  core: pre-seal teardown starting
2022-02-18T10:04:46.344Z [INFO]  storage.raft: aborting pipeline replication: peer="{Voter 8a322099-8e67-3c92-3687-fd0bd8bd8b10 vault-1.vault-internal:8201}"
2022-02-18T10:04:46.344Z [DEBUG] storage.raft.autopilot: state update routine is now stopped
2022-02-18T10:04:46.344Z [DEBUG] storage.raft.autopilot: autopilot is now stopped
2022-02-18T10:04:46.844Z [INFO]  core: stopping raft active node
2022-02-18T10:04:46.844Z [DEBUG] expiration: stop triggered
2022-02-18T10:04:46.845Z [DEBUG] expiration: finished stopping
2022-02-18T10:04:46.845Z [INFO]  rollback: stopping rollback manager
2022-02-18T10:04:46.845Z [INFO]  core: pre-seal teardown complete
2022-02-18T10:04:46.863Z [ERROR] core: clearing leader advertisement failed: error="node is not the leader"
2022-02-18T10:04:46.863Z [ERROR] core: unlocking HA lock failed: error="node is not the leader"
2022-02-18T10:04:46.939Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://vault-0.vault-internal:8201 active_redirect_addr=https://172.16.23.221:8200
2022-02-18T10:04:46.939Z [DEBUG] core: refreshing forwarding connection
2022-02-18T10:04:46.939Z [DEBUG] core: clearing forwarding clients
2022-02-18T10:04:46.939Z [DEBUG] core: done clearing forwarding clients
2022-02-18T10:04:46.939Z [DEBUG] core: done refreshing forwarding connection
2022-02-18T10:04:46.940Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-0.vault-internal:8201 alpn=req_fw_sb-act_v1 host=fw-ba9cdaac-d2c2-350e-2008-50bfea8641bf
2022-02-18T10:04:46.942Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
2022-02-18T10:04:46.942Z [DEBUG] core.cluster-listener: error handshaking cluster connection: error="unsupported protocol"
2022-02-18T10:04:46.943Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""
2022-02-18T10:04:47.146Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=["req_fw_sb-act_v1"]
2022-02-18T10:04:47.146Z [DEBUG] core.cluster-listener: error handshaking cluster connection: error="unsupported protocol"
2022-02-18T10:04:47.235Z [DEBUG] storage.raft: lost leadership because received a requestVote with a newer term
2022-02-18T10:04:47.387Z [DEBUG] core.cluster-listener: performing server cert lookup
2022-02-18T10:04:47.402Z [DEBUG] storage.raft.raft-net: accepted connection: local-address=vault-0.vault-internal:8201 remote-address=172.16.38.163:51514
2022-02-18T10:04:49.440Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://vault-2.vault-internal:8201 active_redirect_addr=https://172.16.38.163:8200
2022-02-18T10:04:49.440Z [DEBUG] core: refreshing forwarding connection
2022-02-18T10:04:49.440Z [DEBUG] core: clearing forwarding clients
2022-02-18T10:04:49.440Z [DEBUG] core: done clearing forwarding clients
2022-02-18T10:04:49.440Z [DEBUG] core: forwarding: stopping heartbeating
2022-02-18T10:04:49.440Z [DEBUG] core: done refreshing forwarding connection
2022-02-18T10:04:49.440Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-2.vault-internal:8201 alpn=req_fw_sb-act_v1 host=fw-7012efed-b8b7-d213-4d80-7e33705a8c05
2022-02-18T10:04:49.451Z [DEBUG] core.cluster-listener: performing client cert lookup

Debug logs from the restarted node

==> Vault shutdown triggered
2022-02-18T10:04:17.396Z [DEBUG] would have sent systemd notification (systemd not present): notification=STOPPING=1
2022-02-18T10:04:17.397Z [DEBUG] core: shutdown called
2022-02-18T10:04:17.397Z [INFO]  core: marked as sealed
2022-02-18T10:04:17.397Z [DEBUG] core: clearing forwarding clients
2022-02-18T10:04:17.397Z [DEBUG] core: done clearing forwarding clients
2022-02-18T10:04:17.397Z [DEBUG] core: finished triggering standbyStopCh for runStandby
2022-02-18T10:04:17.397Z [DEBUG] core: shutting down periodic key rotation checker
2022-02-18T10:04:17.397Z [DEBUG] core: shutting down periodic leader refresh
2022-02-18T10:04:17.397Z [DEBUG] core: shutting down leader elections
2022-02-18T10:04:17.397Z [DEBUG] core: forwarding: stopping heartbeating
2022-02-18T10:04:17.397Z [DEBUG] core: runStandby done
2022-02-18T10:04:17.398Z [ERROR] storage.raft.raft-net: failed to accept connection: error="Raft RPC layer closed"
2022-02-18T10:04:17.398Z [INFO]  core: stopping cluster listeners
2022-02-18T10:04:17.398Z [INFO]  core.cluster-listener: forwarding rpc listeners stopped
2022-02-18T10:04:17.402Z [ERROR] storage.raft.raft-net: failed to decode incoming command: error="transport shutdown"
rpc error: code = Unknown desc = Error: No such container: 0969cb9e26d1f5af9f2a69618f0c628e23f2ddabb524fdde522d294ebdccd388



==> Vault server configuration:

             Api Address: https://172.16.38.163:8200
                     Cgo: disabled
         Cluster Address: https://vault-2.vault-internal:8201
              Go Version: go1.17.5
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: debug
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: raft (HA available)
                 Version: Vault v1.9.2
             Version Sha: f4c6d873e2767c0d6853b5d9ffc77b0d297bfbdf

==> Vault server started! Log data will stream in below:

2022-02-18T10:04:39.391Z [INFO]  proxy environment: http_proxy="\"\"" https_proxy="\"\"" no_proxy="\"\""
2022-02-18T10:04:39.392Z [WARN]  storage.raft.fsm: raft FSM db file has wider permissions than needed: needed=-rw------- existing=-rw-rw----
2022-02-18T10:04:39.439Z [DEBUG] storage.raft.fsm: time to open database: elapsed=47.148699ms path=/vault/data/vault.db
2022-02-18T10:04:44.303Z [DEBUG] service_registration.kubernetes: "namespace": "vault"
2022-02-18T10:04:44.303Z [DEBUG] service_registration.kubernetes: "pod_name": "vault-2"
2022-02-18T10:04:44.303Z [DEBUG] seal.awskms: adding web identity provider: roleARN=arn:aws:iam::045466806556:role/vault20211124104445947400000001
2022-02-18T10:04:44.610Z [DEBUG] core: set config: sanitized config="{\"api_addr\":\"\",\"cache_size\":0,\"cluster_addr\":\"https://vault-2.vault-internal:8201\",\"cluster_cipher_suites\":\"\",\"cluster_name\":\"vault\",\"default_lease_ttl\":0,\"default_max_request_duration\":0,\"disable_cache\":false,\"disable_clustering\":false,\"disable_indexing\":false,\"disable_mlock\":true,\"disable_performance_standby\":false,\"disable_printable_check\":false,\"disable_sealwrap\":false,\"disable_sentinel_trace\":false,\"enable_response_header_hostname\":false,\"enable_response_header_raft_node_id\":false,\"enable_ui\":true,\"listeners\":[{\"config\":{\"address\":\"[::]:8200\",\"cluster_address\":\"[::]:8201\",\"telemetry\":[{\"unauthenticated_metrics_access\":true}],\"tls_cert_file\":\"/vault/tls/tls.crt\",\"tls_disable_client_certs\":true,\"tls_key_file\":\"/vault/tls/tls.key\"},\"type\":\"tcp\"}],\"log_format\":\"unspecified\",\"log_level\":\"\",\"max_lease_ttl\":0,\"pid_file\":\"/vault/pids/vault.pid\",\"plugin_directory\":\"\",\"raw_storage_endpoint\":false,\"seals\":[{\"disabled\":false,\"type\":\"awskms\"}],\"service_registration\":{\"type\":\"kubernetes\"},\"storage\":{\"cluster_addr\":\"\",\"disable_clustering\":false,\"redirect_addr\":\"\",\"type\":\"raft\"},\"telemetry\":{\"add_lease_metrics_namespace_labels\":false,\"circonus_api_app\":\"\",\"circonus_api_token\":\"\",\"circonus_api_url\":\"\",\"circonus_broker_id\":\"\",\"circonus_broker_select_tag\":\"\",\"circonus_check_display_name\":\"\",\"circonus_check_force_metric_activation\":\"\",\"circonus_check_id\":\"\",\"circonus_check_instance_id\":\"\",\"circonus_check_search_tag\":\"\",\"circonus_check_tags\":\"\",\"circonus_submission_interval\":\"\",\"circonus_submission_url\":\"\",\"disable_hostname\":true,\"dogstatsd_addr\":\"\",\"dogstatsd_tags\":null,\"lease_metrics_epsilon\":3600000000000,\"maximum_gauge_cardinality\":500,\"metrics_prefix\":\"\",\"num_lease_metrics_buckets\":168,\"prometheus_retention_time\":30000000000,\"stackdriver_debug_logs\":false,\"stackdriver_location\":\"\",\"stackdriver_namespace\":\"\",\"stackdriver_project_id\":\"\",\"statsd_address\":\"\",\"statsite_address\":\"\",\"usage_gauge_period\":600000000000}}"
2022-02-18T10:04:44.610Z [DEBUG] storage.cache: creating LRU cache: size=0
2022-02-18T10:04:44.611Z [INFO]  core: Initializing VersionTimestamps for core
2022-02-18T10:04:44.611Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[[::]:8201]
2022-02-18T10:04:44.611Z [INFO]  core: stored unseal keys supported, attempting fetch
2022-02-18T10:04:44.611Z [INFO]  core: raft retry join initiated
2022-02-18T10:04:44.640Z [DEBUG] would have sent systemd notification (systemd not present): notification=READY=1
2022-02-18T10:04:44.640Z [DEBUG] core: starting cluster listeners
2022-02-18T10:04:44.640Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2022-02-18T10:04:44.640Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2022-02-18T10:04:44.641Z [INFO]  storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:1000000000, ElectionTimeout:1000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:500000000, LocalID:\"aed4746c-6e41-ba17-dd97-848b8b58d11b\", NotifyCh:(chan<- bool)(0xc00116d500), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.interceptLogger)(0xc0001f2840), NoSnapshotRestoreOnStart:true, skipStartup:false}"
2022-02-18T10:04:44.642Z [INFO]  storage.raft: initial configuration: index=37 servers="[{Suffrage:Voter ID:ae7aec79-1c42-bd66-e0bb-bff308e27ca2 Address:vault-0.vault-internal:8201} {Suffrage:Voter ID:8a322099-8e67-3c92-3687-fd0bd8bd8b10 Address:vault-1.vault-internal:8201} {Suffrage:Voter ID:aed4746c-6e41-ba17-dd97-848b8b58d11b Address:vault-2.vault-internal:8201}]"
2022-02-18T10:04:44.642Z [INFO]  core: vault is unsealed
2022-02-18T10:04:44.643Z [INFO]  storage.raft: entering follower state: follower="Node at vault-2.vault-internal:8201 [Follower]" leader=
2022-02-18T10:04:44.643Z [INFO]  core: entering standby mode
2022-02-18T10:04:44.689Z [INFO]  core: unsealed with stored key
2022-02-18T10:04:45.712Z [WARN]  storage.raft: heartbeat timeout reached, starting election: last-leader=
2022-02-18T10:04:45.712Z [INFO]  storage.raft: entering candidate state: node="Node at vault-2.vault-internal:8201 [Candidate]" term=354
2022-02-18T10:04:45.714Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-0.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:45.714Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-1.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:45.717Z [DEBUG] storage.raft: votes: needed=2
2022-02-18T10:04:45.717Z [DEBUG] storage.raft: vote granted: from=aed4746c-6e41-ba17-dd97-848b8b58d11b term=354 tally=1
2022-02-18T10:04:45.728Z [DEBUG] core.cluster-listener: performing client cert lookup
2022-02-18T10:04:45.734Z [DEBUG] core.cluster-listener: performing client cert lookup
2022-02-18T10:04:46.329Z [DEBUG] core.cluster-listener: performing server cert lookup
2022-02-18T10:04:46.343Z [DEBUG] storage.raft.raft-net: accepted connection: local-address=vault-2.vault-internal:8201 remote-address=172.16.23.221:52174
2022-02-18T10:04:47.144Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://vault-0.vault-internal:8201 active_redirect_addr=https://172.16.23.221:8200
2022-02-18T10:04:47.144Z [DEBUG] core: refreshing forwarding connection
2022-02-18T10:04:47.144Z [DEBUG] core: clearing forwarding clients
2022-02-18T10:04:47.144Z [DEBUG] core: done clearing forwarding clients
2022-02-18T10:04:47.144Z [DEBUG] core: done refreshing forwarding connection
2022-02-18T10:04:47.144Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-0.vault-internal:8201 alpn=req_fw_sb-act_v1 host=fw-ba9cdaac-d2c2-350e-2008-50bfea8641bf
2022-02-18T10:04:47.146Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""
2022-02-18T10:04:47.232Z [WARN]  storage.raft: Election timeout reached, restarting election
2022-02-18T10:04:47.232Z [INFO]  storage.raft: entering candidate state: node="Node at vault-2.vault-internal:8201 [Candidate]" term=355
2022-02-18T10:04:47.237Z [DEBUG] storage.raft: votes: needed=2
2022-02-18T10:04:47.237Z [DEBUG] storage.raft: vote granted: from=aed4746c-6e41-ba17-dd97-848b8b58d11b term=355 tally=1
2022-02-18T10:04:47.239Z [DEBUG] storage.raft: vote granted: from=ae7aec79-1c42-bd66-e0bb-bff308e27ca2 term=355 tally=2
2022-02-18T10:04:47.239Z [INFO]  storage.raft: election won: tally=2
2022-02-18T10:04:47.239Z [INFO]  storage.raft: entering leader state: leader="Node at vault-2.vault-internal:8201 [Leader]"
2022-02-18T10:04:47.239Z [INFO]  storage.raft: added peer, starting replication: peer=ae7aec79-1c42-bd66-e0bb-bff308e27ca2
2022-02-18T10:04:47.239Z [INFO]  storage.raft: added peer, starting replication: peer=8a322099-8e67-3c92-3687-fd0bd8bd8b10
2022-02-18T10:04:47.240Z [INFO]  storage.raft: pipelining replication: peer="{Voter ae7aec79-1c42-bd66-e0bb-bff308e27ca2 vault-0.vault-internal:8201}"
2022-02-18T10:04:47.243Z [INFO]  storage.raft: pipelining replication: peer="{Voter 8a322099-8e67-3c92-3687-fd0bd8bd8b10 vault-1.vault-internal:8201}"
2022-02-18T10:04:47.248Z [INFO]  core: acquired lock, enabling active operation
2022-02-18T10:04:47.249Z [DEBUG] core: generating cluster private key
2022-02-18T10:04:47.251Z [DEBUG] core: generating local cluster certificate
2022-02-18T10:04:47.307Z [INFO]  core: post-unseal setup starting
2022-02-18T10:04:47.307Z [DEBUG] core: clearing forwarding clients
2022-02-18T10:04:47.307Z [DEBUG] core: done clearing forwarding clients
2022-02-18T10:04:47.307Z [DEBUG] core: persisting feature flags
2022-02-18T10:04:47.307Z [DEBUG] core: forwarding: stopping heartbeating
2022-02-18T10:04:47.317Z [INFO]  core: loaded wrapping token key
2022-02-18T10:04:47.317Z [INFO]  core: successfully setup plugin catalog: plugin-directory="\"\""
2022-02-18T10:04:47.318Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2022-02-18T10:04:47.318Z [INFO]  core: successfully mounted backend: type=system path=sys/
2022-02-18T10:04:47.319Z [INFO]  core: successfully mounted backend: type=transit path=transit/
2022-02-18T10:04:47.319Z [INFO]  core: successfully mounted backend: type=kv path=secret/
2022-02-18T10:04:47.319Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2022-02-18T10:04:47.320Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2022-02-18T10:04:47.320Z [INFO]  core: successfully enabled credential backend: type=oidc path=oidc/
2022-02-18T10:04:47.320Z [INFO]  rollback: starting rollback manager
2022-02-18T10:04:47.321Z [INFO]  core: restoring leases
2022-02-18T10:04:47.321Z [DEBUG] identity: loading entities
2022-02-18T10:04:47.321Z [DEBUG] identity: entities collected: num_existing=7
2022-02-18T10:04:47.321Z [DEBUG] identity: entities loading: progress=0
2022-02-18T10:04:47.321Z [DEBUG] expiration: collecting leases
2022-02-18T10:04:47.321Z [DEBUG] expiration: leases collected: num_existing=82
2022-02-18T10:04:47.322Z [INFO]  identity: entities restored
2022-02-18T10:04:47.322Z [DEBUG] identity: identity loading groups
2022-02-18T10:04:47.322Z [DEBUG] identity: groups collected: num_existing=1
2022-02-18T10:04:47.323Z [DEBUG] identity: loading group: name=admin id=bb870247-83db-bb49-0fa7-90663f44869a
2022-02-18T10:04:47.323Z [INFO]  identity: groups restored
2022-02-18T10:04:47.323Z [DEBUG] identity: identity loading OIDC clients
2022-02-18T10:04:47.323Z [INFO]  core: starting raft active node
2022-02-18T10:04:47.323Z [INFO]  storage.raft: starting autopilot: config="&{false 0 10s 24h0m0s 1000 3 10s}" reconcile_interval=0s
2022-02-18T10:04:47.323Z [DEBUG] core: request forwarding setup function
2022-02-18T10:04:47.323Z [DEBUG] core: clearing forwarding clients
2022-02-18T10:04:47.323Z [DEBUG] core: done clearing forwarding clients
2022-02-18T10:04:47.323Z [DEBUG] core: leaving request forwarding setup function
2022-02-18T10:04:47.324Z [DEBUG] storage.raft.autopilot: autopilot is now running
2022-02-18T10:04:47.324Z [DEBUG] storage.raft.autopilot: state update routine is now running
2022-02-18T10:04:47.328Z [INFO]  expiration: lease restore complete
2022-02-18T10:04:47.358Z [INFO]  core: post-unseal setup complete
2022-02-18T10:04:47.385Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-0.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:47.395Z [DEBUG] core.cluster-listener: performing client cert lookup
2022-02-18T10:04:47.414Z [DEBUG] core.cluster-listener: creating rpc dialer: address=vault-1.vault-internal:8201 alpn=raft_storage_v1 host=raft-c15b9662-2d50-11da-29c3-6b62fa2e11fa
2022-02-18T10:04:47.424Z [DEBUG] core.cluster-listener: performing client cert lookup
2022-02-18T10:04:48.609Z [DEBUG] core.cluster-listener: performing server cert lookup
2022-02-18T10:04:48.626Z [DEBUG] core.request-forward: got request forwarding connection
2022-02-18T10:04:49.443Z [DEBUG] core.cluster-listener: performing server cert lookup
2022-02-18T10:04:49.458Z [DEBUG] core.request-forward: got request forwarding connection

Expected behavior
From reading the docs, we expected the restarted node to re-join the cluster as a follower rather than force an election.

Environment:

  • Vault Server Version (retrieve with vault status): 1.9.2
  • Vault CLI Version (retrieve with vault version): 1.9.2
  • Vault Helm chart version: 0.18.0
  • Server Operating System/Architecture: AWS EKS 1.20

Vault server configuration file(s):

disable_mlock = true
cluster_name = "vault"
ui = true
pid_file = "/vault/pids/vault.pid"
listener "tcp" {
  address = "[::]:8200"
  cluster_address = "[::]:8201"
  tls_cert_file = "/vault/tls/tls.crt"
  tls_key_file = "/vault/tls/tls.key"
  tls_disable_client_certs = true
  telemetry {
    unauthenticated_metrics_access = true
  }
}
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname = true
  enable_hostname_label = true
}
storage "raft" {
  path = "/vault/data"
  performance_multiplier = 1
  retry_join {
    leader_api_addr = "vault-0.vault-internal.vault.svc.cluster.local:8200"
    leader_tls_servername = "vault.vault.svc.cluster.local"
    leader_ca_cert_file = "/vault/tls-ca/ca.crt"
  }
  retry_join {
    leader_api_addr = "vault-1.vault-internal.vault.svc.cluster.local:8200"
    leader_tls_servername = "vault.vault.svc.cluster.local"
    leader_ca_cert_file = "/vault/tls-ca/ca.crt"
  }
  retry_join {
    leader_api_addr = "vault-2.vault-internal.vault.svc.cluster.local:8200"
    leader_tls_servername = "vault.vault.svc.cluster.local"
    leader_ca_cert_file = "/vault/tls-ca/ca.crt"
  }
}
service_registration "kubernetes" {}
@hsimon-hashicorp added the storage/raft, core/ha, and k8s labels on Feb 22, 2022
@stgarf-sx commented Mar 11, 2022

We are experiencing this behavior as well. We expected the cluster to stay online and serving in HA mode when a passive pod is lost. However, what we observe is that occasionally (with no real rhyme or reason, AFAICT) the returning passive pod triggers a full leadership election, making the entire cluster momentarily unavailable until quorum among the raft members is re-established. This doesn't match the HA design we expected: reads and writes should keep working, since the remaining raft members still hold quorum and there is nothing theoretically wrong with the cluster. Instead, the rejoining follower rejects the status quo for some unknown reason and forces a leader election (see the sketch below).
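For anyone reading along, the logs above show the mechanism: vault-2 enters follower state at 10:04:44.643 with no known leader, fires its 1s heartbeat timeout (HeartbeatTimeout:1000000000 in the raft config) at 10:04:45.712 before the leader re-establishes contact, bumps its term to 354 and then 355, and once vault-0 sees the newer term it must step down, quorum or not. Below is a minimal, hypothetical Go sketch of the two raft rules at play; this is simplified illustration, not hashicorp/raft's actual code.

// Hypothetical, simplified sketch of the behavior visible in the logs above.
// A freshly restarted follower that hears no heartbeat within its timeout
// increments its term and campaigns, and any leader that later sees the
// higher term must step down, even though the cluster still had quorum.
package main

import (
	"fmt"
	"time"
)

type node struct {
	id          string
	term        uint64
	state       string // "follower", "candidate", or "leader"
	lastContact time.Time
}

const heartbeatTimeout = time.Second // matches HeartbeatTimeout:1000000000 in the logs

// maybeCampaign models the follower side: with no heartbeat since startup,
// lastContact never advances, so the node bumps its term and campaigns.
func (n *node) maybeCampaign(now time.Time) {
	if n.state == "follower" && now.Sub(n.lastContact) > heartbeatTimeout {
		n.term++ // the disruptive step: term 353 -> 354 in the logs
		n.state = "candidate"
		fmt.Printf("%s: heartbeat timeout reached, starting election (term %d)\n", n.id, n.term)
	}
}

// observeTerm models the leader side: on seeing a newer term from any peer,
// a raft leader steps down ("peer has newer term, stopping replication").
func (n *node) observeTerm(peerTerm uint64) {
	if n.state == "leader" && peerTerm > n.term {
		n.term = peerTerm
		n.state = "follower"
		fmt.Printf("%s: peer has newer term, entering follower state\n", n.id)
	}
}

func main() {
	restarted := &node{id: "vault-2", term: 353, state: "follower", lastContact: time.Now()}
	leader := &node{id: "vault-0", term: 353, state: "leader"}

	// Simulate the ~1s window after unseal in which vault-2 heard nothing.
	restarted.maybeCampaign(restarted.lastContact.Add(2 * time.Second))
	leader.observeTerm(restarted.term)
}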

@ncabatoff (Contributor)

While the longer-term fix is probably going to be hashicorp/raft#494, note that #15042 should also address this issue, and will be released sooner.
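For context, hashicorp/raft#494 implements raft's pre-vote extension: before incrementing its term, a would-be candidate first polls its peers to ask whether they would grant a vote, and peers that are still hearing from a live leader refuse (compare "rejecting vote request since we have a leader" in the logs above). A rejoining node therefore can't bump the cluster's term just because its election timer fired before it heard from the leader. A hypothetical, simplified Go sketch of the idea follows; it is not the PR's actual implementation.

// Hypothetical sketch of the pre-vote idea behind hashicorp/raft#494.
// The key point: the candidate's term is NOT incremented during the
// pre-vote round, so a failed probe causes no disruption.
package main

import "fmt"

type peer struct {
	id        string
	term      uint64
	hasLeader bool // still receiving heartbeats from a live leader
}

// grantPreVote is the peer-side check: refuse if we already have a leader.
func (p *peer) grantPreVote(candidateTerm uint64) bool {
	if p.hasLeader {
		return false
	}
	return candidateTerm >= p.term
}

// preVote is the candidate-side probe: poll peers with term+1, but only
// proceed to a real (term-incrementing) election on a majority of yeses.
func preVote(myTerm uint64, peers []*peer) bool {
	votes := 1 // vote for self
	for _, p := range peers {
		if p.grantPreVote(myTerm + 1) {
			votes++
		}
	}
	return votes > (len(peers)+1)/2
}

func main() {
	peers := []*peer{
		{id: "vault-0", term: 353, hasLeader: true},
		{id: "vault-1", term: 353, hasLeader: true},
	}
	if !preVote(353, peers) {
		// Without pre-vote, vault-2 would already have moved to term 354 here.
		fmt.Println("pre-vote failed: stay follower, keep term, no disruption")
	}
}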

@awnumar commented Feb 9, 2023

❓ Since hashicorp/raft#494 is merged now, is this issue resolved?

@justingood (Author)

Yes, I can confirm we're no longer seeing the problem after those two fixes were released 👍
