
Vault Leader flapping in HA mode with DynamoDB #6572

Closed

awwithro opened this issue Apr 11, 2019 · 5 comments

awwithro commented Apr 11, 2019

Describe the bug
I've set up a new Vault cluster with three nodes, using AWS KMS for auto-unseal and DynamoDB for the backing store. Watching vault status on a node shows that leadership is changing between all three nodes nearly every second.

To Reproduce
Steps to reproduce the behavior:

  1. Run watch vault status
  2. Observe flappy leaders

Expected behavior
The leader should be stable and only change in response to a failed server or other unhealthy state.

Environment:

  • Vault Server Version (retrieve with vault status): 1.1.0
  • Vault CLI Version (retrieve with vault version): 1.1.0 ('36aa8c8dd1936e10ebd7a4c1d412ae0e6f7900bd')
  • Server Operating System/Architecture:
    Docker Image vault:1.1.0
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.8.4

The vault instances are running in Kubernetes

Vault server configuration file(s):

    vault.hcl: |
      ui = true

      storage "dynamodb" {
        ha_enabled = "true"
        region     = "us-west-2"
        table      = "<my-dynamodb-table>"
      }

      listener "tcp" {
        address     = "0.0.0.0:8200"
        tls_cert_file = "/etc/vault/certs/tls.crt"
        tls_key_file = "/etc/vault/certs/tls.key"
      }
      log_level = "trace"
      seal "awskms" {
        kms_key_id = "<my kms key id>"
        region     = "us-west-2"
      }

I'm also using the following env vars:

          - name: VAULT_API_ADDR
            value: https://$(POD_NAME).vault.<namespace>.svc.cluster.local:8200
          - name: VAULT_CLUSTER_ADDR
            value: https://$(POD_NAME).vault.<namespace>.svc.cluster.local:8201
          - name: VAULT_ADDR
            value: https://vault.<namespace>.svc.cluster.local:8200
          - name: VAULT_CACERT
            value: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
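To double-check what those variables actually resolve to inside a given pod, something like the following works (just a sketch; the pod and namespace names are placeholders matching the config above):

    # Print the VAULT_* environment as seen by one of the pods.
    kubectl -n <namespace> exec vault-0 -- env | grep VAULT_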

Additional context
I'm running with the log level set to trace and don't see anything in the logs that would suggest an issue, or even that the leader is changing.

==> Vault server configuration:

           AWS KMS KeyID: <key id>
          AWS KMS Region: us-west-2
               Seal Type: awskms
             Api Address: https://vault-1.vault.<namespace>.svc.cluster.local:8200
                     Cgo: disabled
         Cluster Address: https://vault-1.vault.<namespace>.svc.cluster.local:8201
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: trace
                   Mlock: supported: true, enabled: true
                 Storage: dynamodb (HA available)
                 Version: Vault v1.1.0
             Version Sha: 36aa8c8dd1936e10ebd7a4c1d412ae0e6f7900bd

==> Vault server started! Log data will stream in below:

2019-04-11T23:27:04.246Z [DEBUG] storage.cache: creating LRU cache: size=0
2019-04-11T23:27:04.282Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[0.0.0.0:8201]
2019-04-11T23:27:04.289Z [INFO]  core: stored unseal keys supported, attempting fetch
2019-04-11T23:27:04.326Z [INFO]  core: vault is unsealed
2019-04-11T23:27:04.327Z [DEBUG] core: starting cluster listeners
2019-04-11T23:27:04.327Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
2019-04-11T23:27:04.327Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2019-04-11T23:27:04.327Z [INFO]  core: entering standby mode
2019-04-11T23:27:04.329Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
2019-04-11T23:27:06.832Z [TRACE] core: found new active node information, refreshing
2019-04-11T23:27:06.835Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://vault-0.vault.<namespace>.svc.cluster.local:8201 active_redirect_addr=https://vault-0.vault.<namespace>.svc.cluster.local:8200
2019-04-11T23:27:06.836Z [DEBUG] core: refreshing forwarding connection
2019-04-11T23:27:06.836Z [DEBUG] core: clearing forwarding clients
2019-04-11T23:27:06.836Z [DEBUG] core: done clearing forwarding clients
2019-04-11T23:27:06.836Z [DEBUG] core: done refreshing forwarding connection
2019-04-11T23:27:06.836Z [DEBUG] core: creating rpc dialer: host=fw-2339a440-0c09-8986-a766-064f5a32f3d0
2019-04-11T23:27:06.864Z [DEBUG] core.cluster-listener: performing client cert lookup

Let me know if any other info would be useful

@awwithro (Author)

Looking at CloudWatch metrics for DynamoDB shows that this otherwise unused Vault cluster is consuming about 1.3 read units/sec and 2.3 write units/sec, which I'm assuming is the nodes fighting over the lock.

@awwithro (Author)

I've also added the AmazonDynamoDBFullAccess policy to the role used by Vault, to rule out an IAM issue with DynamoDB.
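A rough way to sanity-check the role's access outside of Vault is something like the following (only a sketch; it assumes the AWS CLI is available under the same IAM role Vault uses, with the table name and region from the config above):

    # If the role can describe and scan the table, basic DynamoDB access
    # isn't the problem.
    aws dynamodb describe-table --region us-west-2 --table-name <my-dynamodb-table>
    aws dynamodb scan --region us-west-2 --table-name <my-dynamodb-table> --max-items 5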

kalafut (Contributor) commented Apr 12, 2019

Hi. You may want to try Vault 1.1.1, which was just released, as it has some improvements for DynamoDB HA handling (#5828). The race condition in that issue is a bit different, but your setup may be triggering it in other ways.

@awwithro (Author)

Thanks!

Just bumped to 1.1.1 and I'm still seeing the same behavior. One interesting thing I'm noticing: when I run watch vault status on all the nodes at once, regardless of which node is active, the other nodes always report https://vault-0.vault.<namespace>.svc.cluster.local:8200 as the active node. The HA Cluster Address is also listed as https://vault-0.vault.<namespace>.svc.cluster.local:8201.

@awwithro (Author)

...and now I feel silly. The VAULT_ADDR I'm using will land on any of the three nodes, so of course it'll show standby/active at different times. Running the same watch with the VAULT_ADDR of the local node works as expected, and leadership is stable. False alarm!
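For anyone who lands here with the same symptom, here's a minimal way to check leadership per node rather than through the shared service address (a sketch, assuming three pods named vault-0 through vault-2 behind a headless service, as in the config above):

    # Query each pod's own DNS name so the reported active/standby state
    # reflects that node only, not whichever node the service VIP picks.
    for i in 0 1 2; do
      echo "--- vault-$i ---"
      VAULT_ADDR="https://vault-$i.vault.<namespace>.svc.cluster.local:8200" vault status
    done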
