
Vault Leader flapping in HA mode with DynamoDB #6572

Closed

awwithro opened this issue Apr 11, 2019 · 5 comments

awwithro commented Apr 11, 2019

Describe the bug
I've set up a new Vault cluster with three nodes, using AWS KMS for auto-unseal and DynamoDB for the backing store. Watching vault status on a node shows that leadership is changing between all three nodes nearly every second.

To Reproduce
Steps to reproduce the behavior:

  1. Run watch vault status
  2. Observe flappy leaders

Expected behavior
The leader should be stable and only change in response to a failed server or other unhealthy state.

Environment:

  • Vault Server Version (retrieve with vault status): 1.1.0
  • Vault CLI Version (retrieve with vault version): 1.1.0 ('36aa8c8dd1936e10ebd7a4c1d412ae0e6f7900bd')
  • Server Operating System/Architecture:
    Docker Image vault:1.1.0
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.8.4

The vault instances are running in Kubernetes

Vault server configuration file(s):

    vault.hcl: |
      ui = true

      storage "dynamodb" {
        ha_enabled = "true"
        region     = "us-west-2"
        table      = "<my-dynamodb-table>"
      }

      listener "tcp" {
        address     = "0.0.0.0:8200"
        tls_cert_file = "/etc/vault/certs/tls.crt"
        tls_key_file = "/etc/vault/certs/tls.key"
      }
      log_level = "trace"
      seal "awskms" {
        kms_key_id = "<my kms key id>"
        region     = "us-west-2"
      }

I'm also using the following env vars:

          - name: VAULT_API_ADDR
            value: https://$(POD_NAME).vault.<namespace>.svc.cluster.local:8200
          - name: VAULT_CLUSTER_ADDR
            value: https://$(POD_NAME).vault.<namespace>.svc.cluster.local:8201
          - name: VAULT_ADDR
            value: https://vault.<namespace>.svc.cluster.local:8200
          - name: VAULT_CACERT
            value: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
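To double-check what those variables actually resolve to inside a given pod, something like the following works (just a sketch; the pod and namespace names are placeholders matching the config above):

    # Print the VAULT_* environment as seen by one of the pods.
    kubectl -n <namespace> exec vault-0 -- env | grep VAULT_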

Additional context
I'm running with the log level set to trace and don't see anything in the logs that would suggest an issue, or even that the leader is changing.

==> Vault server configuration:

           AWS KMS KeyID: <key id>
          AWS KMS Region: us-west-2
               Seal Type: awskms
             Api Address: https://vault-1.vault.<namespace>.svc.cluster.local:8200
                     Cgo: disabled
         Cluster Address: https://vault-1.vault.<namespace>.svc.cluster.local:8201
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: trace
                   Mlock: supported: true, enabled: true
                 Storage: dynamodb (HA available)
                 Version: Vault v1.1.0
             Version Sha: 36aa8c8dd1936e10ebd7a4c1d412ae0e6f7900bd

==> Vault server started! Log data will stream in below:

2019-04-11T23:27:04.246Z [DEBUG] storage.cache: creating LRU cache: size=0
2019-04-11T23:27:04.282Z [DEBUG] cluster listener addresses synthesized: cluster_addresses=[0.0.0.0:8201]
2019-04-11T23:27:04.289Z [INFO]  core: stored unseal keys supported, attempting fetch
2019-04-11T23:27:04.326Z [INFO]  core: vault is unsealed
2019-04-11T23:27:04.327Z [DEBUG] core: starting cluster listeners
2019-04-11T23:27:04.327Z [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
2019-04-11T23:27:04.327Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2019-04-11T23:27:04.327Z [INFO]  core: entering standby mode
2019-04-11T23:27:04.329Z [INFO]  core: unsealed with stored keys: stored_keys_used=1
2019-04-11T23:27:06.832Z [TRACE] core: found new active node information, refreshing
2019-04-11T23:27:06.835Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://vault-0.vault.<namespace>.svc.cluster.local:8201 active_redirect_addr=https://vault-0.vault.<namespace>.svc.cluster.local:8200
2019-04-11T23:27:06.836Z [DEBUG] core: refreshing forwarding connection
2019-04-11T23:27:06.836Z [DEBUG] core: clearing forwarding clients
2019-04-11T23:27:06.836Z [DEBUG] core: done clearing forwarding clients
2019-04-11T23:27:06.836Z [DEBUG] core: done refreshing forwarding connection
2019-04-11T23:27:06.836Z [DEBUG] core: creating rpc dialer: host=fw-2339a440-0c09-8986-a766-064f5a32f3d0
2019-04-11T23:27:06.864Z [DEBUG] core.cluster-listener: performing client cert lookup

Let me know if any other info would be useful

@awwithro (Author)

Looking at CloudWatch metrics for DynamoDB shows that this otherwise unused Vault cluster is consuming about 1.3 read units/sec and 2.3 write units/sec, which I'm assuming is the nodes fighting over the lock.

@awwithro (Author)

I've also added the AmazonDynamoDBFullAccess policy to the role used by Vault, to rule out an IAM issue with DynamoDB.
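A rough way to sanity-check the role's access outside of Vault is something like the following (only a sketch; it assumes the AWS CLI is available under the same IAM role Vault uses, with the table name and region from the config above):

    # If the role can describe and scan the table, basic DynamoDB access
    # isn't the problem.
    aws dynamodb describe-table --region us-west-2 --table-name <my-dynamodb-table>
    aws dynamodb scan --region us-west-2 --table-name <my-dynamodb-table> --max-items 5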

kalafut (Contributor) commented Apr 12, 2019

Hi. You may want to try Vault 1.1.1, which was just released, as it has some improvements for DynamoDB HA handling (#5828). The race condition in that issue is a bit different, but your setup may be triggering it in other ways.

@awwithro (Author)

Thanks!

Just bumped to 1.1.1 and I'm still seeing the same behavior. One interesting thing I'm noticing: when I run watch vault status on all the nodes at once, regardless of which node is active, the other nodes always report https://vault-0.vault.<namespace>.svc.cluster.local:8200 as the active node. The HA Cluster Address is also listed as https://vault-0.vault.<namespace>.svc.cluster.local:8201.

@awwithro (Author)

...and now I feel silly. The VAULT_ADDR I'm using will land on any of the three nodes, so of course it'll show standby/active at different times. Running the same watch with the VAULT_ADDR of the local node works as expected, and leadership is stable. False alarm!
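For anyone who lands here with the same symptom, here's a minimal way to check leadership per node rather than through the shared service address (a sketch, assuming three pods named vault-0 through vault-2 behind a headless service, as in the config above):

    # Query each pod's own DNS name so the reported active/standby state
    # reflects that node only, not whichever node the service VIP picks.
    for i in 0 1 2; do
      echo "--- vault-$i ---"
      VAULT_ADDR="https://vault-$i.vault.<namespace>.svc.cluster.local:8200" vault status
    done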
