DynamoDB backed HA fails to release locks #26580

Open
dhumphries-sainsburys opened this issue Apr 22, 2024 · 0 comments
Describe the bug
Our clusters use S3 for storage and DynamoDB for ha_storage in a 3-replica configuration on EKS. We have had a few instances where an underlying node failed while the active Vault node was running on it, and Vault as a whole stopped working. From what we can see, whenever the active node is disabled by any means other than terminating the pod, the standby replicas fail to take over leadership and Vault stops servicing requests. This appears to be because the lock record in DynamoDB is never released, preventing a new leader from acquiring the lock, even though the code suggests a 15-second TTL for it.
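
To check this, the HA table can be inspected while the cluster is stuck. A minimal sketch using boto3 (assuming the AWS SDK and credentials are available; it dumps every item rather than assuming the exact attribute layout of the lock record Vault writes):

# Scan the DynamoDB HA table and print every item, to confirm a stale
# lock record is still present long after the active node has died.
import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

paginator = dynamodb.get_paginator("scan")
for page in paginator.paginate(TableName="vault-ha-storage"):
    for item in page["Items"]:
        print(item)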

Things we have tried, all of which resulted in Vault failing rather than failing over:

  • Removing network interfaces from underlying hosts
  • Removing security groups
  • Disabling kubelet on the underlying host

To Reproduce
Steps to reproduce the behavior:

  1. Install Vault using the included config
  2. Disable the underlying host that runs the current active node by some means other than terminating the pod (removing the security groups that allow communication is probably the easiest; see the sketch after this list)
  3. Observe that the other Vault nodes do not take over (I waited up to 30 minutes, while the docs suggest a 15-second TTL)
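
For step 2, a sketch of the isolation we performed. The instance and security-group IDs are placeholders, and the deny-all group is assumed to have been created beforehand with all of its rules stripped so it permits no traffic:

# Isolate the EC2 instance hosting the active Vault pod by replacing its
# security groups with one that allows nothing in or out.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder: node running the active Vault pod
DENY_ALL_SG = "sg-0123456789abcdef0"  # placeholder: pre-created group with no rules

# Replaces every security group on the instance, cutting it off from the
# rest of the cluster without touching the pod itself.
ec2.modify_instance_attribute(InstanceId=INSTANCE_ID, Groups=[DENY_ALL_SG])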

Expected behavior
Within 15 seconds of the active Vault node becoming unavailable or unable to service requests, one of the standby nodes takes over.
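
For illustration, these are the takeover semantics we expected from a TTL-based lock, sketched as a DynamoDB conditional write. The schema and attribute names (LockKey, Holder, Expires) are our own illustrative assumptions, not the actual layout used by Vault's dynamodb backend:

# A standby should be able to claim the leader lock as soon as the previous
# holder's expiry timestamp is in the past.
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")
TABLE = "vault-ha-storage"
TTL_SECONDS = 15  # the TTL the docs and code suggest

def try_acquire(holder_id):
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName=TABLE,
            Item={
                "LockKey": {"S": "core/lock"},           # illustrative key
                "Holder": {"S": holder_id},
                "Expires": {"N": str(now + TTL_SECONDS)},
            },
            # Succeed only if no lock exists yet or the old one has expired.
            ConditionExpression="attribute_not_exists(#exp) OR #exp < :now",
            ExpressionAttributeNames={"#exp": "Expires"},
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # an unexpired lock is still held
        raise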

Environment:

  • Vault Server Version (retrieve with vault status):
/vault/config # vault status -tls-skip-verify
WARNING! VAULT_ADDR and -address unset. Defaulting to https://127.0.0.1:8200.
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.16.1
Build Date             2024-04-03T12:35:53Z
Storage Type           s3
Cluster Name           vault-cluster-9c8fea11
Cluster ID             52acdb68-3327-bed2-f56a-f1eab38f7dbd
HA Enabled             true
HA Cluster             https://vault:8201
HA Mode                standby
Active Node Address    https://vault:8200
  • Vault CLI Version (retrieve with vault version):
/vault/config # vault version
Vault v1.16.1 (6b5986790d7748100de77f7f127119c4a0f78946), built 2024-04-03T12:35:53Z
  • Server Operating System/Architecture:
    bottlerocket-aws-k8s-1.28-x86_64-v1.19.4-4f0a078e

Vault server configuration file(s):

{"api_addr":"https://vault:8200","default_lease_ttl":"4320h","ha_storage":{"dynamodb":{"ha_enabled":"true","region":"eu-west-1","table":"vault-ha-storage"}},"listener":[{"tcp":{"address":"0.0.0.0:8200","tls_cert_file":"/vault/tls/server.crt","tls_key_file":"/vault/tls/server.key"}}],"max_lease_ttl":"4320h","service_registration":{"kubernetes":{"namespace":"vault"}},"storage":{"s3":{"bucket":"vault-lab-ie-core","region":"eu-west-1"}},"telemetry":{"statsd_address":"localhost:9125"},"ui":true}

