
Stepdown with DynamoDB backend can put you in a state where no leader ever gets elected #5828

Closed
mahmoudm (Contributor) opened this issue Nov 20, 2018 · 0 comments

Describe the bug
Leader stepdown with the DynamoDB HA backend can put you in a state where no leader exists and the lock cannot be acquired by any node.

Vault leadership with Dynamo works as follows:

When it writes the lock, writeItem has the following logic (code: func (l *DynamoDBLock) writeItem() error):

    // If both key and path already exist, we can only write if
    // A. identity is equal to our identity (or the identity doesn't exist)
    // or
    // B. The ttl on the item is <= to the current time
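
This condition maps naturally onto a DynamoDB conditional PutItem. Below is a minimal sketch (not the actual Vault backend code) of how such a conditional write could look with aws-sdk-go; the table name, attribute names, and the tryWriteLock helper are illustrative assumptions, not Vault's real schema.

    // Minimal sketch (not the actual Vault backend code) of the conditional
    // write described above, using aws-sdk-go. The table name, attribute
    // names, and the tryWriteLock helper are illustrative assumptions.
    package main

    import (
        "fmt"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/dynamodb"
    )

    // tryWriteLock attempts the conditional put. The write succeeds only if
    // A. the identity attribute is missing or equal to ours, or
    // B. the item's expiry is <= the current time.
    func tryWriteLock(svc *dynamodb.DynamoDB, table, path, key, identity string, ttl time.Duration) error {
        now := time.Now()
        _, err := svc.PutItem(&dynamodb.PutItemInput{
            TableName: aws.String(table),
            Item: map[string]*dynamodb.AttributeValue{
                "Path":     {S: aws.String(path)},
                "Key":      {S: aws.String(key)},
                "Identity": {S: aws.String(identity)},
                "Expires":  {N: aws.String(fmt.Sprintf("%d", now.Add(ttl).UnixNano()))},
            },
            ConditionExpression: aws.String(
                "attribute_not_exists(#identity) OR #identity = :identity OR #expires <= :now"),
            ExpressionAttributeNames: map[string]*string{
                "#identity": aws.String("Identity"),
                "#expires":  aws.String("Expires"),
            },
            ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
                ":identity": {S: aws.String(identity)},
                ":now":      {N: aws.String(fmt.Sprintf("%d", now.UnixNano()))},
            },
        })
        // A ConditionalCheckFailedException here means another node holds an
        // unexpired lock under a different identity.
        return err
    }

    func main() {
        svc := dynamodb.New(session.Must(session.NewSession()))
        _ = tryWriteLock(svc, "vault-data", "core/lock", "held", "node-a-uuid", 15*time.Second)
    }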

In an infinite loop (code: for {):

  1. Generate an identity UID
  2. Acquire the lock using the identity generated in step 1; block until you have the lock, retrying every 1 second.
    1. Lock acquisition is done via a successful writeItem
    2. Once the lock is acquired, a stopLeaderCh is returned that can be used to “give up” the lock.
  3. Once the lock is acquired, two goroutines are started (see the sketch after this list):
    1. renewLock - runs every 5 seconds, calls writeItem. Can be stopped by stopLeaderCh (code: go l.periodicallyRenewLock(leader))
    2. watchLock - runs every 5 seconds, calls stopLeaderCh if the lock no longer exists / its identity changed (code: go l.watch(leader))
  4. Wait for stopLeaderCh OR a manual step-down
  5. Clear leadership, delete the lock
    1. Note that this DOES NOT trigger the stopLeaderCh
  6. Go to step 1
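
To make the ordering problem concrete, here is a simplified control-flow sketch (not the actual Vault code); dynamoLock, its helper methods, and the manualStepDown channel are hypothetical stand-ins that mirror steps 1-6 above.

    // Simplified control-flow sketch (not the actual Vault code). dynamoLock,
    // its helper methods, and the manualStepDown channel are hypothetical
    // stand-ins that mirror steps 1-6 above.
    package main

    import "time"

    type dynamoLock struct {
        identity string
    }

    func (l *dynamoLock) writeItem() error             { return nil } // conditional put (step 2 / 3.a)
    func (l *dynamoLock) deleteItem() error            { return nil } // step 5
    func (l *dynamoLock) lockExistsWithIdentity() bool { return true } // polled by 3.b

    // 3.a renewLock: rewrites the item every 5 seconds until stopLeaderCh closes.
    func (l *dynamoLock) periodicallyRenewLock(stopLeaderCh <-chan struct{}) {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                l.writeItem() // re-creates the lock even if step 5 already deleted it
            case <-stopLeaderCh:
                return
            }
        }
    }

    // 3.b watchLock: closes stopLeaderCh if the lock disappears or changes owner.
    func (l *dynamoLock) watch(stopLeaderCh chan struct{}) {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            if !l.lockExistsWithIdentity() {
                close(stopLeaderCh)
                return
            }
        }
    }

    func main() {
        manualStepDown := make(chan struct{}) // signalled elsewhere on a manual step-down request
        for {
            l := &dynamoLock{identity: "fresh-uuid"} // step 1
            for l.writeItem() != nil {               // step 2: retry every second
                time.Sleep(time.Second)
            }
            stopLeaderCh := make(chan struct{})
            go l.periodicallyRenewLock(stopLeaderCh) // 3.a
            go l.watch(stopLeaderCh)                 // 3.b

            select { // step 4
            case <-stopLeaderCh:
            case <-manualStepDown:
            }

            // Step 5: the lock is deleted, but stopLeaderCh is NOT closed on a
            // manual step-down, so 3.a may outlive this iteration, win the race
            // against 3.b, and re-create the lock under the old identity.
            l.deleteItem()
            // Step 6: loop back to step 1; in the real backend the fresh identity
            // can never satisfy the conditional write while the old renewer keeps
            // bumping the TTL.
        }
    }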

The race condition: once step 5 occurs, there is a race between 3.a and 3.b. If 3.a runs after the delete, the lock is re-created and will keep being renewed forever, even though this host has given up leadership. The host is now stuck at step 2, unable to acquire the lock, and no other node can acquire it either. If 3.b runs first, it detects that the lock was deleted and calls stopLeaderCh, which stops renewLock and lets a new leader be elected.

This only happens on a manual step-down, because during a shutdown stopLeaderCh is closed, so there is no goroutine left continuously renewing the lock.

Environment:

  • Vault Server Version 0.11.4