
Stepdown with DynamoDB backend can put you in a state where no leader ever gets elected #5828

Closed
mahmoudm (Contributor) opened this issue Nov 20, 2018 · 0 comments

Describe the bug
Leader stepdown with the DynamoDB HA backend can put you in a state where no leader exists and the lock cannot be acquired by any node.

Vault leadership with Dynamo works as follows:

When it writes the lock, writeItem has the following logic (code: func (l *DynamoDBLock) writeItem() error):

    // If both key and path already exist, we can only write if
    // A. identity is equal to our identity (or the identity doesn't exist)
    // or
    // B. The ttl on the item is <= to the current time
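
This condition maps naturally onto a DynamoDB conditional PutItem. Below is a minimal sketch (not the actual Vault backend code) of how such a conditional write could look with aws-sdk-go; the table name, attribute names, and the tryWriteLock helper are illustrative assumptions, not Vault's real schema.

    // Minimal sketch (not the actual Vault backend code) of the conditional
    // write described above, using aws-sdk-go. The table name, attribute
    // names, and the tryWriteLock helper are illustrative assumptions.
    package main

    import (
        "fmt"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/dynamodb"
    )

    // tryWriteLock attempts the conditional put. The write succeeds only if
    // A. the identity attribute is missing or equal to ours, or
    // B. the item's expiry is <= the current time.
    func tryWriteLock(svc *dynamodb.DynamoDB, table, path, key, identity string, ttl time.Duration) error {
        now := time.Now()
        _, err := svc.PutItem(&dynamodb.PutItemInput{
            TableName: aws.String(table),
            Item: map[string]*dynamodb.AttributeValue{
                "Path":     {S: aws.String(path)},
                "Key":      {S: aws.String(key)},
                "Identity": {S: aws.String(identity)},
                "Expires":  {N: aws.String(fmt.Sprintf("%d", now.Add(ttl).UnixNano()))},
            },
            ConditionExpression: aws.String(
                "attribute_not_exists(#identity) OR #identity = :identity OR #expires <= :now"),
            ExpressionAttributeNames: map[string]*string{
                "#identity": aws.String("Identity"),
                "#expires":  aws.String("Expires"),
            },
            ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
                ":identity": {S: aws.String(identity)},
                ":now":      {N: aws.String(fmt.Sprintf("%d", now.UnixNano()))},
            },
        })
        // A ConditionalCheckFailedException here means another node holds an
        // unexpired lock under a different identity.
        return err
    }

    func main() {
        svc := dynamodb.New(session.Must(session.NewSession()))
        _ = tryWriteLock(svc, "vault-data", "core/lock", "held", "node-a-uuid", 15*time.Second)
    }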

In an infinite loop (code: for {):

  1. Generate an identity UID
  2. Acquire the lock using the identity generated in step 1; block until you have the lock, retrying every 1 second.
    1. Lock acquisition is done via a successful writeItem
    2. Once the lock is acquired, a stopLeaderCh is returned that can be used to “give up” the lock.
  3. Once the lock is acquired, two goroutines are started (see the sketch after this list):
    1. renewLock - runs every 5 seconds, calls writeItem. Can be stopped by stopLeaderCh (code: go l.periodicallyRenewLock(leader))
    2. watchLock - runs every 5 seconds, calls stopLeaderCh if the lock no longer exists / its identity changed (code: go l.watch(leader))
  4. Wait for stopLeaderCh OR a manual step-down
  5. Clear leadership, delete the lock
    1. Note that this DOES NOT trigger the stopLeaderCh
  6. Go to step 1
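
To make the ordering problem concrete, here is a simplified control-flow sketch (not the actual Vault code); dynamoLock, its helper methods, and the manualStepDown channel are hypothetical stand-ins that mirror steps 1-6 above.

    // Simplified control-flow sketch (not the actual Vault code). dynamoLock,
    // its helper methods, and the manualStepDown channel are hypothetical
    // stand-ins that mirror steps 1-6 above.
    package main

    import "time"

    type dynamoLock struct {
        identity string
    }

    func (l *dynamoLock) writeItem() error             { return nil } // conditional put (step 2 / 3.a)
    func (l *dynamoLock) deleteItem() error            { return nil } // step 5
    func (l *dynamoLock) lockExistsWithIdentity() bool { return true } // polled by 3.b

    // 3.a renewLock: rewrites the item every 5 seconds until stopLeaderCh closes.
    func (l *dynamoLock) periodicallyRenewLock(stopLeaderCh <-chan struct{}) {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                l.writeItem() // re-creates the lock even if step 5 already deleted it
            case <-stopLeaderCh:
                return
            }
        }
    }

    // 3.b watchLock: closes stopLeaderCh if the lock disappears or changes owner.
    func (l *dynamoLock) watch(stopLeaderCh chan struct{}) {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            if !l.lockExistsWithIdentity() {
                close(stopLeaderCh)
                return
            }
        }
    }

    func main() {
        manualStepDown := make(chan struct{}) // signalled elsewhere on a manual step-down request
        for {
            l := &dynamoLock{identity: "fresh-uuid"} // step 1
            for l.writeItem() != nil {               // step 2: retry every second
                time.Sleep(time.Second)
            }
            stopLeaderCh := make(chan struct{})
            go l.periodicallyRenewLock(stopLeaderCh) // 3.a
            go l.watch(stopLeaderCh)                 // 3.b

            select { // step 4
            case <-stopLeaderCh:
            case <-manualStepDown:
            }

            // Step 5: the lock is deleted, but stopLeaderCh is NOT closed on a
            // manual step-down, so 3.a may outlive this iteration, win the race
            // against 3.b, and re-create the lock under the old identity.
            l.deleteItem()
            // Step 6: loop back to step 1; in the real backend the fresh identity
            // can never satisfy the conditional write while the old renewer keeps
            // bumping the TTL.
        }
    }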

The race condition: once step 5 occurs, there is a race between 3.a and 3.b. If 3.a runs after the delete, the lock is re-created and will keep being renewed forever, even though this host has given up leadership. The host is now stuck at step 2, unable to acquire the lock, and no other node can acquire it either. If 3.b runs first, it detects that the lock was deleted and calls stopLeaderCh, which stops renewLock and lets a new leader be elected.

This only happens on a manual step-down, because during a shutdown stopLeaderCh is closed, so there is no goroutine left continuously renewing the lock.

Environment:

  • Vault Server Version 0.11.4