operator step-down causes vault to hang when connectivity to postgres is lost #10619

Open
ffung opened this issue Dec 24, 2020 · 0 comments
Labels
core/ha specific to high-availability secret/database/postgresql

Comments

ffung commented Dec 24, 2020

Describe the bug
When connectivity to a postgres database has been lost, an operator step-down causes Vault to hang and become unresponsive for an extended period (more than 5 minutes). This is an HA issue and is blocking our DR test.

To Reproduce
Steps to reproduce the behavior:

  1. Confirm credentials can be retrieved: vault read database/creds/my-role
  2. Confirm which node is the leader: vault status
  3. Block traffic to postgres on the leader: iptables -A OUTPUT -p tcp -d <db host> --dport 9142 -j DROP
  4. Confirm credentials can no longer be retrieved: vault read database/creds/my-role returns "Error reading database/creds/my-role: context deadline exceeded"
  5. Issue vault operator step-down, which reports "Success! Stepped down: https://vault.xxxx"
  6. Issue vault status, which fails with "Error checking seal status: context deadline exceeded"
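For convenience, the steps above can be sketched as a small script. This is only a rough sketch: DB_HOST (a stand-in for <db host>), the DRY_RUN switch, the run helper and the 10-second timeout are assumptions added for illustration and are not part of the original reproduction. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DB_HOST, DRY_RUN and the 10-second
# timeout are assumptions for illustration, not values from the report.
DB_HOST="${DB_HOST:-db.example.internal}"   # hypothetical placeholder for <db host>
DB_PORT="${DB_PORT:-9142}"                  # port used in the report
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

# Print the command, and execute it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

reproduce() {
    run vault read database/creds/my-role    # step 1: credentials are issued
    run vault status                         # step 2: check which node is leader
    # step 3: drop outbound traffic to postgres on the leader
    run iptables -A OUTPUT -p tcp -d "$DB_HOST" --dport "$DB_PORT" -j DROP
    run vault read database/creds/my-role    # step 4: expected to fail
    run vault operator step-down             # step 5
    # step 6: bound the wait so the script itself does not hang
    run timeout 10 vault status
}

# Delete the DROP rule again so the node can reach postgres.
cleanup() {
    run iptables -D OUTPUT -p tcp -d "$DB_HOST" --dport "$DB_PORT" -j DROP
}

reproduce
cleanup
```

To actually execute the commands, run it as root on the leader with DRY_RUN=0 and DB_HOST set to the real database host. The timeout wrapper (GNU coreutils) is only there so the final status check returns even while Vault is unresponsive.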

Expected behavior
Step-down succeeds and vault is responsive.

Environment:

  • Vault Server Version (retrieve with vault status):
Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            1
Threshold               1
Version                 1.6.1
Cluster Name            vault-cluster-edfc2327
Cluster ID              
HA Enabled              true
HA Cluster              
HA Mode                 active
Raft Committed Index    388093
Raft Applied Index      388093
  • Vault CLI Version (retrieve with vault version):
    1.5.5

  • Server Operating System/Architecture:
    Linux flatcar #1 SMP Fri Oct 23 16:42:52 -00 2020 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux

Additional context
This seems to be related to the postgres-database-plugin not handling connection failures resiliently (#6792), combined with a step-down forcing the plugin to shut down gracefully.

Attached are goroutine traces of the Vault process on the leader, captured while stepping down.
log.json.zip
