
vault performance degradation due to fragmented bbolt db when using raft backend #11072

Closed
write0nly opened this issue Mar 10, 2021 · 6 comments
Labels: bug, storage/raft

Comments

@write0nly

Describe the bug

Vault environment using 5 nodes with the Vault 1.6.0 raft backend. There are about 1000 approle role_ids and, unfortunately, apps that churn through more than 5 million approle logins a day due to batch processing, leaving the background token/lease expiration to clean the tokens up in bulk.

Over time the Vault boltdb becomes more and more fragmented, to the point where the fragmentation makes boltdb behave in a completely different way than expected: its normal I/O write sizes change from ~16k to ~5M. When that happens Vault goes from doing 40MB/s of writes to about 300MB/s, starts blocking on I/O, and a number of operations time out.

Normally, when the database is <1G and not fragmented, everything is very fast, but when the DB is >3.5G all operations can be very slow due to the 5M I/O size issue. See more details under Additional context below.

To Reproduce

We have not yet been able to reproduce this from a fresh database, but we do have a database where the problem happens. The steps below are therefore, for now, our best guess at how it could be reproduced.

Steps to reproduce the behavior:

  1. Run Vault 1.6.0 with raft storage
  2. Run millions of approle logins over a long period of months (a load-generator sketch follows this list)
  3. Monitor the database for fragmentation and for I/O levels, in particular whether it is writing small or large block sizes.
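
For illustration, here is a minimal sketch of the kind of load generator that could drive step 2. It assumes an approle role whose role_id/secret_id are reusable and whose tokens are simply never revoked, so they pile up until the background lease expiration removes them; the credentials below are placeholders, not values from our environment:

# placeholder credentials for a hypothetical batch-processing role
ROLE_ID="00000000-0000-0000-0000-000000000000"
SECRET_ID="00000000-0000-0000-0000-000000000000"

# log in over and over; the returned token is thrown away and never
# revoked, so expiry is left to the token/lease expiration machinery
while true; do
  curl -s --request POST \
    --data "{\"role_id\": \"$ROLE_ID\", \"secret_id\": \"$SECRET_ID\"}" \
    https://vault.prod.mydomain.com/v1/auth/approle/login > /dev/null
done

Run enough copies of this in parallel to get into the millions-of-logins-per-day range.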

Expected behavior
Expected behaviour is that Vault continues to have low latency and writes modest amounts of data to disk.

Environment:

  • Vault Server Version: v1.6.0
  • Vault CLI Version: v1.6.0, plus many other clients (curl, python hvac, etc.)
  • Server Operating System/Architecture: centos7
  • Vault auditing is disabled, as auditing to files causes even more blockages and outages.

Vault server configuration file(s):

cluster_name      = "vault"
log_level         = "trace"
disable_mlock     = true

api_addr          = "https://vault.prod.mydomain.com/"
cluster_addr      = "https://1.vault.prod.mydomain.com:8201"
default_lease_ttl = "24h"
max_lease_ttl     = "168h"

plugin_directory = "/opt/vault/plugins"
tls_require_and_verify_client_cert = "false"
ui = true

listener "tcp" {
  address = "10.1.1.1:8200"
  tls_disable  = "false"
  tls_disable_client_certs = "true"
  tls_cert_file ="/etc/vault.d/tls/vault.crt"
  tls_key_file ="/etc/vault.d/tls/vault.key"
  telemetry {
    unauthenticated_metrics_access = true
  }
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable  = "true"
}

storage "raft" {
  path = "/opt/vault/raft"
  node_id = "1.vault.prod.mydomain.com"

    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://1.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://2.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://3.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://4.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://5.vault.prod.mydomain.com:8200"
    }
}

telemetry {
}

Workarounds

Two possible workarounds, both very simple:

  • stop Vault, wipe its local raft data, and restart + unseal; this copies the DB back from the leader, and the resulting DB is very small and fast.
  • stop Vault, run bbolt compact on vault.db, and restart + unseal; the DB will be small and fast (a step-by-step sketch follows this list).
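
As a concrete sketch of the second workaround on a single node (assuming the raft path from the config above, and that Vault is managed by a systemd unit named vault; adjust for however Vault is actually run):

$ systemctl stop vault
$ bbolt compact -o /opt/vault/raft/vault.db.compacted /opt/vault/raft/vault.db
$ mv /opt/vault/raft/vault.db.compacted /opt/vault/raft/vault.db
$ systemctl start vault
$ vault operator unseal

bbolt compact writes a compacted copy of the database to the -o path, which then replaces the original before the node is started and unsealed again.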

Additional context

Vault opens 2 files:

  • FD 8: /opt/vault/raft/vault.db, about 3.6GB; this is the boltdb with all Vault data
  • FD 9: /opt/vault/raft/raft/raft.db, about 100MB; raft transactions only
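
The FD-to-file mapping can be confirmed on a running node with, for example:

$ ls -l /proc/$(pidof vault)/fd

where FDs 8 and 9 show up as symlinks to the two .db files above.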

During the pathological behaviour we see ~250MB/s of writes; the I/O profile of Vault is drastically different in the two scenarios.

When it is not OK, it keeps writing lots of 5MB pwrite64s to vault.db, instead of the typical 16k-24k pwrite64s.

c8 is the counter of writes to FD 8 and c9 the counter for FD 9; the keys in the maps below are write sizes in bytes.

normal I/O, around 30MB/s:

len c8: 15295
{'28672': 0.0, '4096': 90.0, '8192': 0.0, '24576': 8.0}
len c9: 6948
{'4096': 85.0, '8192': 0.0, '126976': 4.0, '122880': 9.0}

high I/O pathological case, around 250MB/s:

len c8: 2358
{'4096': 85.0, '8192': 1.0, '5120000': 14.0}
len c9: 1245
{'4096': 84.0, '122880': 15.0, '8192': 0.0}
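
For reference, write-size distributions like these can be gathered by tracing pwrite64 on the running process, for example (a sketch, not necessarily the exact tooling used for the numbers above):

$ strace -f -y -s 0 -e trace=pwrite64 -p $(pidof vault)

With -y each call shows which file the FD points at, and the third argument of each pwrite64 is the write size in bytes, so calls against vault.db and raft.db can be bucketed into histograms like the ones above.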
@write0nly
Author

Before and after compaction.

In the before case we are in the 300MB/s ballpark all the time, and after the compaction it goes to 30-40MB/s:

$ sdiff <(bbolt stats compacted.vault.db) <(bbolt stats broken.vault.db)
Aggregate statistics for 2 buckets                              Aggregate statistics for 2 buckets

Page count statistics                                           Page count statistics
        Number of logical branch pages: 5416                  |         Number of logical branch pages: 15244
        Number of physical branch overflow pages: 0                     Number of physical branch overflow pages: 0
        Number of logical leaf pages: 183079                  |         Number of logical leaf pages: 246863
        Number of physical leaf overflow pages: 78            |         Number of physical leaf overflow pages: 22457
Tree statistics                                                 Tree statistics
        Number of keys/value pairs: 1002185                             Number of keys/value pairs: 1002185
        Number of levels in B+tree: 6                         |         Number of levels in B+tree: 10
Page size utilization                                           Page size utilization
        Bytes allocated for physical branch pages: 22183936   |         Bytes allocated for physical branch pages: 62439424
        Bytes actually used for branch data: 21124113 (95%)   |         Bytes actually used for branch data: 30738714 (49%)
        Bytes allocated for physical leaf pages: 750211072    |         Bytes allocated for physical leaf pages: 1103134720
        Bytes actually used for leaf data: 711603722 (94%)    |         Bytes actually used for leaf data: 712624266 (64%)
Bucket statistics                                               Bucket statistics
        Total number of buckets: 2                                      Total number of buckets: 2
        Total number on inlined buckets: 1 (50%)                        Total number on inlined buckets: 1 (50%)
        Bytes used for inlined buckets: 313 (0%)                        Bytes used for inlined buckets: 313 (0%)
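
Rather than a full sdiff against a compacted copy, the two "actually used" lines alone are a reasonable signal to watch on a snapshot copy of the file, e.g.:

$ bbolt stats broken.vault.db | grep 'actually used'

On a healthy DB the branch/leaf utilization sits around 94-95%, and it drops toward the 49-64% range as the file bloats.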

@HridoyRoy added the bug, core/raft, and storage/raft labels on Mar 18, 2021
@write0nly
Author

For the record, with a sustained rate of approle logins on the order of 5M a day, the database starts experiencing degradation after 20 days or so, going from 20MB/s to 100MB/s:

$ sdiff <(bbolt stats compacted.vault.db ) <(bbolt stats bloatdb.vault.db )
Aggregate statistics for 2 buckets                              Aggregate statistics for 2 buckets

Page count statistics                                           Page count statistics
        Number of logical branch pages: 6971                  |         Number of logical branch pages: 19125
        Number of physical branch overflow pages: 0                     Number of physical branch overflow pages: 0
        Number of logical leaf pages: 239454                  |         Number of logical leaf pages: 326409
        Number of physical leaf overflow pages: 85            |         Number of physical leaf overflow pages: 25901
Tree statistics                                                 Tree statistics
        Number of keys/value pairs: 1296396                             Number of keys/value pairs: 1296396
        Number of levels in B+tree: 6                         |         Number of levels in B+tree: 10
Page size utilization                                           Page size utilization
        Bytes allocated for physical branch pages: 28553216   |         Bytes allocated for physical branch pages: 78336000
        Bytes actually used for branch data: 27232903 (95%)   |         Bytes actually used for branch data: 39431388 (50%)
        Bytes allocated for physical leaf pages: 981151744    |         Bytes allocated for physical leaf pages: 1443061760
        Bytes actually used for leaf data: 927847641 (94%)    |         Bytes actually used for leaf data: 929238921 (64%)
Bucket statistics                                               Bucket statistics
        Total number of buckets: 2                                      Total number of buckets: 2
        Total number on inlined buckets: 1 (50%)                        Total number on inlined buckets: 1 (50%)
        Bytes used for inlined buckets: 315 (0%)                        Bytes used for inlined buckets: 315 (0%)

It's worth noting that the actual number of secrets does not grow much in 20 days; the churn is mainly caused by token creation and expiry.

@write0nly
Author

@ncabatoff Do you think that this issue could be related to GH-11377?

@raskchanky
Contributor

@write0nly Based on some internal investigations we've been doing, our current hypothesis is that this behavior is related to the BoltDB freelist and how it behaves as the size of the BoltDB file grows. I just merged #11895, which will be available in the next major release of Vault (1.8). I'm hopeful that these changes improve the problems you're seeing.

@write0nly
Author

@raskchanky Thank you very much! This is really great. If you have any beta builds, with or without extra debugging, please send me a link and I'll try them; otherwise I'll patch it and compile tomorrow.

Thank you :-)

@ncabatoff
Contributor

@write0nly, based on @raskchanky's 1.8 work I'm going to treat this issue as resolved. Feel free to open a new ticket if I'm mistaken.
