
vault performance degradation due to fragmented bbolt db when using raft backend #11072

Closed
write0nly opened this issue Mar 10, 2021 · 6 comments
Labels: bug, storage/raft

Comments

@write0nly

Describe the bug

Vault environment using 5 nodes with the Vault 1.6.0 raft backend. There are about 1000 approle role_ids and, unfortunately, apps that churn through more than 5 million approle logins a day due to batch processing, leaving the background token/lease expiration to clean the tokens up in bulk.

Over time the Vault boltdb becomes more and more fragmented, to the point where the fragmentation makes boltdb behave in a completely different way than expected: its normal I/O write sizes change from ~16k to ~5M. When that happens Vault goes from doing 40MB/s of writes to about 300MB/s, starts blocking on I/O, and a number of operations time out.

Normally, when the database is <1G and not fragmented, everything is very fast, but when the DB is >3.5G all operations can be very slow due to the 5M I/O size issue. See more details under Additional context below.

To Reproduce

We have not yet been able to reproduce this from a fresh database, but we do have a database where the problem happens. The steps below are therefore, for now, our best guess at how it could be reproduced.

Steps to reproduce the behavior:

  1. Run Vault 1.6.0 with raft storage
  2. Run millions of approle logins over a long period of months (a load-generator sketch follows this list)
  3. Monitor the database for fragmentation and for I/O levels, in particular whether it is writing small or large block sizes.
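
For illustration, here is a minimal sketch of the kind of load generator that could drive step 2. It assumes an approle role whose role_id/secret_id are reusable and whose tokens are simply never revoked, so they pile up until the background lease expiration removes them; the credentials below are placeholders, not values from our environment:

# placeholder credentials for a hypothetical batch-processing role
ROLE_ID="00000000-0000-0000-0000-000000000000"
SECRET_ID="00000000-0000-0000-0000-000000000000"

# log in over and over; the returned token is thrown away and never
# revoked, so expiry is left to the token/lease expiration machinery
while true; do
  curl -s --request POST \
    --data "{\"role_id\": \"$ROLE_ID\", \"secret_id\": \"$SECRET_ID\"}" \
    https://vault.prod.mydomain.com/v1/auth/approle/login > /dev/null
done

Run enough copies of this in parallel to get into the millions-of-logins-per-day range.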

Expected behavior
Expected behaviour is that Vault continues to have low latency and writes modest amounts of data to disk.

Environment:

  • Vault Server Version: v1.6.0
  • Vault CLI Version: v1.6.0, plus many other clients (curl, python hvac, etc.)
  • Server Operating System/Architecture: centos7
  • Vault auditing is disabled, as auditing to files causes even more blockages and outages.

Vault server configuration file(s):

cluster_name      = "vault"
log_level         = "trace"
disable_mlock     = true

api_addr          = "https://vault.prod.mydomain.com/"
cluster_addr      = "https://1.vault.prod.mydomain.com:8201"
default_lease_ttl = "24h"
max_lease_ttl     = "168h"

plugin_directory = "/opt/vault/plugins"
tls_require_and_verify_client_cert = "false"
ui = true

listener "tcp" {
  address = "10.1.1.1:8200"
  tls_disable  = "false"
  tls_disable_client_certs = "true"
  tls_cert_file ="/etc/vault.d/tls/vault.crt"
  tls_key_file ="/etc/vault.d/tls/vault.key"
  telemetry {
    unauthenticated_metrics_access = true
  }
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable  = "true"
}

storage "raft" {
  path = "/opt/vault/raft"
  node_id = "1.vault.prod.mydomain.com"

    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://1.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://2.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://3.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://4.vault.prod.mydomain.com:8200"
    }
    retry_join {
        # must have full protocol in the string
        leader_api_addr = "https://5.vault.prod.mydomain.com:8200"
    }
}

telemetry {
}

Workarounds

Two possible workarounds, both very simple:

  • stop Vault, wipe its local raft data, and restart + unseal; this copies the DB back from the leader, and the resulting DB is very small and fast.
  • stop Vault, run bbolt compact on vault.db, and restart + unseal; the DB will be small and fast (a step-by-step sketch follows this list).
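
As a concrete sketch of the second workaround on a single node (assuming the raft path from the config above, and that Vault is managed by a systemd unit named vault; adjust for however Vault is actually run):

$ systemctl stop vault
$ bbolt compact -o /opt/vault/raft/vault.db.compacted /opt/vault/raft/vault.db
$ mv /opt/vault/raft/vault.db.compacted /opt/vault/raft/vault.db
$ systemctl start vault
$ vault operator unseal

bbolt compact writes a compacted copy of the database to the -o path, which then replaces the original before the node is started and unsealed again.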

Additional context

Vault opens 2 files:

  • FD 8: /opt/vault/raft/vault.db, about 3.6GB; this is the boltdb with all Vault data
  • FD 9: /opt/vault/raft/raft/raft.db, about 100MB; raft transactions only
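
The FD-to-file mapping can be confirmed on a running node with, for example:

$ ls -l /proc/$(pidof vault)/fd

where FDs 8 and 9 show up as symlinks to the two .db files above.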

During the pathological behaviour we see ~250MB/s of writes; the I/O profile of Vault is drastically different in the two scenarios.

When it is not OK, it keeps writing lots of 5MB pwrite64s to vault.db, instead of the typical 16k-24k pwrite64s.

c8 is the counter of writes to FD 8 and c9 the counter for FD 9; the keys in the maps below are write sizes in bytes.

normal I/O, around 30MB/s:

len c8: 15295
{'28672': 0.0, '4096': 90.0, '8192': 0.0, '24576': 8.0}
len c9: 6948
{'4096': 85.0, '8192': 0.0, '126976': 4.0, '122880': 9.0}

high I/O pathological case, around 250MB/s:

len c8: 2358
{'4096': 85.0, '8192': 1.0, '5120000': 14.0}
len c9: 1245
{'4096': 84.0, '122880': 15.0, '8192': 0.0}
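
For reference, write-size distributions like these can be gathered by tracing pwrite64 on the running process, for example (a sketch, not necessarily the exact tooling used for the numbers above):

$ strace -f -y -s 0 -e trace=pwrite64 -p $(pidof vault)

With -y each call shows which file the FD points at, and the third argument of each pwrite64 is the write size in bytes, so calls against vault.db and raft.db can be bucketed into histograms like the ones above.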
@write0nly
Author

Before and after compaction.

In the before case we are in the 300MB/s ballpark all the time, and after the compaction it goes to 30-40MB/s:

$ sdiff <(bbolt stats compacted.vault.db) <(bbolt stats broken.vault.db)
Aggregate statistics for 2 buckets                              Aggregate statistics for 2 buckets

Page count statistics                                           Page count statistics
        Number of logical branch pages: 5416                  |         Number of logical branch pages: 15244
        Number of physical branch overflow pages: 0                     Number of physical branch overflow pages: 0
        Number of logical leaf pages: 183079                  |         Number of logical leaf pages: 246863
        Number of physical leaf overflow pages: 78            |         Number of physical leaf overflow pages: 22457
Tree statistics                                                 Tree statistics
        Number of keys/value pairs: 1002185                             Number of keys/value pairs: 1002185
        Number of levels in B+tree: 6                         |         Number of levels in B+tree: 10
Page size utilization                                           Page size utilization
        Bytes allocated for physical branch pages: 22183936   |         Bytes allocated for physical branch pages: 62439424
        Bytes actually used for branch data: 21124113 (95%)   |         Bytes actually used for branch data: 30738714 (49%)
        Bytes allocated for physical leaf pages: 750211072    |         Bytes allocated for physical leaf pages: 1103134720
        Bytes actually used for leaf data: 711603722 (94%)    |         Bytes actually used for leaf data: 712624266 (64%)
Bucket statistics                                               Bucket statistics
        Total number of buckets: 2                                      Total number of buckets: 2
        Total number on inlined buckets: 1 (50%)                        Total number on inlined buckets: 1 (50%)
        Bytes used for inlined buckets: 313 (0%)                        Bytes used for inlined buckets: 313 (0%)
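
Rather than a full sdiff against a compacted copy, the two "actually used" lines alone are a reasonable signal to watch on a snapshot copy of the file, e.g.:

$ bbolt stats broken.vault.db | grep 'actually used'

On a healthy DB the branch/leaf utilization sits around 94-95%, and it drops toward the 49-64% range as the file bloats.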

@HridoyRoy added the bug, core/raft, and storage/raft labels on Mar 18, 2021
@write0nly
Author

For the record, with a sustained rate of approle logins on the order of 5M a day, the database starts experiencing degradation after 20 days or so, going from 20MB/s to 100MB/s:

$ sdiff <(bbolt stats compacted.vault.db ) <(bbolt stats bloatdb.vault.db )
Aggregate statistics for 2 buckets                              Aggregate statistics for 2 buckets

Page count statistics                                           Page count statistics
        Number of logical branch pages: 6971                  |         Number of logical branch pages: 19125
        Number of physical branch overflow pages: 0                     Number of physical branch overflow pages: 0
        Number of logical leaf pages: 239454                  |         Number of logical leaf pages: 326409
        Number of physical leaf overflow pages: 85            |         Number of physical leaf overflow pages: 25901
Tree statistics                                                 Tree statistics
        Number of keys/value pairs: 1296396                             Number of keys/value pairs: 1296396
        Number of levels in B+tree: 6                         |         Number of levels in B+tree: 10
Page size utilization                                           Page size utilization
        Bytes allocated for physical branch pages: 28553216   |         Bytes allocated for physical branch pages: 78336000
        Bytes actually used for branch data: 27232903 (95%)   |         Bytes actually used for branch data: 39431388 (50%)
        Bytes allocated for physical leaf pages: 981151744    |         Bytes allocated for physical leaf pages: 1443061760
        Bytes actually used for leaf data: 927847641 (94%)    |         Bytes actually used for leaf data: 929238921 (64%)
Bucket statistics                                               Bucket statistics
        Total number of buckets: 2                                      Total number of buckets: 2
        Total number on inlined buckets: 1 (50%)                        Total number on inlined buckets: 1 (50%)
        Bytes used for inlined buckets: 315 (0%)                        Bytes used for inlined buckets: 315 (0%)

It's worth noting that the actual number of secrets does not grow much in 20 days; the churn is mainly caused by token creation and expiry.

@write0nly
Author

@ncabatoff Do you think that this issue could be related to GH-11377?

@raskchanky
Contributor

@write0nly Based on some internal investigations we've been doing, our current hypothesis is that this behavior is related to the BoltDB freelist and how it behaves as the size of the BoltDB file grows. I just merged #11895, which will be available in the next major release of Vault (1.8). I'm hopeful that these changes improve the problems you're seeing.

@write0nly
Author

@raskchanky Thank you very much! This is really great. If you have any beta builds, with or without extra debugging, please send me a link and I'll try them; otherwise I'll patch it and compile tomorrow.

Thank you :-)

@ncabatoff
Contributor

@write0nly, based on @raskchanky's 1.8 work I'm going to treat this issue as resolved. Feel free to open a new ticket if I'm mistaken.
