layout: docs
page_title: Telemetry
description: Learn about the telemetry data available in Vault.

Telemetry

The Vault server process collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and are retained in memory for one minute. To monitor Vault and collect durable metrics, telemetry from Vault must be stored in metrics aggregation software.
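The retention behavior described above can be pictured with a small model. This is only an illustrative sketch, not Vault's implementation: the `IntervalBuffer` class and its methods are hypothetical names, and it assumes samples arrive in time order. It shows why anything older than one minute is gone unless an external aggregator has already stored it.

```python
from collections import deque

INTERVAL_SECONDS = 10   # aggregation interval stated in the text
RETENTION_SECONDS = 60  # in-memory retention stated in the text


class IntervalBuffer:
    """Hypothetical model: keeps only the most recent intervals in memory."""

    def __init__(self):
        # Six 10-second intervals cover one minute; older ones are dropped.
        self._intervals = deque(maxlen=RETENTION_SECONDS // INTERVAL_SECONDS)

    def record(self, unix_time, name, value):
        # Align the sample to the start of its 10-second interval.
        start = int(unix_time) - int(unix_time) % INTERVAL_SECONDS
        if not self._intervals or self._intervals[-1][0] != start:
            self._intervals.append((start, {}))
        self._intervals[-1][1][name] = value

    def starts(self):
        """Return the start times of the intervals still retained."""
        return [start for start, _ in self._intervals]
```

Recording one sample every ten seconds for eighty seconds leaves only the six most recent intervals in the buffer.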

To view the raw data, you must send a signal to the Vault process: on Unix-style operating systems, this is USR1 while on Windows it is BREAK. When the Vault process receives this signal it will dump the current telemetry information to the process's stderr.
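On Unix-style systems, sending the signal can be scripted. The following is a minimal sketch; the function name `request_telemetry_dump` is hypothetical, and finding the Vault process ID is left to the caller. Note that USR1 terminates processes that do not handle it, so the PID should be checked before sending.

```python
import os
import signal


def request_telemetry_dump(pid: int) -> None:
    """Send USR1 to the given process (Unix-style systems only).

    The caller is responsible for passing the PID of the Vault server;
    the dump itself appears on that process's stderr, not here.
    """
    os.kill(pid, signal.SIGUSR1)
```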

This telemetry information can be used for debugging or otherwise getting a better view of what Vault is doing.

Telemetry information can also be streamed directly from Vault to a range of metrics aggregation solutions, as described in the telemetry stanza documentation.

The following is an example telemetry dump snippet:

[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.expire.num_leases': 5100.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.num_goroutines': 39.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.sys_bytes': 222746880.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.malloc_count': 109189192.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.free_count': 108408240.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.heap_objects': 780953.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_runs': 232.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.alloc_bytes': 72954392.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_pause_ns': 150293024.000
[2017-12-19 20:37:50 +0000 UTC][S] 'vault.merkle.flushDirty': Count: 100 Min: 0.008 Mean: 0.027 Max: 0.183 Stddev: 0.024 Sum: 2.681 LastUpdated: 2017-12-19 20:37:59.848733035 +0000 UTC m=+10463.692105920
[2017-12-19 20:37:50 +0000 UTC][S] 'vault.merkle.saveCheckpoint': Count: 4 Min: 0.021 Mean: 0.054 Max: 0.110 Stddev: 0.039 Sum: 0.217 LastUpdated: 2017-12-19 20:37:57.048458148 +0000 UTC m=+10460.891835029
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.alloc_bytes': 73326136.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.sys_bytes': 222746880.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.malloc_count': 109195904.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.free_count': 108409568.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.heap_objects': 786342.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_pause_ns': 150293024.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.expire.num_leases': 5100.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.num_goroutines': 39.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_runs': 232.000
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.route.rollback.consul-': Count: 1 Sum: 0.013 LastUpdated: 2017-12-19 20:38:01.968471579 +0000 UTC m=+10465.811842067
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.consul-': Count: 1 Sum: 0.073 LastUpdated: 2017-12-19 20:38:01.968502743 +0000 UTC m=+10465.811873131
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.pki-': Count: 1 Sum: 0.070 LastUpdated: 2017-12-19 20:38:01.96867005 +0000 UTC m=+10465.812041936
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.route.rollback.auth-app-id-': Count: 1 Sum: 0.012 LastUpdated: 2017-12-19 20:38:01.969146401 +0000 UTC m=+10465.812516689
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.identity-': Count: 1 Sum: 0.063 LastUpdated: 2017-12-19 20:38:01.968029888 +0000 UTC m=+10465.811400276
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.database-': Count: 1 Sum: 0.066 LastUpdated: 2017-12-19 20:38:01.969394215 +0000 UTC m=+10465.812764603
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.barrier.get': Count: 16 Min: 0.010 Mean: 0.015 Max: 0.031 Stddev: 0.005 Sum: 0.237 LastUpdated: 2017-12-19 20:38:01.983268118 +0000 UTC m=+10465.826637008
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.merkle.flushDirty': Count: 100 Min: 0.006 Mean: 0.024 Max: 0.098 Stddev: 0.019 Sum: 2.386 LastUpdated: 2017-12-19 20:38:09.848158309 +0000 UTC m=+10473.691527099

You'll note that log entries are prefixed with the metric type as follows:

  • [C] is a counter. Counters are cumulative metrics that are incremented when some event occurs, and are reset at the end of reporting intervals. Vault retains counters and other metrics for one minute in-memory, so to see accurate and persistent counters over time an aggregation solution must be configured.
  • [G] is a gauge. Gauges provide measurements of current values.
  • [S] is a summary. Summaries provide sample observations of values. Vault commonly uses summaries for measuring timing duration of discrete events in the reporting interval.
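For quick inspection of a captured dump, the prefixed lines can be parsed with a regular expression. The sketch below is based only on the line layout visible in the example dump above; `parse_dump_line` is a hypothetical helper, and it assumes counter lines use the same `Key: value` layout as summary lines, which the example does not show.

```python
import re

# Layout taken from the example dump: [timestamp][TYPE] 'name': payload
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<kind>[CGS])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
FIELD_RE = re.compile(r"(Count|Min|Mean|Max|Stddev|Sum): ([-+\d.eE]+)")
KINDS = {"C": "counter", "G": "gauge", "S": "summary"}


def parse_dump_line(line):
    """Parse one telemetry dump line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    kind = KINDS[match.group("kind")]
    rest = match.group("rest")
    if kind == "gauge":
        fields = {"value": float(rest)}
    else:
        # Assumption: counters share the summary layout. LastUpdated is skipped.
        fields = {key.lower(): float(val) for key, val in FIELD_RE.findall(rest)}
    return {
        "timestamp": match.group("ts"),
        "type": kind,
        "name": match.group("name"),
        "fields": fields,
    }
```

Passing a gauge line from the example yields a single `value` field, while a summary line yields whichever of `count`, `min`, `mean`, `max`, `stddev`, and `sum` are present.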

The following sections describe available Vault metrics. The metrics interval can be assumed to be 10 seconds when manually triggering metrics output using the signals described above. Some high-cardinality gauges, like vault.secret.kv.count, are emitted every 10 minutes, or at an interval configured in the telemetry stanza.

Some Vault metrics come with additional labels describing the measurement in more detail, such as the namespace in which an operation takes place, or the auth method used to create a token. In the in-memory telemetry, or other telemetry engines that do not support labels, this additional information is incorporated into the metric name. In the tables below, the metric name is followed by a list of supported labels, in the order in which they appear if flattened.

Audit Metrics

These metrics relate to auditing.

Metric Description Unit Type
vault.audit.log_request Duration of time taken by all audit log requests across all audit log devices ms summary
vault.audit.log_response Duration of time taken by audit log responses across all audit log devices ms summary
vault.audit.log_request_failure Number of audit log request failures. NOTE: This is a particularly important metric. Any non-zero value here indicates that there was a failure to make an audit log request to any of the configured audit log devices; when Vault cannot log to any of the configured audit log devices it ceases all user operations, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. failures counter
vault.audit.log_response_failure Number of audit log response failures. NOTE: This is a particularly important metric. Any non-zero value here indicates that there was a failure to receive a response to a request made to one of the configured audit log devices; when Vault cannot log to any of the configured audit log devices it ceases all user operations, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. failures counter

NOTE: In addition, there are audit metrics for each enabled audit device, represented as vault.audit.<type>.log_request and vault.audit.<type>.log_response. For example, if a file audit device is enabled, its metrics would be vault.audit.file.log_request and vault.audit.file.log_response.

Core Metrics

These metrics represent operational aspects of the running Vault instance.

Metric Description Unit Type
vault.barrier.delete Duration of time taken by DELETE operations at the barrier ms summary
vault.barrier.get Duration of time taken by GET operations at the barrier ms summary
vault.barrier.put Duration of time taken by PUT operations at the barrier ms summary
vault.barrier.list Duration of time taken by LIST operations at the barrier ms summary
vault.cache.hit Number of times a value was retrieved from the LRU cache. cache hit counter
vault.cache.miss Number of times a value was not in the LRU cache. This results in a read from the configured storage. cache miss counter
vault.cache.write Number of times a value was written to the LRU cache. cache write counter
vault.cache.delete Number of times a value was deleted from the LRU cache. This does not count cache expirations. cache delete counter
vault.core.active Has value 1 when the Vault node is active, and 0 when the node is in standby. bool gauge
vault.core.activity.fragment_size Number of entities or tokens (depending on the "type" label) observed by the local node. tokens counter
vault.core.activity.segment_write Duration of time taken writing activity log segments to storage. ms summary
vault.core.check_token Duration of time taken by token checks handled by Vault core ms summary
vault.core.fetch_acl_and_token Duration of time taken by ACL and corresponding token entry fetches handled by Vault core ms summary
vault.core.handle_request Duration of time taken by requests handled by Vault core ms summary
vault.core.handle_login_request Duration of time taken by login requests handled by Vault core ms summary
vault.core.leadership_setup_failed Duration of time taken by cluster leadership setup failures which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. ms summary
vault.core.leadership_lost Duration of time taken by cluster leadership losses which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. ms summary
vault.core.license.expiration_time_epoch Time as epoch (seconds since Jan 1 1970) at which the license will expire. seconds gauge
vault.core.mount_table.num_entries Number of mounts in a particular mount table. This metric is labeled by table type (auth or logical) and whether or not the table is replicated (local or not) objects gauge
vault.core.mount_table.size Size of a particular mount table. This metric is labeled by table type (auth or logical) and whether or not the table is replicated (local or not) objects gauge
vault.core.post_unseal Duration of time taken by post-unseal operations handled by Vault core ms summary
vault.core.pre_seal Duration of time taken by pre-seal operations ms summary
vault.core.seal-with-request Duration of time taken by requested seal operations ms summary
vault.core.seal Duration of time taken by seal operations ms summary
vault.core.seal-internal Duration of time taken by internal seal operations ms summary
vault.core.step_down Duration of time taken by cluster leadership step downs. This should be monitored and alerted on for overall cluster leadership status. ms summary
vault.core.unseal Duration of time taken by unseal operations ms summary
vault.core.unsealed Has value 1 when Vault is unsealed, and 0 when Vault is sealed. bool gauge
vault.metrics.collection (cluster,gauge) Time taken to collect usage gauges, labeled by gauge type. summary
vault.metrics.collection.interval (cluster,gauge) Current value of usage gauge collection interval. summary
vault.metrics.collection.error (cluster,gauge) Errors while collecting usage gauges, labeled by gauge type. counter
vault.rollback.attempt.<mountpoint> Time taken to perform a rollback operation on the given mount point. The mount point name has its forward slashes / replaced by -. For example, a rollback operation on the auth/token backend would be reported as vault.rollback.attempt.auth-token-. ms summary
vault.route.create.<mountpoint> Time taken to dispatch a create operation to a backend, and for that backend to process it. The mount point name has its forward slashes / replaced by -. For example, a create operation to ns1/secret/ would have corresponding metric vault.route.create.ns1-secret-. The number of samples of this metric, and the corresponding ones for other operations below, indicates how many operations were performed per mount point. ms summary
vault.route.delete.<mountpoint> Time taken to dispatch a delete operation to a backend, and for that backend to process it. ms summary
vault.route.list.<mountpoint> Time taken to dispatch a list operation to a backend, and for that backend to process it. ms summary
vault.route.read.<mountpoint> Time taken to dispatch a read operation to a backend, and for that backend to process it. ms summary
vault.route.rollback.<mountpoint> Time taken to dispatch a rollback operation to a backend, and for that backend to process it. Rollback operations are automatically scheduled to clean up partial errors. ms summary
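The mount point naming rule used by the vault.rollback.attempt and vault.route metrics above can be expressed in one line. This is a small sketch for building dashboard queries; `mount_metric_name` is a hypothetical helper, and it assumes the mount point is passed with its trailing slash, as in the examples in the table.

```python
def mount_metric_name(prefix: str, mount_point: str) -> str:
    """Build a per-mount metric name by replacing each '/' in the mount point with '-'."""
    return f"{prefix}.{mount_point.replace('/', '-')}"


# Examples taken from the table above.
print(mount_metric_name("vault.rollback.attempt", "auth/token/"))
print(mount_metric_name("vault.route.create", "ns1/secret/"))
```

The trailing slash of the mount point is what produces the trailing - seen in names such as vault.rollback.attempt.pki- in the example dump.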

Runtime Metrics

These metrics collect information from Vault's Go runtime, such as memory usage information.

Metric Description Unit Type
vault.runtime.alloc_bytes Number of bytes allocated by the Vault process. This could burst from time to time, but should return to a steady state value. bytes gauge
vault.runtime.free_count Number of freed objects objects gauge
vault.runtime.heap_objects Number of objects on the heap. This is a good general memory pressure indicator worth establishing a baseline and thresholds for alerting. objects gauge
vault.runtime.malloc_count Cumulative count of allocated heap objects objects gauge
vault.runtime.num_goroutines Number of goroutines. This serves as a general system load indicator worth establishing a baseline and thresholds for alerting. goroutines gauge
vault.runtime.sys_bytes Number of bytes allocated to Vault. This includes what is being used by Vault's heap and what has been reclaimed but not given back to the operating system. bytes gauge
vault.runtime.total_gc_pause_ns The total garbage collector pause time since Vault was last started ns gauge
vault.runtime.gc_pause_ns Total duration of the last garbage collection run ns summary
vault.runtime.total_gc_runs Total number of garbage collection runs since Vault was last started operations gauge
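Because several runtime gauges are cumulative, derived values are often more useful for baselines than the raw numbers. The sketch below uses hypothetical helper names and the gauge values from the first interval of the example dump; it computes the mean garbage collection pause and the difference between allocated and freed objects, which in Go's runtime statistics corresponds to the number of live heap objects.

```python
def average_gc_pause_ns(total_gc_pause_ns: float, total_gc_runs: float) -> float:
    """Mean pause per garbage collection run since Vault started, in nanoseconds."""
    if total_gc_runs == 0:
        return 0.0
    return total_gc_pause_ns / total_gc_runs


def live_objects(malloc_count: float, free_count: float) -> float:
    """Objects allocated but not yet freed."""
    return malloc_count - free_count


# Values copied from the 20:37:50 interval of the example dump.
sample = {
    "total_gc_pause_ns": 150293024.0,
    "total_gc_runs": 232.0,
    "malloc_count": 109189192.0,
    "free_count": 108408240.0,
}

print(average_gc_pause_ns(sample["total_gc_pause_ns"], sample["total_gc_runs"]))
print(live_objects(sample["malloc_count"], sample["free_count"]))
```

For this sample the mean pause is roughly 0.65 ms, and the live object count of 780952 is within one of the reported vault.runtime.heap_objects gauge.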

Policy Metrics

These metrics report measurements of the time spent performing policy operations.

Metric Description Unit Type
vault.policy.get_policy Time taken to get a policy ms summary
vault.policy.list_policies Time taken to list policies ms summary
vault.policy.delete_policy Time taken to delete a policy ms summary
vault.policy.set_policy Time taken to set a policy ms summary

Token, Identity, and Lease Metrics

These metrics cover measurement of token, identity, and lease operations, and counts of the number of such objects managed by Vault.

Metric Description Unit Type
vault.expire.fetch-lease-times Time taken to fetch lease times ms summary
vault.expire.fetch-lease-times-by-token Time taken to fetch lease times by token ms summary
vault.expire.num_leases Number of all leases which are eligible for eventual expiry leases gauge
vault.expire.num_irrevocable_leases Number of leases that cannot be revoked automatically leases gauge
vault.expire.leases.by_expiration (cluster,gauge,expiring,namespace) Number of leases set to expire, grouped by a time interval. This time interval and the total number of time intervals are configurable via lease_metrics_epsilon and num_lease_metrics_buckets in the telemetry stanza of a Vault server configuration. The default values for these are 1hr and 168 respectively, so the metric will report the number of leases that will expire each hour from the current time to a week from the current time. Lease expiration can additionally be grouped by namespace by setting add_lease_metrics_namespace_labels to true in the config file (default is false). leases gauge
vault.expire.lease_expiration Count of lease expirations leases counter
vault.expire.job_manager.total_jobs Total pending revocation jobs leases summary
vault.expire.job_manager.queue_length Total pending revocation jobs by auth method leases summary
vault.expire.lease_expiration.time_in_queue Time taken for lease to get to the front of the revoke queue ms summary
vault.expire.lease_expiration.error Count of lease expiration errors errors counter
vault.expire.revoke Time taken to revoke a token ms summary
vault.expire.revoke-force Time taken to forcibly revoke a token ms summary
vault.expire.revoke-prefix Time taken to revoke tokens on a prefix ms summary
vault.expire.revoke-by-token Time taken to revoke all secrets issued with a given token ms summary
vault.expire.renew Time taken to renew a lease ms summary
vault.expire.renew-token Time taken to renew a token which does not need to invoke a logical backend ms summary
vault.expire.register Time taken for register operations ms summary
vault.expire.register-auth Time taken for register authentication operations which create lease entries without lease ID ms summary
vault.identity.num_entities Number of identity entities stored in Vault entities gauge
vault.identity.entity.active.monthly (cluster, namespace) Number of distinct entities that created a token during the past month, per namespace. Only available if client count is enabled. Reported at the start of each month. entities gauge
vault.identity.entity.active.partial_month (cluster) Total number of distinct entities that created a token during the current month. Only available if client count is enabled. Reported periodically within each month. entities gauge
vault.identity.entity.active.reporting_period (cluster, namespace) Number of distinct entities that created a token in the past N months, as defined by the client count default reporting period. Only available if client count is enabled. Reported at the start of each month. entities gauge
vault.identity.entity.alias.count (cluster, namespace, auth_method, mount_point) Number of identity entity aliases stored in Vault, grouped by the auth mount that created them. This gauge is computed every 10 minutes. aliases gauge
vault.identity.entity.count (cluster, namespace) Number of identity entities stored in Vault, grouped by namespace. entities gauge
vault.identity.entity.creation (cluster, namespace, auth_method, mount_point) Number of identity entities created, grouped by the auth mount that created them. entities counter
vault.identity.upsert_entity_txn Time taken to insert a new or modified entity into the in-memory database, and persist it to storage. ms summary
vault.identity.upsert_group_txn Time taken to insert a new or modified group into the in-memory database, and persist it to storage. This operation is performed on group membership changes. ms summary
vault.token.count (cluster, namespace) Number of service tokens available for use; counts all un-expired and un-revoked tokens in Vault's token store. This measurement is performed every 10 minutes. tokens gauge
vault.token.count.by_auth (cluster, namespace, auth_method) Number of service tokens that were created by a particular auth method. tokens gauge
vault.token.count.by_policy (cluster, namespace, policy) Number of service tokens that have a particular policy attached. If a token has more than one policy, it is counted in each policy gauge. tokens gauge
vault.token.count.by_ttl (cluster, namespace, creation_ttl) Number of service tokens, grouped by the TTL range they were assigned at creation. tokens gauge
vault.token.create The time taken to create a token ms summary
vault.token.create_root Number of created root tokens. Does not decrease on revocation. tokens counter
vault.token.createAccessor The time taken to create a token accessor ms summary
vault.token.creation (cluster, namespace, auth_method, mount_point, creation_ttl, token_type) Number of service or batch tokens created. tokens counter
vault.token.lookup The time taken to look up a token ms summary
vault.token.revoke Time taken to revoke a token ms summary
vault.token.revoke-tree Time taken to revoke a token tree ms summary
vault.token.store Time taken to store an updated token entry without writing to the secondary index ms summary
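The window covered by vault.expire.leases.by_expiration follows directly from the two settings named in its description. As a worked check, the sketch below multiplies the bucket width by the bucket count using the defaults stated above; `lease_metrics_window` is a hypothetical helper, and only the setting names and default values come from the table.

```python
from datetime import timedelta


def lease_metrics_window(
    lease_metrics_epsilon: timedelta = timedelta(hours=1),
    num_lease_metrics_buckets: int = 168,
) -> timedelta:
    """Total look-ahead covered by the expiration buckets (width times count)."""
    return lease_metrics_epsilon * num_lease_metrics_buckets


window = lease_metrics_window()
print(window.days)                    # whole days covered with the defaults
print(window.total_seconds() / 3600)  # the same span expressed in hours
```

With the defaults, 168 one-hour buckets span 7 days. Halving the bucket width without changing the bucket count would halve the look-ahead window.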

Resource Quota Metrics

These metrics relate to rate limit and lease count quotas. Each metric comes with a label "name" identifying the specific quota.

Metric Description Unit Type
vault.quota.rate_limit.violation Total number of rate limit quota violations quota counter
vault.quota.lease_count.violation Total number of lease count quota violations quota counter
vault.quota.lease_count.max Maximum number of leases allowed by the lease count quota leases gauge
vault.quota.lease_count.counter Current number of leases generated by the lease count quota leases gauge

Merkle Tree and Write Ahead Log Metrics

These metrics relate to internal operations on Merkle Trees and Write Ahead Logs (WAL).

Metric Description Unit Type
vault.merkle.flushDirty Time taken to flush any dirty pages to cold storage ms summary
vault.merkle.flushDirty.num_pages Number of pages flushed pages gauge
vault.merkle.saveCheckpoint Time taken to save the checkpoint ms summary
vault.merkle.saveCheckpoint.num_dirty Number of dirty pages at checkpoint pages gauge
vault.wal.deleteWALs Time taken to delete a Write Ahead Log (WAL) ms summary
vault.wal.gc.deleted Number of Write Ahead Logs (WAL) deleted during each garbage collection run WAL gauge
vault.wal.gc.total Total Number of Write Ahead Logs (WAL) on disk WAL gauge
vault.wal.loadWAL Time taken to load a Write Ahead Log (WAL) ms summary
vault.wal.persistWALs Time taken to persist a Write Ahead Log (WAL) ms summary
vault.wal.flushReady Time taken to flush a ready Write Ahead Log (WAL) to storage ms summary
vault.wal.flushReady.queue_len Size of the write queue in the WAL system WAL summary

Replication Metrics

These metrics relate to Vault Enterprise Replication. The following metrics are not available in telemetry unless replication is in an unhealthy state: vault.replication.fetchRemoteKeys, vault.replication.merkleDiff, and vault.replication.merkleSync.

Metric Description Unit Type
vault.core.replication.performance.primary Set to 1 if this is a performance primary, 0 if not boolean gauge
vault.core.replication.performance.secondary Set to 1 if this is a performance secondary, 0 if not boolean gauge
vault.core.replication.dr.primary Set to 1 if this is a DR primary, 0 if not boolean gauge
vault.core.replication.dr.secondary Set to 1 if this is a DR secondary, 0 if not boolean gauge
vault.core.performance_standby Set to 1 if this is a performance standby, 0 if not boolean gauge
vault.logshipper.streamWALs.missing_guard Number of times the starting Merkle Tree index used to begin streaming WAL entries was not matched/found missing guards counter
vault.logshipper.streamWALs.guard_found Number of times the starting Merkle Tree index used to begin streaming WAL entries was matched/found found guards counter
vault.logshipper.streamWALs.scanned_entries Number of entries scanned in the buffer before the right one was found. scanned entries summary
vault.logshipper.buffer.length Current length of the log shipper buffer buffer entries gauge
vault.logshipper.buffer.size Current size in bytes of the log shipper buffer bytes gauge
vault.logshipper.buffer.max_length Maximum length of the log shipper buffer buffer entries gauge
vault.logshipper.buffer.max_size Maximum size in bytes of the log shipper buffer bytes gauge
vault.replication.fetchRemoteKeys Time taken to fetch keys from a remote cluster participating in replication prior to Merkle Tree based delta generation ms summary
vault.replication.merkleDiff Time taken to perform a Merkle Tree based delta generation between the clusters participating in replication ms summary
vault.replication.merkleSync Time taken to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication ms summary
vault.replication.merkle.commit_index The last committed index in the Merkle Tree. sequence number gauge
vault.replication.wal.last_wal The index of the last WAL sequence number gauge
vault.replication.wal.last_dr_wal The index of the last DR WAL sequence number gauge
vault.replication.wal.last_performance_wal The index of the last Performance WAL sequence number gauge
vault.replication.fsm.last_remote_wal The index of the last remote WAL sequence number gauge
vault.replication.wal.gc Time taken to complete one run of the WAL garbage collection process ms summary
vault.replication.rpc.server.auth_request Duration of time taken by auth request ms summary
vault.replication.rpc.server.bootstrap_request Duration of time taken by bootstrap request ms summary
vault.replication.rpc.server.conflicting_pages_request Duration of time taken by conflicting pages request ms summary
vault.replication.rpc.server.echo Duration of time taken by echo ms summary
vault.replication.rpc.server.forwarding_request Duration of time taken by forwarding request ms summary
vault.replication.rpc.server.guard_hash_request Duration of time taken by guard hash request ms summary
vault.replication.rpc.server.persist_alias_request Duration of time taken by persist alias request ms summary
vault.replication.rpc.server.persist_persona_request Duration of time taken by persist persona request ms summary
vault.replication.rpc.server.stream_wals_request Duration of time taken by stream wals request ms summary
vault.replication.rpc.server.sub_page_hashes_request Duration of time taken by sub page hashes request ms summary
vault.replication.rpc.server.sync_counter_request Duration of time taken by sync counter request ms summary
vault.replication.rpc.server.upsert_group_request Duration of time taken by upsert group request ms summary
vault.replication.rpc.client.conflicting_pages Duration of time taken by client conflicting pages request ms summary
vault.replication.rpc.client.fetch_keys Duration of time taken by client fetch keys request ms summary
vault.replication.rpc.client.forward Duration of time taken by client forward request ms summary
vault.replication.rpc.client.guard_hash Duration of time taken by client guard hash request ms summary
vault.replication.rpc.client.persist_alias Duration of time taken by client persist alias request ms summary
vault.replication.rpc.client.register_auth Duration of time taken by client register auth request ms summary
vault.replication.rpc.client.register_lease Duration of time taken by client register lease request ms summary
vault.replication.rpc.client.stream_wals Duration of time taken by client stream wals request ms summary
vault.replication.rpc.client.sub_page_hashes Duration of time taken by client sub page hashes request ms summary
vault.replication.rpc.client.sync_counter Duration of time taken by client sync counter request ms summary
vault.replication.rpc.client.upsert_group Duration of time taken by client upsert group request ms summary
vault.replication.rpc.client.wrap_in_cubbyhole Duration of time taken by client wrap in cubbyhole request ms summary
vault.replication.rpc.dr.server.echo Duration of time taken by DR echo request ms summary
vault.replication.rpc.dr.server.fetch_keys_request Duration of time taken by DR fetch keys request ms summary
vault.replication.rpc.standby.server.echo Duration of time taken by standby echo request ms summary
vault.replication.rpc.standby.server.register_auth_request Duration of time taken by standby register auth request ms summary
vault.replication.rpc.standby.server.register_lease_request Duration of time taken by standby register lease request ms summary
vault.replication.rpc.standby.server.wrap_token_request Duration of time taken by standby wrap token request ms summary

Secrets Engines Metrics

These metrics relate to the supported secrets engines.

Metric Description Unit Type
database.Initialize Time taken to initialize a database secret engine across all database secrets engines ms summary
database.<name>.Initialize Time taken to initialize a database secret engine for the named database secrets engine <name>, for example: database.postgresql-prod.Initialize ms summary
database.Initialize.error Number of database secrets engine initialization operation errors across all database secrets engines errors counter
database.<name>.Initialize.error Number of database secrets engine initialization operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.Initialize.error errors counter
database.Close Time taken to close a database secret engine across all database secrets engines ms summary
database.<name>.Close Time taken to close a database secret engine for the named database secrets engine <name>, for example: database.postgresql-prod.Close ms summary
database.Close.error Number of database secrets engine close operation errors across all database secrets engines errors counter
database.<name>.Close.error Number of database secrets engine close operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.Close.error errors counter
database.CreateUser Time taken to create a user across all database secrets engines ms summary
database.<name>.CreateUser Time taken to create a user for the named database secrets engine <name> ms summary
database.CreateUser.error Number of user creation operation errors across all database secrets engines errors counter
database.<name>.CreateUser.error Number of user creation operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.CreateUser.error errors counter
database.RenewUser Time taken to renew a user across all database secrets engines ms summary
database.<name>.RenewUser Time taken to renew a user for the named database secrets engine <name>, for example: database.postgresql-prod.RenewUser ms summary
database.RenewUser.error Number of user renewal operation errors across all database secrets engines errors counter
database.<name>.RenewUser.error Number of user renewal operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.RenewUser.error errors counter
database.RevokeUser Time taken to revoke a user across all database secrets engines ms summary
database.<name>.RevokeUser Time taken to revoke a user for the named database secrets engine <name>, for example: database.postgresql-prod.RevokeUser ms summary
database.RevokeUser.error Number of user revocation operation errors across all database secrets engines errors counter
database.<name>.RevokeUser.error Number of user revocation operation errors for the named database secrets engine <name>, for example: database.postgresql-prod.RevokeUser.error errors counter
vault.secret.kv.count (cluster, namespace, mount_point) Number of entries in each key-value secret engine. paths gauge
vault.secret.lease.creation (cluster, namespace, secret_engine, mount_point, creation_ttl) Counts the number of leases created by secret engines. leases counter

Storage Backend Metrics

These metrics relate to the supported storage backends.

Metric Description Unit Type
vault.azure.put Duration of a PUT operation against the Azure storage backend ms summary
vault.azure.get Duration of a GET operation against the Azure storage backend ms summary
vault.azure.delete Duration of a DELETE operation against the Azure storage backend ms summary
vault.azure.list Duration of a LIST operation against the Azure storage backend ms summary
vault.cassandra.put Duration of a PUT operation against the Cassandra storage backend ms summary
vault.cassandra.get Duration of a GET operation against the Cassandra storage backend ms summary
vault.cassandra.delete Duration of a DELETE operation against the Cassandra storage backend ms summary
vault.cassandra.list Duration of a LIST operation against the Cassandra storage backend ms summary
vault.cockroachdb.put Duration of a PUT operation against the CockroachDB storage backend ms summary
vault.cockroachdb.get Duration of a GET operation against the CockroachDB storage backend ms summary
vault.cockroachdb.delete Duration of a DELETE operation against the CockroachDB storage backend ms summary
vault.cockroachdb.list Duration of a LIST operation against the CockroachDB storage backend ms summary
vault.consul.put Duration of a PUT operation against the Consul storage backend ms summary
vault.consul.transaction Duration of a Txn operation against the Consul storage backend ms summary
vault.consul.get Duration of a GET operation against the Consul storage backend ms summary
vault.consul.delete Duration of a DELETE operation against the Consul storage backend ms summary
vault.consul.list Duration of a LIST operation against the Consul storage backend ms summary
vault.couchdb.put Duration of a PUT operation against the CouchDB storage backend ms summary
vault.couchdb.get Duration of a GET operation against the CouchDB storage backend ms summary
vault.couchdb.delete Duration of a DELETE operation against the CouchDB storage backend ms summary
vault.couchdb.list Duration of a LIST operation against the CouchDB storage backend ms summary
vault.dynamodb.put Duration of a PUT operation against the DynamoDB storage backend ms summary
vault.dynamodb.get Duration of a GET operation against the DynamoDB storage backend ms summary
vault.dynamodb.delete Duration of a DELETE operation against the DynamoDB storage backend ms summary
vault.dynamodb.list Duration of a LIST operation against the DynamoDB storage backend ms summary
vault.etcd.put Duration of a PUT operation against the etcd storage backend ms summary
vault.etcd.get Duration of a GET operation against the etcd storage backend ms summary
vault.etcd.delete Duration of a DELETE operation against the etcd storage backend ms summary
vault.etcd.list Duration of a LIST operation against the etcd storage backend ms summary
vault.gcs.put Duration of a PUT operation against the Google Cloud Storage storage backend ms summary
vault.gcs.get Duration of a GET operation against the Google Cloud Storage storage backend ms summary
vault.gcs.delete Duration of a DELETE operation against the Google Cloud Storage storage backend ms summary
vault.gcs.list Duration of a LIST operation against the Google Cloud Storage storage backend ms summary
vault.gcs.lock.unlock Duration of an UNLOCK operation against the Google Cloud Storage storage backend in HA mode ms summary
vault.gcs.lock.lock Duration of a LOCK operation against the Google Cloud Storage storage backend in HA mode ms summary
vault.gcs.lock.value Duration of a VALUE operation against the Google Cloud Storage storage backend in HA mode ms summary
vault.mssql.put Duration of a PUT operation against the MS-SQL storage backend ms summary
vault.mssql.get Duration of a GET operation against the MS-SQL storage backend ms summary
vault.mssql.delete Duration of a DELETE operation against the MS-SQL storage backend ms summary
vault.mssql.list Duration of a LIST operation against the MS-SQL storage backend ms summary
vault.mysql.put Duration of a PUT operation against the MySQL storage backend ms summary
vault.mysql.get Duration of a GET operation against the MySQL storage backend ms summary
vault.mysql.delete Duration of a DELETE operation against the MySQL storage backend ms summary
vault.mysql.list Duration of a LIST operation against the MySQL storage backend ms summary
vault.postgres.put Duration of a PUT operation against the PostgreSQL storage backend ms summary
vault.postgres.get Duration of a GET operation against the PostgreSQL storage backend ms summary
vault.postgres.delete Duration of a DELETE operation against the PostgreSQL storage backend ms summary
vault.postgres.list Duration of a LIST operation against the PostgreSQL storage backend ms summary
vault.s3.put Duration of a PUT operation against the Amazon S3 storage backend ms summary
vault.s3.get Duration of a GET operation against the Amazon S3 storage backend ms summary
vault.s3.delete Duration of a DELETE operation against the Amazon S3 storage backend ms summary
vault.s3.list Duration of a LIST operation against the Amazon S3 storage backend ms summary
vault.spanner.put Duration of a PUT operation against the Google Cloud Spanner storage backend ms summary
vault.spanner.get Duration of a GET operation against the Google Cloud Spanner storage backend ms summary
vault.spanner.delete Duration of a DELETE operation against the Google Cloud Spanner storage backend ms summary
vault.spanner.list Duration of a LIST operation against the Google Cloud Spanner storage backend ms summary
vault.spanner.lock.unlock Duration of an UNLOCK operation against the Google Cloud Spanner storage backend in HA mode ms summary
vault.spanner.lock.lock Duration of a LOCK operation against the Google Cloud Spanner storage backend in HA mode ms summary
vault.spanner.lock.value Duration of a VALUE operation against the Google Cloud Spanner storage backend in HA mode ms summary
vault.swift.put Duration of a PUT operation against the Swift storage backend ms summary
vault.swift.get Duration of a GET operation against the Swift storage backend ms summary
vault.swift.delete Duration of a DELETE operation against the Swift storage backend ms summary
vault.swift.list Duration of a LIST operation against the Swift storage backend ms summary
vault.zookeeper.put Duration of a PUT operation against the ZooKeeper storage backend ms summary
vault.zookeeper.get Duration of a GET operation against the ZooKeeper storage backend ms summary
vault.zookeeper.delete Duration of a DELETE operation against the ZooKeeper storage backend ms summary
vault.zookeeper.list Duration of a LIST operation against the ZooKeeper storage backend ms summary
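
Every metric in the table above follows the pattern `vault.<backend>.<operation>`, which is convenient when grouping dashboards by backend. The following is a minimal sketch of splitting a name into those two parts. The function name `split_storage_metric` is hypothetical and not part of Vault, and the sketch only checks the shape of the name, not whether the backend is a real storage backend.

```python
from typing import Optional, Tuple


def split_storage_metric(name: str) -> Optional[Tuple[str, str]]:
    """Split 'vault.<backend>.<operation>' into (backend, operation).

    Returns None when the name does not start with 'vault.' or has fewer
    than three dot-separated parts. Multi-part operations such as
    'lock.unlock' are kept together.
    """
    parts = name.split(".")
    if len(parts) < 3 or parts[0] != "vault":
        return None
    backend = parts[1]
    operation = ".".join(parts[2:])
    return backend, operation


if __name__ == "__main__":
    for metric in ("vault.consul.transaction", "vault.gcs.lock.unlock"):
        print(metric, "->", split_storage_metric(metric))
```

Because the check is purely structural, names from other tables (for example `vault.raft.apply`) would also be split; filter on the backend value if that matters for your use case.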

Integrated Raft Storage Health

These metrics relate to Raft-based integrated storage.

Metric Description Unit Type
vault.raft.apply Number of Raft transactions occurring over the interval, which is a general indicator of the write load on the Raft servers. raft transactions / interval counter
vault.raft.barrier Number of times the node has started the barrier, i.e. the number of times it has issued a blocking call to ensure that all pending queued operations have been applied to the node's FSM. blocks / interval counter
vault.raft.candidate.electSelf Time to request a vote from a peer. ms summary
vault.raft.commitNumLogs Number of logs processed for application to the FSM in a single batch. logs gauge
vault.raft.commitTime Time to commit a new entry to the Raft log on the leader. ms timer
vault.raft.compactLogs Time to trim the logs that are no longer needed. ms summary
vault.raft.delete Time to delete a file from raft's underlying storage. ms summary
vault.raft.delete_prefix Time to delete files under a prefix from raft's underlying storage. ms summary
vault.raft.fsm.apply Number of logs committed since the last interval. commit logs / interval summary
vault.raft.fsm.applyBatch Time to apply a batch of logs. ms summary
vault.raft.fsm.applyBatchNum Number of logs applied in a batch. logs summary
vault.raft.fsm.enqueue Time to enqueue a batch of logs for the FSM to apply. ms timer
vault.raft.fsm.restore Time taken by the FSM to restore its state from a snapshot. ms summary
vault.raft.fsm.snapshot Time taken by the FSM to record the current state for the snapshot. ms summary
vault.raft.fsm.store_config Time to store the configuration. ms summary
vault.raft.get Time to retrieve a file from raft's underlying storage. ms summary
vault.raft.leader.dispatchLog Time for the leader to write log entries to disk. ms timer
vault.raft.leader.dispatchNumLogs Number of logs committed to disk in a batch. logs gauge
vault.raft.list Time to retrieve a list of keys from raft's underlying storage. ms summary
vault.raft.peers Number of peers in the raft cluster configuration. peers gauge
vault.raft.put Time to persist a key in raft's underlying storage. ms summary
vault.raft.replication.appendEntries.log Number of logs replicated to a node, to bring it up to speed with the leader's logs. logs appended / interval counter
vault.raft.replication.appendEntries.rpc Time taken by the append entries RPC to replicate the log entries of a leader node onto its follower node(s). ms timer
vault.raft.replication.heartbeat Time taken to invoke appendEntries on a peer, so that it does not time out on a periodic basis. ms timer
vault.raft.replication.installSnapshot Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state. ms timer
vault.raft.restore Number of times the restore operation has been performed by the node. Here, restore refers to the action of raft consuming an external snapshot to restore its state. operation invoked / interval counter
vault.raft.restoreUserSnapshot Time taken by the node to restore the FSM state from a user's snapshot. ms timer
vault.raft.rpc.appendEntries Time taken to process an append entries RPC call from a node. ms timer
vault.raft.rpc.appendEntries.processLogs Time taken to process the outstanding log entries of a node. ms timer
vault.raft.rpc.appendEntries.storeLogs Time taken to add any outstanding logs for a node, since the last appendEntries was invoked. ms timer
vault.raft.rpc.installSnapshot Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state. ms timer
vault.raft.rpc.processHeartbeat Time taken to process a heartbeat request. ms timer
vault.raft.rpc.requestVote Time taken to complete the requestVote RPC call. ms summary
vault.raft.snapshot.create Time taken to initialize the snapshot process. ms timer
vault.raft.snapshot.persist Time taken to dump the current snapshot taken by the node to the disk. ms timer
vault.raft.snapshot.takeSnapshot Total time involved in taking the current snapshot (creating one and persisting it) by the node. ms timer
vault.raft.state.follower Number of times the node has entered the follower state. This happens when a new node joins the cluster or after the end of a leader election. follower state entered / interval counter
vault.raft.transition.heartbeat_timeout Number of times the node has transitioned to the candidate state after receiving no heartbeat messages from the last known leader. timeouts / interval counter
vault.raft.transition.leader_lease_timeout Number of times a quorum of nodes could not be contacted. contact failures counter
vault.raft.verify_leader Number of times the node checks whether it is still the leader. checks / interval counter
vault.raft-storage.delete Time to insert log entry to delete path. ms timer
vault.raft-storage.get Time to retrieve value for path from FSM. ms timer
vault.raft-storage.put Time to insert log entry to persist path. ms timer
vault.raft-storage.list Time to list all entries under the prefix from the FSM. ms timer
vault.raft-storage.transaction Time to insert operations into a single log. ms timer
vault.raft-storage.entry_size The total size of a Raft entry during log application in bytes. bytes summary
vault.raft_storage.bolt.freelist.free_pages Number of free pages in the freelist. pages gauge
vault.raft_storage.bolt.freelist.pending_pages Number of pending pages in the freelist. pages gauge
vault.raft_storage.bolt.freelist.allocated_bytes Total bytes allocated in free pages. bytes gauge
vault.raft_storage.bolt.freelist.used_bytes Total bytes used by the freelist. bytes gauge
vault.raft_storage.bolt.transaction.started_read_transactions Number of started read transactions. transactions gauge
vault.raft_storage.bolt.transaction.currently_open_read_transactions Number of currently open read transactions. transactions gauge
vault.raft_storage.bolt.page.count Number of page allocations. allocations gauge
vault.raft_storage.bolt.page.bytes_allocated Total bytes allocated. bytes gauge
vault.raft_storage.bolt.cursor.count Number of cursors created. cursors gauge
vault.raft_storage.bolt.node.count Number of node allocations. nodes gauge
vault.raft_storage.bolt.node.dereferences Number of node dereferences. dereferences gauge
vault.raft_storage.bolt.rebalance.count Number of node rebalances. rebalances gauge
vault.raft_storage.bolt.rebalance.time Time taken rebalancing. ms summary
vault.raft_storage.bolt.split.count Number of nodes split. nodes gauge
vault.raft_storage.bolt.spill.count Number of nodes spilled. nodes gauge
vault.raft_storage.bolt.spill.time Time taken spilling. ms summary
vault.raft_storage.bolt.write.count Number of writes performed. writes gauge
vault.raft_storage.bolt.write.time Time taken writing to disk. ms summary

Integrated Raft Storage Leadership Changes

Metric Description Unit Type
vault.raft.leader.lastContact Measures the time since the leader was last able to contact the follower nodes when checking its leader lease ms summary
vault.raft.state.candidate Increments whenever the raft server starts an election Elections counter
vault.raft.state.leader Increments whenever the raft server becomes a leader Leaders counter

Why they're important: Normally, your raft cluster should have a stable leader. If there are frequent elections or leadership changes, it would likely indicate network issues between the raft nodes, or that the raft servers themselves are unable to keep up with the load.

What to look for: For a healthy cluster, you're looking for a lastContact lower than 200ms, leader > 0 and candidate == 0. Deviations from this might indicate flapping leadership.
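
The guidance above can be turned into a simple check. The following is a minimal sketch, assuming the three values have already been read from your metrics aggregation system for the same interval. The function name `raft_leadership_warnings`, its parameters, and the warning wording are hypothetical and not part of Vault.

```python
from typing import List

# Threshold taken from the guidance above: lastContact should stay below 200 ms.
LAST_CONTACT_MAX_MS = 200.0


def raft_leadership_warnings(last_contact_ms: float,
                             leader_count: int,
                             candidate_count: int) -> List[str]:
    """Return a list of warnings; an empty list means the values look healthy.

    The arguments are assumed to be readings of vault.raft.leader.lastContact,
    vault.raft.state.leader, and vault.raft.state.candidate respectively.
    """
    warnings = []
    if last_contact_ms >= LAST_CONTACT_MAX_MS:
        warnings.append(
            f"lastContact is {last_contact_ms:.0f} ms (expected < 200 ms)"
        )
    if leader_count <= 0:
        warnings.append("vault.raft.state.leader is not > 0")
    if candidate_count != 0:
        warnings.append(
            f"vault.raft.state.candidate is {candidate_count} (expected 0)"
        )
    return warnings


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    for warning in raft_leadership_warnings(250.0, 1, 2):
        print("WARNING:", warning)
```

A single warning is not necessarily a problem; repeated warnings across intervals are the more likely sign of flapping leadership.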

Integrated Raft Storage Automated Snapshots

These metrics relate to the Enterprise feature Raft Automated Snapshots.

Metric Description Unit Type
vault.autosnapshots.total.snapshot.size For storage_type=local, space on disk used by saved snapshots bytes gauge
vault.autosnapshots.percent.maxspace.used For storage_type=local, percent used of maximum allocated space percentage gauge
vault.autosnapshots.save.errors Increments whenever an error occurs trying to save a snapshot n/a counter
vault.autosnapshots.save.duration Measures the time taken saving a snapshot ms summary
vault.autosnapshots.last.success.time Epoch time (seconds since 1970/01/01) of last successful snapshot save n/a gauge
vault.autosnapshots.snapshot.size Measures the size in bytes of snapshots bytes summary
vault.autosnapshots.rotate.duration Measures the time taken to rotate (i.e. delete) old snapshots to satisfy configured retention ms summary
vault.autosnapshots.snapshots.in.storage Number of snapshots in storage n/a gauge
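
Because `vault.autosnapshots.last.success.time` is reported as an epoch time in seconds, the age of the most recent successful snapshot has to be derived by subtracting it from the current time. The following is a minimal sketch of that calculation. The function names and the `max_age_seconds` parameter are hypothetical, and what counts as "stale" depends on your own snapshot interval.

```python
import time
from typing import Optional


def seconds_since_last_snapshot(last_success_epoch: float,
                                now: Optional[float] = None) -> float:
    """Return the age, in seconds, of the last successful snapshot.

    last_success_epoch is assumed to be the gauge value of
    vault.autosnapshots.last.success.time. `now` can be passed explicitly
    to make the function deterministic; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, now - last_success_epoch)


def snapshot_is_stale(last_success_epoch: float,
                      max_age_seconds: float,
                      now: Optional[float] = None) -> bool:
    """Return True when the last successful snapshot is older than max_age_seconds."""
    return seconds_since_last_snapshot(last_success_epoch, now) > max_age_seconds


if __name__ == "__main__":
    # Hypothetical gauge value, for illustration only.
    print(snapshot_is_stale(1_700_000_000, max_age_seconds=7200))
```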

Metric Labels

Label Description Example
auth_method Authentication method type. userpass
cluster The cluster name from which the metric originated; set in the configuration file, or automatically generated when a cluster is created. vault-cluster-d54ad07
creation_ttl Time-to-live value assigned to a token or lease at creation. This value is rounded up to the next-highest bucket; the available buckets are 1m, 10m, 20m, 1h, 2h, 1d, 2d, 7d, and 30d. Any longer TTL is assigned the value +Inf. 7d
mount_point Path at which an auth method or secret engine is mounted. auth/userpass/
namespace A namespace path, or root for the root namespace ns1
policy A single named policy default
secret_engine The secret engine type. aws
token_type Identifies whether the token is a batch token or a service token. service
peer_id Unique identifier of a peer. node-1
snapshot_config_name For automated snapshots, the name of the configuration config1
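
To illustrate how the `creation_ttl` label groups TTLs, the following is a minimal sketch of the rounding described in the table above. It is not Vault's implementation: the function name `creation_ttl_label` is hypothetical, and it assumes that a TTL exactly equal to a bucket boundary is placed in that bucket.

```python
# Bucket upper bounds in seconds, in the order listed in the label description.
CREATION_TTL_BUCKETS = [
    ("1m", 60),
    ("10m", 10 * 60),
    ("20m", 20 * 60),
    ("1h", 60 * 60),
    ("2h", 2 * 60 * 60),
    ("1d", 24 * 60 * 60),
    ("2d", 2 * 24 * 60 * 60),
    ("7d", 7 * 24 * 60 * 60),
    ("30d", 30 * 24 * 60 * 60),
]


def creation_ttl_label(ttl_seconds: float) -> str:
    """Round a TTL up to the next-highest bucket and return its label.

    Assumption: a TTL equal to a bucket's upper bound belongs to that bucket.
    Any TTL longer than 30 days is labeled '+Inf'.
    """
    for label, upper_bound in CREATION_TTL_BUCKETS:
        if ttl_seconds <= upper_bound:
            return label
    return "+Inf"


if __name__ == "__main__":
    # A 3-day TTL is rounded up to the 7d bucket.
    print(creation_ttl_label(3 * 24 * 60 * 60))
```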