---
layout: docs
page_title: Integrated Storage
description: Learn about the integrated Raft storage in Vault.
---

# Integrated Storage

Vault supports several storage options for the durable storage of Vault's data. Each backend has its own advantages and trade-offs; for example, some backends support high availability while others provide a more robust backup and restoration process.

As of Vault 1.4, an Integrated Storage option is offered. This storage backend does not rely on any third-party systems; it implements high availability, supports Enterprise Replication features, and provides backup/restore workflows.

## Consensus Protocol

Vault's Integrated Storage uses a consensus protocol to provide Consistency (as defined by CAP). The consensus protocol is based on Raft, as described in "In Search of an Understandable Consensus Algorithm". For a visual explanation of Raft, see The Secret Lives of Data.

### Raft Protocol Overview

Raft is a consensus algorithm that is based on Paxos. Compared to Paxos, Raft is designed to have fewer states and a simpler, more understandable algorithm.

There are a few key terms to know when discussing Raft:

- Leader - At any given time, the peer set elects a single node to be the leader. The leader is responsible for ingesting new log entries, replicating to followers, and managing when an entry is committed. The leader node is also the active Vault node and followers are standby nodes. Refer to the High Availability document for more information.

- Log - An ordered sequence of entries (replicated log) that keeps track of any cluster changes. The leader is responsible for log replication. When new data is written, for example, a new log entry is created. The leader then sends the new log entry to its followers. Any inconsistency within the replicated log entries indicates an issue.

- FSM - Finite State Machine. A collection of finite states with transitions between them. As new logs are applied, the FSM is allowed to transition between states. Application of the same sequence of logs must result in the same state, meaning behavior must be deterministic.

- Peer set - The set of all members participating in log replication. All server nodes are in the peer set of the local cluster.

- Quorum - A majority of members from a peer set: for a set of size n, quorum requires at least (n/2)+1 members (integer division); see the sketch after this list. For example, if there are 5 members in the peer set, we would need 3 nodes to form a quorum. If a quorum of nodes is unavailable for any reason, the cluster becomes unavailable and no new logs can be committed.

- Committed Entry - An entry is considered committed when it is durably stored on a quorum of nodes. An entry is applied once it is committed.
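
To make the quorum arithmetic concrete, here is a minimal Go sketch (illustrative only, not Vault source code) of quorum size and failure tolerance for a peer set of size n:

```go
package main

import "fmt"

// quorumSize returns the minimum number of nodes that must agree before an
// entry can be committed, for a peer set of size n: (n/2)+1 with integer
// division.
func quorumSize(n int) int { return n/2 + 1 }

// faultTolerance returns how many nodes can fail while the cluster still
// retains quorum.
func faultTolerance(n int) int { return n - quorumSize(n) }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("peers=%d quorum=%d failure tolerance=%d\n",
			n, quorumSize(n), faultTolerance(n))
	}
	// Output:
	// peers=1 quorum=1 failure tolerance=0
	// peers=3 quorum=2 failure tolerance=1
	// peers=5 quorum=3 failure tolerance=2
}
```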

Raft is a complex protocol and will not be covered here in detail. We will, however, attempt to provide a high-level description which may be useful for building a mental model. For those who want a more comprehensive understanding of Raft, the full specification is available in the original Raft paper.

Raft nodes are always in one of three states: follower, candidate, or leader. All nodes initially start out as followers. In this state, nodes can accept log entries from a leader and cast votes. If no entries are received for a period of time, a node self-promotes to the candidate state. In the candidate state, a node requests votes from its peers. If a candidate receives a quorum of votes, it is promoted to leader. The leader must accept new log entries and replicate them to all the other followers.
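
The state cycle above can be summarized in a few lines of Go. This is a simplified model of the transitions just described, not Vault's (or the underlying Raft library's) actual election logic; the `nextState` function and its parameters are invented for illustration:

```go
package main

import "fmt"

// NodeState models the three Raft roles.
type NodeState int

const (
	Follower NodeState = iota
	Candidate
	Leader
)

// nextState sketches the transitions: a follower that hears nothing from a
// leader self-promotes to candidate; a candidate that gathers a quorum of
// votes becomes leader; a candidate that observes a current leader steps
// back down to follower.
func nextState(s NodeState, heardFromLeader bool, votes, clusterSize int) NodeState {
	quorum := clusterSize/2 + 1
	switch s {
	case Follower:
		if !heardFromLeader {
			return Candidate // election timeout fired: request votes from peers
		}
	case Candidate:
		if votes >= quorum {
			return Leader // won the election
		}
		if heardFromLeader {
			return Follower // another node won; step down
		}
	case Leader:
		// A real implementation also steps down on seeing a higher term;
		// omitted here for brevity.
	}
	return s
}

func main() {
	s := Follower
	s = nextState(s, false, 0, 3) // no heartbeats: become candidate
	s = nextState(s, false, 2, 3) // 2 of 3 votes is a quorum: become leader
	fmt.Println(s == Leader)      // true
}
```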

Once a cluster has a leader, it is able to accept new log entries. A client can request that a leader append a new log entry (from Raft's perspective, a log entry is an opaque binary blob). The leader then writes the entry to durable storage and attempts to replicate it to a quorum of followers. Once the log entry is considered committed, it can be applied to a finite state machine. The finite state machine is application specific; in Vault's case, we use BoltDB to maintain cluster state. Vault's writes are blocked until they are committed and applied.
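
As an illustration of the FSM contract, deterministic application of committed entries in log order, here is a toy in-memory key/value FSM in Go. It is not Vault's BoltDB-backed FSM; the `logEntry` and `kvFSM` types and the key names are made up for the example:

```go
package main

import "fmt"

// logEntry gives the otherwise opaque Raft log entry a tiny structure so the
// FSM can interpret it. The fields and key names are hypothetical.
type logEntry struct {
	Index uint64
	Key   string
	Value string
}

// kvFSM is a deterministic finite state machine: applying the same sequence
// of committed entries always produces the same state.
type kvFSM struct {
	lastApplied uint64
	data        map[string]string
}

func newKVFSM() *kvFSM { return &kvFSM{data: make(map[string]string)} }

// Apply is called only for committed entries, in log order.
func (f *kvFSM) Apply(e logEntry) {
	f.data[e.Key] = e.Value
	f.lastApplied = e.Index
}

func main() {
	committed := []logEntry{
		{1, "secret/foo", "v1"},
		{2, "secret/foo", "v2"},
		{3, "secret/bar", "v1"},
	}

	// Replaying the same committed log on two nodes yields identical state.
	a, b := newKVFSM(), newKVFSM()
	for _, e := range committed {
		a.Apply(e)
		b.Apply(e)
	}
	fmt.Println(a.data["secret/foo"] == b.data["secret/foo"]) // true ("v2")
}
```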

It would be undesirable to allow a replicated log to grow in an unbounded fashion. Raft provides a mechanism by which the current state is snapshotted and the log is compacted. Because of the FSM abstraction, restoring the state of the FSM must result in the same state as a replay of old logs. This allows Raft to capture the FSM state at a point in time and then remove all the logs that were used to reach that state. This is performed automatically without user intervention and prevents unbounded disk usage while also minimizing the time spent replaying logs. One of the advantages of using BoltDB is that it allows Vault's snapshots to be very lightweight. Since Vault's data is already persisted to disk in BoltDB, the snapshot process just needs to truncate the Raft logs.
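
The snapshot-and-compact step can be sketched generically as follows. This assumes an in-memory state map for simplicity; as noted above, Vault's state already lives on disk in BoltDB, so its snapshot step mostly reduces to truncating the Raft log rather than copying state:

```go
package main

import "fmt"

type entry struct {
	Index uint64
	Key   string
	Value string
}

// compact captures the FSM state at lastApplied and drops every log entry
// the snapshot already covers, bounding disk usage.
func compact(state map[string]string, lastApplied uint64, log []entry) (map[string]string, []entry) {
	snap := make(map[string]string, len(state))
	for k, v := range state {
		snap[k] = v // point-in-time copy of the FSM state
	}
	var kept []entry
	for _, e := range log {
		if e.Index > lastApplied {
			kept = append(kept, e) // keep only entries newer than the snapshot
		}
	}
	return snap, kept
}

func main() {
	// State after applying entries 1-3; entry 4 is not yet applied.
	state := map[string]string{"secret/foo": "v2", "secret/bar": "v1"}
	log := []entry{
		{1, "secret/foo", "v1"},
		{2, "secret/foo", "v2"},
		{3, "secret/bar", "v1"},
		{4, "secret/baz", "v1"},
	}

	snap, remaining := compact(state, 3, log)
	fmt.Println(len(snap), len(remaining)) // 2 1: entries 1-3 are now covered by the snapshot
}
```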

Consensus is fault-tolerant while a cluster has quorum. If a quorum of nodes is unavailable, it is impossible to process log entries or reason about peer membership. For example, suppose there are only 2 peers: A and B. The quorum size is also 2, meaning both nodes must agree to commit a log entry. If either A or B fails, it is now impossible to reach quorum. This means the cluster is unable to add or remove a node or to commit any additional log entries. This results in unavailability. At this point, manual intervention is required to remove either A or B and restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure while a cluster of 5 can tolerate 2 node failures. The recommended configuration is to either run 3 or 5 Vault servers per cluster. This maximizes availability without greatly sacrificing performance. The deployment table below summarizes the potential cluster size options and the fault tolerance of each.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership, committing a log entry requires a single round trip to half of the cluster. Thus, performance is bound by disk I/O and network latency.

### Raft in Vault

When getting started, a single Vault server is initialized. At this point, the cluster is of size 1, which allows the node to self-elect as a leader. Once a leader is elected, other servers can be added to the peer set in a way that preserves consistency and safety.

The join process is how new nodes are added to the Vault cluster; it uses an encrypted challenge/answer workflow. To accomplish this, all nodes in a single Raft cluster must share the same seal configuration. If using Auto Unseal, the join process can use the configured seal to automatically decrypt the challenge and respond with the answer. If using a Shamir seal, the unseal keys must be provided to the node attempting to join the cluster before it can decrypt the challenge and respond with the decrypted answer.
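
The overall shape of that exchange can be sketched as follows. This is a conceptual illustration only: it uses AES-GCM with a shared key as a stand-in for the configured seal, and none of the type or function names correspond to Vault's actual join implementation:

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// seal is a hypothetical stand-in for the shared seal mechanism; both sides
// can construct it because they share the same seal configuration.
type seal struct{ aead cipher.AEAD }

func newSeal(key []byte) (*seal, error) {
	block, err := aes.NewCipher(key) // 32-byte key -> AES-256
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return &seal{aead: aead}, nil
}

func (s *seal) encrypt(plaintext []byte) []byte {
	nonce := make([]byte, s.aead.NonceSize())
	rand.Read(nonce)
	return s.aead.Seal(nonce, nonce, plaintext, nil) // nonce || ciphertext
}

func (s *seal) decrypt(blob []byte) ([]byte, error) {
	n := s.aead.NonceSize()
	return s.aead.Open(nil, blob[:n], blob[n:], nil)
}

func main() {
	key := make([]byte, 32)
	rand.Read(key)

	clusterSeal, _ := newSeal(key) // existing cluster
	joinerSeal, _ := newSeal(key)  // node attempting to join

	// 1. The cluster generates a random challenge and encrypts it with the seal.
	challenge := make([]byte, 16)
	rand.Read(challenge)
	sealedChallenge := clusterSeal.encrypt(challenge)

	// 2. Only a node with the same seal configuration can decrypt the
	//    challenge and respond with the answer.
	answer, err := joinerSeal.decrypt(sealedChallenge)
	if err != nil {
		panic(err)
	}

	// 3. The cluster verifies the answer before adding the node to the peer set.
	fmt.Println("join accepted:", bytes.Equal(challenge, answer))
}
```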

Since all servers participate as part of the peer set, they all know the current leader. When an API request arrives at a non-leader server, the request is forwarded to the leader.

Similar to other storage backends, data that is written to the Raft log and FSM will be encrypted by Vault's barrier.

Vault does not currently offer automated dead server cleanup. If you wish to decommission a node, or a node dies and must be replaced, the node must be removed from the cluster manually with the `vault operator raft remove-peer` command.

### Deployment Table

Below is a table that shows quorum size and failure tolerance for various cluster sizes. The recommended deployment is either 3 or 5 servers. A single server deployment is highly discouraged as data loss is inevitable in a failure scenario.

| Servers | Quorum Size | Failure Tolerance |
| :-----: | :---------: | :---------------: |
|    1    |      1      |         0         |
|    2    |      2      |         0         |
|    3    |      2      |         1         |
|    4    |      3      |         1         |
|    5    |      3      |         2         |
|    6    |      4      |         2         |
|    7    |      4      |         3         |