From 68e176cb725ff788d4514a81258129e19dd605a3 Mon Sep 17 00:00:00 2001 From: Daniel Nephin Date: Fri, 25 Feb 2022 13:07:05 -0500 Subject: [PATCH] First draft of developer docs Still lots of detail that could be added. --- docs/README.md | 106 +++++++++++++++++++++++++++++++++++++++++++++++++ docs/apply.md | 45 +++++++++++++++++++++ 2 files changed, 151 insertions(+) create mode 100644 docs/README.md create mode 100644 docs/apply.md diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 000000000..0762c5994 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,106 @@ +# Raft Developer Documentation + +This documentation provides a high level introduction to the `hashicorp/raft` +implementation. The intended audience is anyone interested in understanding +or contributing to the code. + +## Contents + +1. [Terminology](#terminology) +2. [Operations](#operations) + 1. [Apply](./apply.md) +3. [Threads](#threads) + + +## Terminology + +This documentation uses the following terms as defined. + +* **Cluster** - the set of peers in the raft configuration +* **Peer** - a node that participates in the consensus protocol using `hashicorp/raft`. A + peer may be in one of the following states: **follower**, **candidate**, or **leader**. +* **Log** - the full set of log entries. +* **Log Entry** - an entry in the log. Each entry has an index that is used to order it + relative to other log entries. + * **Committed** - A log entry is considered committed if it is safe for that entry to be + applied to state machines. A log entry is committed once the leader that created the + entry has replicated it on a majority of the peers. A peer has successfully + replicated the entry once it is persisted. + * **Applied** - log entry applied to the state machine (FSM) +* **Term** - raft divides time into terms of arbitrary length. Terms are numbered with + consecutive integers. Each term begins with an election, in which one or more candidates + attempt to become leader. If a candidate wins the election, then it serves as leader for + the rest of the term. If the election ends with a split vote, the term will end with no + leader. +* **FSM** - finite state machine, stores the cluster state +* **Client** - the application that uses the `hashicorp/raft` library + +## Operations + +### Leader Write + +Most write operations must be performed on the leader. + +* RequestConfigChange - update the raft peer list configuration +* Apply - apply a log entry to the log on a majority of peers, and the FSM. See [raft apply](apply.md) for more details. +* Barrier - a special Apply that does not modify the FSM, used to wait for previous logs to be applied +* LeadershipTransfer - stop accepting client requests, and tell a different peer to start a leadership election +* Restore (Snapshot) - overwrite the cluster state with the contents of the snapshot (excluding cluster configuration) +* VerifyLeader - send a heartbeat to all voters to confirm the peer is still the leader + +### Follower Write + +* BootstrapCluster - store the cluster configuration in the local log store + + +### Read + +Read operations can be performed on a peer in any state. + +* AppliedIndex - get the index of the last log entry applied to the FSM +* GetConfiguration - return the latest cluster configuration +* LastContact - get the last time this peer made contact with the leader +* LastIndex - get the index of the latest stored log entry +* Leader - get the address of the peer that is currently the leader +* Snapshot - snapshot the current state of the FSM into a file +* State - return the state of the peer +* Stats - return some stats about the peer and the cluster + +## Threads + +Raft uses the following threads to handle operations. The name of the thread is in bold, +and a short description of the operation handled by the thread follows. The main thread is +responsible for handling many operations. + +* **run** (main thread) - different behaviour based on peer state + * follower + * processRPC (from rpcCh) + * AppendEntries + * RequestVote + * InstallSnapshot + * TimeoutNow + * liveBootstrap (from bootstrapCh) + * periodic heartbeatTimer (HeartbeatTimeout) + * candidate - starts an election for itself when called + * processRPC (from rpcCh) - same as follower + * acceptVote (from askPeerForVote) + * leader - first starts replication to all peers, and applies a Noop log to ensure the new leader has committed up to the commit index + * processRPC (from rpcCh) - same as follower, however we don’t actually expect to receive any RPCs other than a RequestVote + * leadershipTransfer (from leadershipTransferCh) - + * commit (from commitCh) - + * verifyLeader (from verifyCh) - + * user restore snapshot (from userRestoreCh) - + * changeConfig (from configurationChangeCh) - + * dispatchLogs (from applyCh) - handle client Raft.Apply requests by persisting logs to disk, and notifying replication goroutines to replicate the new logs + * checkLease (periodically LeaseTimeout) - +* **runFSM** - has exclusive access to the FSM, all reads and writes must send a message to this thread. Commands: + * apply logs to the FSM, from the fsmMutateCh, from processLogs, from leaderLoop (leader) or appendEntries RPC (follower/candidate) + * restore a snapshot to the FSM, from the fsmMutateCh, from restoreUserSnapshot (leader) or installSnapshot RPC (follower/candidate) + * capture snapshot, from fsmSnapshotCh, from takeSnapshot (runSnapshot thread) +* **runSnapshot** - handles the slower part of taking a snapshot. From a pointer captured by the FSM.Snapshot operation, this thread persists the snapshot by calling FSMSnapshot.Persist. Also calls compactLogs to delete old logs. + * periodically (SnapshotInterval) takeSnapshot for log compaction + * user snapshot, from userSnapshotCh, takeSnapshot to return to the user +* **askPeerForVote (candidate only)** - short lived goroutine that synchronously sends a RequestVote RPC to all voting peers, and waits for the response. One goroutine per voting peer. +* **replicate (leader only)** - long running goroutine that synchronously sends log entry AppendEntry RPCs to all peers. Also starts the heartbeat thread, and possibly the pipelineDecode thread. Runs sendLatestSnapshot when AppendEntry fails. + * **heartbeat (leader only)** - long running goroutine that synchronously sends heartbeat AppendEntry RPCs to all peers. + * **pipelineDecode (leader only)** diff --git a/docs/apply.md b/docs/apply.md new file mode 100644 index 000000000..f4e35d956 --- /dev/null +++ b/docs/apply.md @@ -0,0 +1,45 @@ +# Raft Apply + +Apply is the primary operation provided by raft. + + +This sequence diagram shows the steps involved in a `raft.Apply` operation. Each box +across the top is a separate thread. The name in the box identifies the state of the peer +(leader or follower) and the thread (`:`). When there are +multiple copies of the thread, it is indicated with `(each peer)`. + +```mermaid +sequenceDiagram + autonumber + + participant client + participant leadermain as leader:main + participant leaderfsm as leader:fsm + participant leaderreplicate as leader:replicate (each peer) + participant followermain as follower:main (each peer) + participant followerfsm as follower:fsm (each peer) + + client-)leadermain: applyCh to dispatchLogs + leadermain->>leadermain: store logs to disk + + leadermain-)leaderreplicate: triggerCh + leaderreplicate-->>followermain: AppendEntries RPC + + followermain->>followermain: store logs to disk + + opt leader commit index is ahead of peer commit index + followermain-)followerfsm: fsmMutateCh
apply committed logs + followerfsm->>followerfsm: fsm.Apply + end + + followermain-->>leaderreplicate: respond success=true + leaderreplicate->>leaderreplicate: update commitment + + opt quorum commit index has increased + leaderreplicate-)leadermain: commitCh + leadermain-)leaderfsm: fsmMutateCh + leaderfsm->>leaderfsm: fsm.Apply + leaderfsm-)client: future.respond + end + +```