Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The notion of Term is overloaded #144

Open
pav-kv opened this issue Jan 30, 2024 · 0 comments
Open

The notion of Term is overloaded #144

pav-kv opened this issue Jan 30, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@pav-kv
Copy link
Contributor

pav-kv commented Jan 30, 2024

Problem

Currently, a raft node has a Term field (and its storage counterpart in HardState) which is the latest term seen by this node.

  • This term may or may not have a vote from this/other nodes, and may or may not have a leader.
  • The local raftLog, correspondingly, may or may not be consistent with the leader's log at this term, and the leader's log might not exist in the first place (if the election at this term does not win).

This uncertainty is a source of workarounds and potential safety bugs, and is generally confusing. This also leads to unnecessary message rejects/drops in periods of leader change. To combat this, let's try to bring some structure, and then improve the codebase.

TODO: add action items.
TODO: add code links.
TODO: keep writing, this is a WIP text.
TODO: convert this to a tech note / design doc.

Background

Every leader in raft has a unique term. During election, a prospective leader (candidate) negotiates a starting point (prevIndex, prevTerm), and then appends a contiguous run of entries at index prevIndex+1 and higher. The set of all entries in the system thus forms a tree. Raft implements a bunch of safety rules that govern how the (prevIndex, prevTerm) point is chosen (it has to be in the subtree of the "committed" point in this tree), and when entries are considered committed (a quorum of replicas are in the same subtree).

Consider an example of such a "tree" representing the log evolution:

	term
	▲
	│
	│               ┌→5 5 5
	│               │
	│           ┌→4 4
	│           │
	│     ┌→3 3 3 3 3 3 3 3 3 3 3 3
	│     │
	│     │ ┌→2 2 2 2 2 2
	│     │ │
	│   1 1 1 1 1 1
	└──────────────────────────────────────────► index

Every node in the raft group is at some "point" in this tree. A node's log is the unique path in this tree from (0, 0) to entry (term, index) at the "tip" of the log.

The "leader log" is the path from (0, 0) to the latest entry that this leader appended. For the picture above, here are the "leader logs" corresponding to the leader term:

	t1: 1 1 1 1 1 1
	t2: 1 1 1 2 2 2 2 2 2
	t3: 1 1 3 3 3 3 3 3 3 3 3 3 3 3
	t4: 1 1 3 3 3 4 4
	t5: 1 1 3 3 3 4 4 5 5 5

Note that:

  • Two leaders can observe different entries at the same index. For example, the 3rd entry in t1 has term 1, while the leader t4 observes an entry with term 3.
  • The same entry can be observed by multiple leaders. For example, the 4th entry is the same in t3, t4, and t5 logs.

We have now established that there are two ways to identify entries in the system:

  1. The canonical unique ID is (term, index), where term identifies the leader who created this entry.
  2. The alternative non-unique ID is (term, index) where term identifies the leader who observes this entry in its log.

Different parts of raft codebase use one of these two ways, and there is no clear distinction between these 2 semantics. Some code uses just index and implicitly assumes the term part to be either the original entry term, or the leader log term. Such implicit assumptions complicate understanding, and may lead to correctness bugs. We may need some type safety to communicate these semantics in code.

Notably, the raft.Term field by default has no straightforward relation to term in (1) and (2). In the Leader state, it can be used with either definition; in the Candidate state, raft.Term is a number unrelated to any entries in the log; in the Follower state, raft.Term first has no relation to the log, but eventually becomes in sync with the log (when the first append message from the leader succeeds) and can be used with (2).

This suggests that raftLog data structure (which is agnostic to the node state and election) should track a more straightforward Term that can be used in coordinate system (2) unconditionally. A sensible approach is to say that raftLog.Term is the term of the leader whose append we accepted last. This is equivalent to behaviour of Paxos acceptors in phase one - the acceptor keeps track of the term for the latest accepted proposal. In raft world, the raftLog is the equivalent of a Paxos acceptor.

Currently, the raft.Term field plays two roles:

  1. The term for which an election is underway or already completed.
  2. The upper bound for the last accepted append term. I.e. raft.Term >= raftLog.Term at all times.

(2) makes it safe to use raft.Term as a substitute for the raftLog.Term for some safety checks. For example: the raftLog must reject appends coming from leaders < raftLog.Term. However, we currently reject terms < raft.Term, which is a superset of < raftLog.Term. We may reject appends unnecessarily (when raft.Term > raftLog.Term), but we will never accept an append incorrectly, so this is safe.

TODO: keep writing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant