
RFC: Follower replication #98

Open
wants to merge 9 commits into master

Conversation

LintianShi

A design of follower replication

LintianShi added 6 commits July 19, 2022 21:30

### Preparation

Every peer must have a field indicating which AZ it belongs to. This field does not have to be persistent. As mentioned in the comment, TiKV knows which AZ the store is located in. We need to design the interface so that the `RawNode` in raft-rs can get the AZ information from TiKV. Maybe we can initialize the Peer with the AZ information, and then the Peer initializes the `RawNode`.
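A rough sketch of this initialization flow is shown below, using stub types rather than the real raft-rs `Config`/`RawNode`; the `az` field is a proposed addition for illustration, not an existing raft-rs API.

```rust
use std::collections::HashMap;

/// Stub standing in for the raft-rs `Config`; the `az` field is a proposed
/// addition and does not exist in raft-rs today. It is not persisted.
struct RaftConfig {
    id: u64,
    az: Option<String>,
}

/// Stub standing in for `raft::RawNode`, which would keep the AZ it was
/// configured with.
struct RawNode {
    cfg: RaftConfig,
}

impl RawNode {
    fn new(cfg: RaftConfig) -> RawNode {
        RawNode { cfg }
    }
}

/// TiKV-side peer: it is created with the AZ of its store and passes the AZ
/// down when it builds its `RawNode`.
struct Peer {
    raft_group: RawNode,
}

impl Peer {
    fn new(peer_id: u64, store_az: &str) -> Peer {
        let cfg = RaftConfig {
            id: peer_id,
            az: Some(store_az.to_string()),
        };
        Peer {
            raft_group: RawNode::new(cfg),
        }
    }
}

fn main() {
    // TiKV already knows which AZ the store is in, e.g. from its server labels.
    let labels: HashMap<String, String> =
        [("zone".to_string(), "az-1".to_string())].into_iter().collect();
    let az = labels.get("zone").map(String::as_str).unwrap_or("unknown");
    let peer = Peer::new(1, az);
    println!("peer initialized with AZ {:?}", peer.raft_group.cfg.az);
}
```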
Member
@Connor1996 Connor1996 Aug 1, 2022

We need to design the interface so that the `RawNode` in raft-rs can get the AZ information from TiKV. Maybe we can initialize the Peer with the AZ information, and then the Peer initializes the `RawNode`.

This should be settled in the design.

Author

I will take it into consideration.

LintianShi added 2 commits August 10, 2022 10:58
@@ -0,0 +1,248 @@
# Follower Replication for Write-Flow

<!-- - RFC PR: https://github.com/tikv/rfcs/pull/0000
Member

Should be updated.

We design a group broadcast scheme, which was proposed in this [rfc](https://github.com/tikv/rfcs/blob/362865abe36b5771d525310b449c07a57da7ef85/text/2019-11-13-follower-replication.md).

The main idea is that the leader only sends `MsgBroadcast` to the agent of a certain AZ. `MsgBroadcast` contains the log entries that the leader replicates to the agent, and several ranges of log that the leader replicates to other followers and learners.
Once the agent receives `MsgBroadcast`, it appends the log entries in the message and assembles `MsgAppend` according to the ranges in `MsgBroadcast` from its own log. Then the agent sends `MsgAppend` to the other followers/learners in the same AZ. Thus, the leader avoids sending `MsgAppend` across AZs.
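A rough sketch of the agent-side handling under this scheme follows; all message and field names below, such as `MsgBroadcast::forwards`, are placeholders for illustration, not the final raft-rs types.

```rust
/// Simplified log entry.
#[derive(Clone)]
struct Entry {
    index: u64,
    term: u64,
    data: Vec<u8>,
}

/// One forwarding request inside `MsgBroadcast`: replicate the log range
/// [low, high] to peer `to` in the agent's AZ.
struct Forward {
    to: u64,
    low: u64,
    high: u64,
}

/// Placeholder broadcast message from the leader to the AZ agent.
struct MsgBroadcast {
    entries: Vec<Entry>,    // entries for the agent itself
    forwards: Vec<Forward>, // ranges to re-send as MsgAppend within the AZ
}

/// Placeholder append message assembled by the agent from its own log.
struct MsgAppend {
    to: u64,
    entries: Vec<Entry>,
}

struct Agent {
    log: Vec<Entry>, // entry with index i stored at position i - 1, for simplicity
}

impl Agent {
    /// Append the leader's entries, then assemble `MsgAppend` for the other
    /// followers/learners in the same AZ from the agent's own log.
    fn handle_broadcast(&mut self, msg: MsgBroadcast) -> Vec<MsgAppend> {
        for e in msg.entries {
            // A real implementation must check term/index and truncate
            // conflicting entries instead of blindly appending.
            self.log.push(e);
        }
        msg.forwards
            .iter()
            .map(|f| MsgAppend {
                to: f.to,
                entries: self.log[(f.low as usize - 1)..=(f.high as usize - 1)].to_vec(),
            })
            .collect()
    }
}

fn main() {
    let mut agent = Agent { log: Vec::new() };
    let out = agent.handle_broadcast(MsgBroadcast {
        entries: vec![Entry { index: 1, term: 1, data: b"put k v".to_vec() }],
        forwards: vec![Forward { to: 4, low: 1, high: 1 }],
    });
    // One MsgAppend assembled for peer 4 from the agent's freshly appended log.
    assert_eq!(out.len(), 1);
    assert_eq!(out[0].to, 4);
}
```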
Member

Paragraphs should be separated by one blank line.

We design a group broadcast scheme, which was proposed in this [rfc](https://github.com/tikv/rfcs/blob/362865abe36b5771d525310b449c07a57da7ef85/text/2019-11-13-follower-replication.md).

The main idea is that the leader only sends `MsgBroadcast` to the agent of a certain AZ. `MsgBroadcast` contains the log entries that the leader replicates to the agent, and several ranges of log that the leader replicates to other followers and learners.
Once the agent receives `MsgBroadcast`, it appends the log entries in the message and assembles `MsgAppend` according to the ranges in `MsgBroadcast` from its own log. Then the agent sends `MsgAppend` to the other followers/learners in the same AZ. Thus, the leader avoids sending `MsgAppend` across AZs.
Member

This may conflict with async log fetch.


In the TiKV architecture, a TiKV server contains multiple peers that belong to different Raft groups, so TiKV needs to propagate the zone information to each peer.

We add an extra field `peer_zone` to the peer: a hashmap that records store_id -> AZ. Every time the zone information stored in TiKV is updated, a peer message `UpdateZoneInfo` is generated and broadcast to all peers on this TiKV server. When a peer receives `UpdateZoneInfo`, it updates its `peer_zone`.
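A minimal sketch of this bookkeeping is shown below, with stub types standing in for the real raftstore ones; `UpdateZoneInfo` and `peer_zone` are the proposed additions.

```rust
use std::collections::HashMap;

/// Stub peer message, following the PeerMsg pattern in raftstore; the
/// `UpdateZoneInfo` variant is the proposed addition.
enum PeerMsg {
    UpdateZoneInfo(HashMap<u64, String>), // store_id -> AZ
}

struct Peer {
    /// Proposed extra field: which AZ every store is in.
    peer_zone: HashMap<u64, String>,
}

impl Peer {
    fn on_peer_msg(&mut self, msg: PeerMsg) {
        match msg {
            // The store-level zone info is broadcast to every peer on this
            // TiKV server; the peer simply replaces its local copy.
            PeerMsg::UpdateZoneInfo(zones) => self.peer_zone = zones,
        }
    }

    fn az_of(&self, store_id: u64) -> Option<&str> {
        self.peer_zone.get(&store_id).map(String::as_str)
    }
}

fn main() {
    let mut peer = Peer { peer_zone: HashMap::new() };
    let zones = [(1, "az-1".to_string()), (2, "az-2".to_string())]
        .into_iter()
        .collect();
    peer.on_peer_msg(PeerMsg::UpdateZoneInfo(zones));
    assert_eq!(peer.az_of(2), Some("az-2"));
}
```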
Member

Why not put it in PollContext?

Author

I think the zone information only flows from the TiKV server to the peers, and the peers never modify it, so it is read-only for the peers. If the zone information is updated, Node can use a PeerMsg to update the zone information stored in each Peer. PollContext is more like a layer where peers interact with the TiKV server. If the zone information is stored in PollContext and shared by multiple peers, we need to consider concurrency conflicts.

However, putting the zone information in PollContext uses less memory.

Author

I thought about it again. I think the zone information should be put in PollContext. Otherwise, we have to handle migrating the zone information when splitting or merging regions.
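For illustration, a rough sketch of what keeping it in PollContext could look like (the field name below is hypothetical): peers only read the shared map, and the store-level updater swaps in a whole new map, so split/merge needs no per-region migration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Sketch: zone info kept once in PollContext instead of in every Peer.
struct PollContext {
    store_zones: Arc<RwLock<HashMap<u64, String>>>, // store_id -> AZ
}

impl PollContext {
    /// Peers only read the map; writes come from the store-level updater.
    fn az_of(&self, store_id: u64) -> Option<String> {
        self.store_zones.read().unwrap().get(&store_id).cloned()
    }
}

fn main() {
    let zones = Arc::new(RwLock::new(HashMap::new()));
    zones.write().unwrap().insert(2u64, "az-2".to_string());
    let ctx = PollContext { store_zones: zones.clone() };
    assert_eq!(ctx.az_of(2).as_deref(), Some("az-2"));
}
```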

@BusyJay
Member

BusyJay commented Aug 12, 2022

/cc @Fullstop000
