
raftstore: gc abnormal snapshots and destroy peer if failed to apply snapshots. #16992

Open · wants to merge 11 commits into master

Conversation

LykxSassinator
Contributor

@LykxSassinator LykxSassinator commented May 10, 2024

What is changed and how it works?

Issue Number: Close #15292

What's Changed:

Previously, there was a pending task to address the scenario where TiKV panics if applying a snapshot fails due to abnormal conditions such as IO errors or other unexpected issues.

This pull request resolves the issue by adding an extra field `tombstone: bool` to `SnapshotApplied`, indicating whether the failure was caused by an abnormal snapshot.
Additionally, the abnormal peer sends `ExtraMessageType::MsgGcPeerRequest` to the leader of this region, which triggers a new ConfChange with `RemoveNode` to GC the associated peer. The peer is then destroyed, so the cluster will later add a new peer by sending a fresh snapshot to the affected node.

Replace `SnapshotApplied` with `SnapshotApplied { peer_id: u64, tombstone: bool }`. If `tombstone == true`, the corresponding peer will be automatically GC-ed.
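
For illustration, a minimal Rust sketch (simplified names, not the exact TiKV definitions) of the new variant and how a handler might react to it:

// Simplified sketch of the message change described above.
enum SignificantMsg {
    // Previously a unit variant `SnapshotApplied`; now it reports which peer
    // applied the snapshot and whether the apply failed so badly that the
    // peer must be tombstoned.
    SnapshotApplied { peer_id: u64, tombstone: bool },
}

fn on_significant_msg(msg: SignificantMsg, local_peer_id: u64, is_leader: bool) {
    match msg {
        SignificantMsg::SnapshotApplied { peer_id, tombstone } => {
            if tombstone && peer_id == local_peer_id && !is_leader {
                // Ask the leader to GC this peer via a ConfChange(RemoveNode),
                // so it can later be re-created from a fresh snapshot.
                // (Stand-in for the PR's `send_tombstone_peer_msg`.)
                request_peer_gc(peer_id);
            }
        }
    }
}

fn request_peer_gc(_peer_id: u64) { /* send ExtraMessageType::MsgGcPeerRequest to the leader */ }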

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None.

Contributor

ti-chi-bot bot commented May 10, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

Contributor

ti-chi-bot bot commented May 10, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@glorv
Contributor

glorv commented May 14, 2024

I have 2 questions about this PR:

  1. Is it possible to directly retry applying the snapshot if the failure is caused by an unexpected (maybe IO-layer) error?
  2. Why do we need to tombstone the peer after applying the snapshot fails? Why not just switch the peer status back to normal so the leader starts a new snapshot automatically?

@LykxSassinator
Contributor Author

I have 2 questions about this PR:

  1. Is it possible to directly retry applying the snapshot if the failure is caused by an unexpected (maybe IO-layer) error?
  2. Why do we need to tombstone the peer after applying the snapshot fails? Why not just switch the peer status back to normal so the leader starts a new snapshot automatically?

  1. Nope, it's caused by loading abnormal blocks. The logs in tikv panic repeatedly after this tikv recover from io hang #15292 and tikv panic repeatedly with "[region 16697056] 19604003 applying snapshot failed" after down this tikv for 20mins and recover #16958 show that the given snapshot contains abnormal file blocks, which causes the apply to fail.
  2. For safety. Tombstoning the peer and destroying it ensures that this TiKV node does not retain any leftover data or metadata for this peer.

@LykxSassinator
Contributor Author

LykxSassinator commented May 14, 2024

By the way, as for point 2, I agree with what you mentioned. But for safety, this PR takes the current implementation.

Why not just switch the peer status back to normal so the leader starts a new snapshot automatically?

I'll give point 2 a try and run some extra tests to find the more appropriate approach.

@overvenus
Member

I don't think we can directly mark it as Tombstone or Normal, because both options violate the current raft state machine protocol.

  • Tombstone indicates that the peer has been fully removed, including its Raft membership and data. However, in this case, the peer remains a valid member.
  • Normal means that the peer has all data up to its last commit index. But, in this case, it does not, as its last commit index has been updated to the snapshot index.

There are two ways to fix the panic:

  1. Have PD remove this peer via a confchange, which I believe is the simplest solution (a rough sketch follows below).
  2. Introduce a new RPC to instruct the leader to resend the snapshot, which would require changing a lot of code.
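
For context on option 1, a rough sketch (illustrative only; it assumes the protobuf-style setters of `raft::eraftpb` and is not code from this PR) of the confchange that removes a peer:

use raft::eraftpb::{ConfChange, ConfChangeType};

// Build a ConfChange that removes the broken peer; the leader would propose
// it, with PD (or the GC message added by this PR) driving that step.
fn build_remove_node(peer_id: u64) -> ConfChange {
    let mut cc = ConfChange::default();
    cc.set_change_type(ConfChangeType::RemoveNode);
    cc.set_node_id(peer_id);
    cc
}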

@glorv
Contributor

glorv commented May 14, 2024

Introduce a new RPC to instruct the leader to resend the snapshot, which would require changing a lot of code.

Why is this extra RPC needed? On the leader side, the peer's state is switched back to normal after the snapshot has been sent. On the follower side, when applying the snapshot fails, it is also doable to restore the raft state to what it was before this snapshot. So, if I understand correctly, the leader should start another snapshot without any extra operation?

@overvenus
Member

overvenus commented May 14, 2024

On the follower side, when applying the snapshot fails, it is also doable to restore the raft state to what it was before this snapshot.

Do you mean persisting the previous state so it can be restored even after restarting TiKV?

That's doable (without introducing a new RPC), but it does add extra complexity to raftstore (and we'll need to review every code path related to snapshot handling).

@glorv
Contributor

glorv commented May 14, 2024

Do you mean persisting the previous state so it can be restored even after restarting TiKV?

Yes. But I think just updating the peer's state to its previous status should be simpler than tombstoning it, or than letting PD remove this peer and add it back later.

@LykxSassinator
Contributor Author

LykxSassinator commented May 15, 2024

Do you mean persisting the previous state so it can be restored even after restarting TiKV?

Yes. But I think just updating the peer's state to its previous status should be simpler than tombstoning it, or than letting PD remove this peer and add it back later.

Correct me if I'm wrong, but I think removing the peer and letting PD add a fresh one is valid. The peer that panics on loading the abnormal snapshot has already advanced its persisted_index and committed_index, so rolling its persisted state back to a previous one would violate the rules of the raft protocol. Right?

@glorv
Contributor

glorv commented May 15, 2024

Correct me if I'm wrong, but I think removing the peer and letting PD add a fresh one is valid. The peer that panics on loading the abnormal snapshot has already advanced its persisted_index and committed_index, so rolling its persisted state back to a previous one would violate the rules of the raft protocol. Right?

You are right: the raft state is reset to the snapshot's state when the snapshot message is received, so it is not easy to revert to the previous state. And reverting the raft state may cause other side effects.

Thus, if it's doable, I prefer the current idea of just destroying the current peer and letting a newer raft message trigger its re-creation automatically.

@glorv
Contributor

glorv commented May 15, 2024

@LykxSassinator from

if let Some(local_peer_id) = find_peer(region, self.ctx.store_id()).map(|r| r.get_id()) {
    if to_peer_id <= local_peer_id {
        self.ctx
            .raft_metrics
            .message_dropped
            .region_tombstone_peer
            .inc();
        info!(
            "tombstone peer receives a stale message, local_peer_id >= to_peer_id in msg";
            "region_id" => region_id,
            "local_peer_id" => local_peer_id,
            "to_peer_id" => to_peer_id,
            "msg_type" => ?msg_type
        );
        return Ok(CheckMsgStatus::DropMsg);
    }
}

It seems that after tombstoning the peer, all subsequent raft messages for this region will be dropped, so the peer won't be recovered automatically.

@LykxSassinator
Contributor Author

LykxSassinator commented May 15, 2024

@LykxSassinator from

if let Some(local_peer_id) = find_peer(region, self.ctx.store_id()).map(|r| r.get_id()) {
    if to_peer_id <= local_peer_id {
        self.ctx
            .raft_metrics
            .message_dropped
            .region_tombstone_peer
            .inc();
        info!(
            "tombstone peer receives a stale message, local_peer_id >= to_peer_id in msg";
            "region_id" => region_id,
            "local_peer_id" => local_peer_id,
            "to_peer_id" => to_peer_id,
            "msg_type" => ?msg_type
        );
        return Ok(CheckMsgStatus::DropMsg);
    }
}

It seems that after tombstoning the peer, all subsequent raft messages for this region will be dropped, so the peer won't be recovered automatically.

Yes, thank you for the reminder. It seems that we may need to utilize ConfChange for this task. I will investigate further, and this pull request will be put on hold until the necessary changes are implemented and ready for review.

@ti-chi-bot ti-chi-bot bot added size/L and removed size/M labels May 20, 2024
@LykxSassinator LykxSassinator marked this pull request as ready for review May 22, 2024 13:22
@LykxSassinator
Contributor Author

/cc @glorv @overvenus PTAL, this PR is ready now.

status.swap(JOB_STATUS_FAILED, Ordering::SeqCst);
SNAP_COUNTER.apply.fail.inc();
// As the snapshot failed, it should be cleared and the related peer should be
Member


Why not handle it when the status is JOB_STATUS_FAILED in check_applying_snap?

Contributor Author


IMO, check_applying_snap should be responsible solely for checking and returning the latest applying status, which is updated by the RegionRunner thread. The RegionRunner thread is the first to obtain the applying status.

tombstone && self.fsm.peer.peer_id() == peer_id && !self.fsm.peer.is_leader();
if apply_snap_failed {
    // Send ConfChange to the leader to make the region tombstone the peer.
    self.fsm.peer.send_tombstone_peer_msg(self.ctx);
Contributor

@glorv glorv May 23, 2024


What if send_tombstone_peer_msg fails, or the message is dropped due to an epoch change or leadership change? If the current peer is not destroyed directly, we need some mechanism to mark it as abnormal so it is no longer used for reads/writes.

Contributor Author


Makes sense. A lost message could leave an abnormal peer lingering on this node.

Let me dive deeper into it.
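
For what it's worth, one possible mechanism (purely illustrative, not from this PR) is to remember locally that the peer is awaiting GC, refuse to serve it, and re-send the request periodically:

// Illustrative sketch only: track a "pending tombstone" flag on the peer,
// block reads/writes while it is set, and re-send the GC request on every
// tick so a single dropped message cannot leave the peer stranded.
struct PeerGcState {
    pending_tombstone: bool,
}

impl PeerGcState {
    fn on_apply_snapshot_failed(&mut self) {
        self.pending_tombstone = true;
    }

    fn allow_read_write(&self) -> bool {
        !self.pending_tombstone
    }

    fn on_tick(&self, send_gc_request: impl Fn()) {
        if self.pending_tombstone {
            // Retry until the leader's ConfChange removes and destroys this peer.
            send_gc_request();
        }
    }
}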

Contributor

ti-chi-bot bot commented May 24, 2024

@LykxSassinator: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-unit-test
Commit: b34a49f
Details: link
Required: true
Rerun command: /test pull-unit-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hbisheng
Contributor

Out of curiosity and just trying to learn, what are the typical scenarios where applying snapshots may fail? In those cases, would it help if we retry a few times? If we destroy the peer and add a new one, it's possible that the new peer may hit the same issue again, right?

@LykxSassinator
Contributor Author

LykxSassinator commented May 24, 2024

Out of curiosity and just trying to learn, what are the typical scenarios where applying snapshots may fail? In those cases, would it help if we retry a few times? If we destroy the peer and add a new one, it's possible that the new peer may hit the same issue again, right?

Yep. One typical case is that loading a snapshot hits IO errors. The issue can then be attributed to physical disk errors or system errors, which cause the TiKV panic.
As for whether retrying succeeds, it depends on the root cause of the IO error. If it's a physical disk error, retries may not succeed. But if it's a system error or a bug, retries have a higher chance of success.
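
As a generic sketch of the "retry a few times" idea (not TiKV code; the function, retry count, and backoff are made up for illustration), a bounded retry around the apply step could look like this:

use std::{io, thread, time::Duration};

// Generic sketch: transient system errors may succeed on retry, while a
// broken disk or corrupted snapshot data will keep failing and still needs
// the destroy-and-resend path discussed in this PR.
fn apply_with_retry<F>(mut apply: F, max_retries: u32) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    let mut attempt = 0;
    loop {
        match apply() {
            Ok(()) => return Ok(()),
            Err(err) if attempt < max_retries => {
                attempt += 1;
                eprintln!("apply snapshot attempt {attempt} failed: {err}, retrying");
                // Simple linear backoff before the next try.
                thread::sleep(Duration::from_millis(100 * u64::from(attempt)));
            }
            // Out of retries: surface the error to the caller.
            Err(err) => return Err(err),
        }
    }
}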


Successfully merging this pull request may close these issues.

tikv panic repeatedly after this tikv recover from io hang
5 participants