[3.5] etcdserver: call the OnPreCommitUnsafe in unsafeCommit #14733

ahrtr · 2022-11-11T09:37:53Z

Backport #14730 to 3.5.

`unsafeCommit` is called by both `(*batchTxBuffered) commit` and `(*backend) defrag`. When users perform a defragmentation, etcd doesn't update the consistent index. If etcd crashes (e.g. panics) during that process for whatever reason, it replays the WAL entries starting from the latest snapshot, so it may re-apply entries that have already been applied. As a result, its revision eventually becomes inconsistent with the other members.

Refer to discussion in #14685

Signed-off-by: Benjamin Wang <wachao@vmware.com>

cc @mitake @ptabor @serathius @spzala

serathius · 2022-11-11T10:26:36Z

This is a pretty low-level change. How sure are we that nothing else is affected by it? I think we should validate it before we merge the PR.

My suggestion: we should add more gofailpoints to the code and use linearizability tests to validate the fix.

@ahrtr Can you propose more failpoint locations that would thoroughly test backend transaction commit?

ahrtr · 2022-11-11T10:29:51Z

> @ahrtr Can you propose more failpoint locations that would thoroughly test backend transaction commit?

Sure. I will think about it in a separate discussion.

ahrtr · 2022-11-11T10:38:23Z

> This is a pretty low-level change. How sure are we that nothing else is affected by it?

This is a simple & safe change to me. I do not see any risk. Which part do you not have confidence in? There might be other issues that other failpoints could discover, but this fix is straightforward and safe to me.

We already have two important fixes, this one and the auth one. I think we need to release 3.5.6 soon. Of course, if you want to do more testing, that's OK with me. But we shouldn't wait until all the failpoint tests are done.

serathius · 2022-11-11T11:25:16Z

> This is a simple & safe change to me.

This is what worries me. It means that our rudiments were wrong, the fundamentals on which we built a large part of the code. It might therefore have a far-reaching impact, as the change influences a large part of the codebase.

I just want us to be careful and consider what else this change could be impacting. Thus the idea to add more failpoints.

ahrtr · 2022-11-11T11:41:16Z

> It means that our rudiments were wrong, the fundamentals on which we built a large part of the code. It might therefore have a far-reaching impact, as the change influences a large part of the codebase.

I do not get your point. Which rudiments were wrong? This is just a corner case we missed previously.

serathius · 2022-11-13T12:00:10Z

I mean that we should take similar steps as we did in #13885. That time we introduced validation by checking the stack trace at runtime. I would like to double-check whether we can roll out similar validation. For this issue I would like to propose adding more gofailpoints in processes touching the backend, like defrag.

ahrtr · 2022-11-14T06:26:28Z

> For this issue I would like to propose adding more gofailpoints in processes touching the backend, like defrag.

I agree that we can think about adding more failpoints, but that should be discussed and tracked separately (e.g. #14735). The discussion in this PR should focus on what impact this PR might have.

> This is a pretty low-level change. How sure are we that nothing else is affected by it?

`unsafeCommit` is only called in two places (see below). I just moved the call to the pre-commit hook from `commit` to `unsafeCommit`; I don't see any risk here. Please let me know if you see a risk.

- In `commit`;
- In `defrag`.

> I would like to double-check whether we can roll out similar validation.

I am afraid there is no similar validation this time. The defragment failpoints successfully discovered this issue, which is good. In any case, I added two other related failpoints in #14746.

serathius mentioned this pull request Nov 11, 2022

Add more gofailpoints to validate critical parts of the code #14735

Open

8 tasks

serathius approved these changes Nov 14, 2022

ahrtr merged commit 5f387e6 into etcd-io:release-3.5 Nov 14, 2022

serathius mentioned this pull request Nov 14, 2022

Release v3.5.6 #14750

Closed

22 tasks

ahrtr mentioned this pull request Nov 14, 2022

ETCD-3.5.5 : panic: failed to recover v3 backend from snapshot #14749

Closed
