Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.5] etcdserver: call the OnPreCommitUnsafe in unsafeCommit #14733

Merged
merged 1 commit into from Nov 14, 2022

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented Nov 11, 2022

Backport #14730 to 3.5.

unsafeCommit is called by both (*batchTxBuffered) commit and (*backend) defrag. When users perform the defragmentation operation, etcd doesn't update the consistent index. If etcd crashes(e.g. panicking) in the process for whatever reason, then etcd replays the WAL entries starting from the latest snapshot, accordingly it may re-apply entries which might have already been applied, eventually the revision isn't consistent with other members.

Refer to discussion in #14685

Signed-off-by: Benjamin Wang wachao@vmware.com

cc @mitake @ptabor @serathius @spzala

`unsafeCommit` is called by both `(*batchTxBuffered) commit` and
`(*backend) defrag`. When users perform the defragmentation
operation, etcd doesn't update the consistent index. If etcd
crashes(e.g. panicking) in the process for whatever reason, then
etcd replays the WAL entries starting from the latest snapshot,
accordingly it may re-apply entries which might have already been
applied, eventually the revision isn't consistent with other members.

Refer to discussion in etcd-io#14685

Signed-off-by: Benjamin Wang <wachao@vmware.com>
@serathius
Copy link
Member

This is pretty low level change, how sure we are that no other things are affected by this change? I think we should validate it before we merge the PR.

My suggestion, we should add more gofailpoints into the code and use linearizability tests to validate the fix.

@ahrtr Can you propose more failpoint location that would thoroughly test backend transation commit?

@ahrtr
Copy link
Member Author

ahrtr commented Nov 11, 2022

@ahrtr Can you propose more failpoint location that would thoroughly test backend transation commit?

Sure. Will think about it in separate discussion session.

@ahrtr
Copy link
Member Author

ahrtr commented Nov 11, 2022

This is pretty low level change, how sure we are that no other things are affected by this change? I

This is a simple & safe change to me. I do not see any risk. Which part you do not have confidence? There might be other issues which could be discovered by other failpoints, but this fix is straightforward and safe to me.

We already two important fixes, this one and the auth one. I think we need to release 3.5.6 soon. Of course, if you want to do more test, It's OK to me. But we shouldn't wait until all failpoints test are done.

@serathius
Copy link
Member

This is a simple & safe change to me.

This is what worrying to me. It means that our rudiments were wrong, the fundamentals that we built a large part of code. This means that it might have far reaching impact as the change influences a large part of the codebase.

I just want us to be careful and consider what this change could be also impacting. Thus the idea to add more failpoints.

@ahrtr
Copy link
Member Author

ahrtr commented Nov 11, 2022

It means that our rudiments were wrong, the fundamentals that we built a large part of code. This means that it might have far reaching impact as the change influences a large part of the codebase.

I do not get your point. What's rudiments were wrong? This is just an corner case we missed previously.

@serathius
Copy link
Member

I mean that we should take similar steps as we did in #13885. That time we introduced validation by checking stacktrace during runtime. I would like to double check if we can roll out similar validation. For this issue I would like to propose adding more gofailpoints in processes touching backend like defrag.

@ahrtr
Copy link
Member Author

ahrtr commented Nov 14, 2022

For this issue I would like to propose adding more gofailpoints in processes touching backend like defrag.

I agree that we can think about adding more failpoints, but it should be discussed and tracked in separate session (e.g. #14735 ). So all discussion in this PR should be focusing on what's the impact this PR might cause.

This is pretty low level change, how sure we are that no other things are affected by this change?

The unsafeCommit is only called in two places (see below). I just move the code of calling preCommitHood from commit to unsafeCommit, I don't see any risk here. Please let me know if you see a risk.

I would like to double check if we can roll out similar validation.

I am afraid there is no similar validation this time. The defragment failpoints successfully discovered this issue, and it's good. But anyway, I added two other related failpoints in #14746.

@ahrtr ahrtr merged commit 5f387e6 into etcd-io:release-3.5 Nov 14, 2022
@serathius serathius mentioned this pull request Nov 14, 2022
22 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants