Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: don't panic in readonly serializable txn #14149

Merged
merged 1 commit into from Aug 13, 2022

Conversation

lavacat
Copy link

@lavacat lavacat commented Jun 23, 2022

Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout. This will trigger panic.

Solution: provide different error handler for readonly serializable txn.

fixes #14110

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@lavacat lavacat force-pushed the main-txn-panic branch 2 times, most recently from 75f7a81 to e5fc1de Compare June 24, 2022 00:12
@ahrtr
Copy link
Member

ahrtr commented Jun 24, 2022

Please sign off the commit.

@@ -36,7 +36,7 @@ const (
maxSendBytes = math.MaxInt32
)

func Server(s *etcdserver.EtcdServer, tls *tls.Config, interceptor grpc.UnaryServerInterceptor, gopts ...grpc.ServerOption) *grpc.Server {
func Server(s *etcdserver.EtcdServer, tls *tls.Config, interceptors []grpc.UnaryServerInterceptor, gopts ...grpc.ServerOption) *grpc.Server {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated change?

Copy link
Author

@lavacat lavacat Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to make test work. These interceptors are used only in integrations tests. There is already grpc_testing.GrpcRecorder. I've added another one. See changes in intergration/cluster.go

@@ -311,7 +322,7 @@ func newTxnResp(rt *pb.TxnRequest, txnPath []bool) (txnResp *pb.TxnResponse, txn
return txnResp, txnCount
}

func applyTxn(ctx context.Context, lg *zap.Logger, kv mvcc.KV, lessor lease.Lessor, txnWrite mvcc.TxnWrite, rt *pb.TxnRequest, txnPath []bool, tresp *pb.TxnResponse) (txns int) {
func applyTxn(ctx context.Context, lg *zap.Logger, kv mvcc.KV, lessor lease.Lessor, txnWrite mvcc.TxnWrite, rt *pb.TxnRequest, txnPath []bool, tresp *pb.TxnResponse, errHander CriticalErrHandler) (txns int, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern is it's hacking the implementation too much. The clearer way is to resolve it out of the box, we can just add a panic-recover, just as I mentioned in issuecomment-1154738363

if err != nil && isWrite {
// When txn with write operations starts it has to be successful
// We don't have a way to recover state in case of write failure
lg.Panic("unexpected error during txnWrite", zap.Error(err))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptabor @ahrtr I've tried passing an error handler (see outdated version of this PR) but it seams a bit cumbersome. @ahrtr suggested recover. But it's only going to be for readonly txn not all requests, also seams cumbersome.

How about just panic when txn has write ops?

For ref see, original PR when panic was introduced https://github.com/etcd-io/etcd/pull/3534/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R161

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have small preference towards apply workflow being extremely strict and deterministic and any unexpected error causing premature exit from function (even in RO) code is something that should lead to investigation and fixing.

But I don't see any exact problem with the proposed approach.

@codecov-commenter
Copy link

codecov-commenter commented Jun 27, 2022

Codecov Report

Merging #14149 (99f9260) into main (4977877) will decrease coverage by 0.15%.
The diff coverage is 63.15%.

@@            Coverage Diff             @@
##             main   #14149      +/-   ##
==========================================
- Coverage   75.42%   75.26%   -0.16%     
==========================================
  Files         456      456              
  Lines       36919    36926       +7     
==========================================
- Hits        27846    27793      -53     
- Misses       7340     7394      +54     
- Partials     1733     1739       +6     
Flag Coverage Δ
all 75.26% <63.15%> (-0.16%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/etcdserver/txn/txn.go 90.11% <53.33%> (-0.56%) ⬇️
server/etcdserver/api/v3rpc/grpc.go 91.42% <100.00%> (ø)
server/storage/mvcc/kvstore_txn.go 79.12% <100.00%> (ø)
client/v3/namespace/watch.go 87.87% <0.00%> (-6.07%) ⬇️
client/v3/leasing/cache.go 87.22% <0.00%> (-4.45%) ⬇️
client/v3/experimental/recipes/double_barrier.go 71.87% <0.00%> (-3.13%) ⬇️
client/v3/retry_interceptor.go 64.54% <0.00%> (-2.73%) ⬇️
client/v3/concurrency/election.go 79.68% <0.00%> (-2.35%) ⬇️
raft/storage.go 91.66% <0.00%> (-1.86%) ⬇️
server/etcdserver/api/v3rpc/watch.go 86.91% <0.00%> (-1.68%) ⬇️
... and 11 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

server/etcdserver/txn/txn.go Outdated Show resolved Hide resolved
@@ -1952,3 +1953,77 @@ func waitForRestart(t *testing.T, kvc pb.KVClient) {
t.Fatalf("timed out waiting for restart: %v", err)
}
}

func TestV3ReadonlyTxnCancelledContext(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered to add a common test which is executed on both e2e and integration environments? And also use the new interface ?

cc @serathius

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we can only add integration test here because we need to inject an interceptor, but we do not support that in command line (and also e2e).

But please consider to use the new interface in https://github.com/etcd-io/etcd/blob/main/tests/framework/integration.go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem, new interface is not available in v3.5, so using it now would make backport much harder.

clus := integration.NewCluster(t, &integration.ClusterConfig{
Size: 1,
// Context should be cancelled on the second check that happens inside rangeKeys
GrpcInterceptors: []grpc.UnaryServerInterceptor{injectMockContextForTxn(newMockContext(2))},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2 here is a little tricky.

I suggest to simplify the test case for better readability:

  1. Remove the Put (line 1968), and let's test the readonly transaction only;
  2. always return a closed channel in (ctx *mockCtx) Done();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that counting calls on context to trigger reaction is not maintainable.
Thinking about alternatives, how about intercepting zap logs, using zap.hooks:

https://stackoverflow.com/questions/64803634/how-to-add-a-hook-into-a-zap-logger

Still not perfect, but allows to e.g. disable context when specific transaction is debug-logged.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahrtr Put is there to prevent short circuit in storeTxnRead.Range. We need some data to trigger case <-ctx.Done():
Always returning closed channel doesn't work because it fails early in EtcdServer.linearizableReadNotify and never gets to txn.Txn

@ptabor how do you trigger context cancelled inside storeTxnRead.Range if you don't let the first check in EtcdServer.linearizableReadNotify to pass? Maybe I'm missing something, but how will intercepting zap logs help? As far as I can tell only failure inside storeTxnRead.Range triggers panic, so that's what we want to model in the test.

Another approach I see is to add test only on txn.Txn level and not replicate panic as integration test. Maybe not ideal, but integrations tests don't have to cover every failure scenario.

txn := &pb.TxnRequest{Success: []*pb.RequestOp{txnget}}
_, err = kvc.Txn(context.TODO(), txn)
if err == nil || !strings.Contains(err.Error(), "range context cancelled: mock context error") {
t.Fatal(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the err has nothing to do with applyTxn. Note that it's returned by the API instead of the applying workflow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapped err in applyTxn

tests/integration/v3_grpc_test.go Outdated Show resolved Hide resolved
tests/integration/v3_grpc_test.go Outdated Show resolved Hide resolved
tests/integration/v3_grpc_test.go Outdated Show resolved Hide resolved
clus := integration.NewCluster(t, &integration.ClusterConfig{
Size: 1,
// Context should be cancelled on the second check that happens inside rangeKeys
GrpcInterceptors: []grpc.UnaryServerInterceptor{injectMockContextForTxn(newMockContext(2))},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that counting calls on context to trigger reaction is not maintainable.
Thinking about alternatives, how about intercepting zap logs, using zap.hooks:

https://stackoverflow.com/questions/64803634/how-to-add-a-hook-into-a-zap-logger

Still not perfect, but allows to e.g. disable context when specific transaction is debug-logged.

@lavacat
Copy link
Author

lavacat commented Jul 26, 2022

@ahrtr @ptabor updated PR with alternative version of the test that uses reflection to check the name of the caller function and cancel context for rangeKeys.
Added more comments about test setup.
Addressed other small comments.

The only thing that isn't done is migration to new client interface (comment). If this approach looks good, I'll try the new client for main and keep this version for 3.5


txnget := &pb.RequestOp{Request: &pb.RequestOp_RequestRange{RequestRange: &pb.RangeRequest{Key: []byte("abc")}}}
txn := &pb.TxnRequest{Success: []*pb.RequestOp{txnget}}
_, err = kvc.Txn(context.TODO(), txn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little concerned about the readability of the test case. Can you try the following proposal?

  1. pass a context with a deadline, i.e. 5s;
  2. register an unary interceptor, and just sleep a couple of seconds (5+1=6s) in the interceptor;
  3. The kvc.Txn should return an error with something like context timeout

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also thinking probably we can just add a unit test to accurately test the different behaviors:

  1. For readonly transaction, it should NOT panic due to context timeout;
  2. For non-readonly trancation, it should panic if the context timeout

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deadline approach doesn't work, because it's early terminated here (tried it out).

Unit tests are fine with me. I've mentioned this earlier.

Another approach I see is to add test only on txn.Txn level and not replicate panic as integration test. Maybe not ideal, but integrations tests don't have to cover every failure scenario.

Will work on PR to convert to unit tests and remove this test.

@lavacat lavacat force-pushed the main-txn-panic branch 2 times, most recently from f41515b to 2a504ef Compare August 3, 2022 07:22
@lavacat
Copy link
Author

lavacat commented Aug 5, 2022

@ahrtr this is ready for review with unit tests instead of integration

func cleanupStore(s mvcc.KV, b backend.Backend, path string) {
s.Close()
b.Close()
os.Remove(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be unnecessary, because the golang build-in test lib can remove the directory returned by t.TempDir() automatically.


func cleanupStore(s mvcc.KV, b backend.Backend, path string) {
s.Close()
b.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use betesting.Close(t, b) instead.

Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout.
This will trigger panic inside applyTxn

Solution: Only panic for transactions with write operations

fixes etcd-io#14110

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you @lavacat

cc @spzala @ptabor @serathius to double confirm. This is the core applying workflow, there must be at least two approvals.

Copy link
Member

@spzala spzala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
Thanks @lavacat @ahrtr

@ahrtr
Copy link
Member

ahrtr commented Aug 13, 2022

Thanks @spzala

Great work! @lavacat

@ahrtr ahrtr merged commit fff5d00 into etcd-io:main Aug 13, 2022
lavacat pushed a commit to lavacat/etcd that referenced this pull request Sep 1, 2022
Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout.
This will trigger panic inside applyTxn

Solution: Only panic for transactions with write operations

fixes etcd-io#14110
main PR etcd-io#14149
lavacat pushed a commit to lavacat/etcd that referenced this pull request Sep 1, 2022
Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout.
This will trigger panic inside applyTxn

Solution: Only panic for transactions with write operations

fixes etcd-io#14110
main PR etcd-io#14149

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/etcd that referenced this pull request Oct 7, 2022
Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout.
This will trigger panic inside applyTxn

Solution: Only panic for transactions with write operations

fixes etcd-io#14110
main PR etcd-io#14149

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
tjungblu pushed a commit to tjungblu/etcd that referenced this pull request Jul 26, 2023
Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout.
This will trigger panic inside applyTxn

Solution: Only panic for transactions with write operations

fixes etcd-io#14110
main PR etcd-io#14149

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

panic: unexpected error during txn
6 participants