New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: don't panic in readonly serializable txn #14149
Conversation
75f7a81
to
e5fc1de
Compare
Please sign off the commit. |
server/etcdserver/api/v3rpc/grpc.go
Outdated
@@ -36,7 +36,7 @@ const ( | |||
maxSendBytes = math.MaxInt32 | |||
) | |||
|
|||
func Server(s *etcdserver.EtcdServer, tls *tls.Config, interceptor grpc.UnaryServerInterceptor, gopts ...grpc.ServerOption) *grpc.Server { | |||
func Server(s *etcdserver.EtcdServer, tls *tls.Config, interceptors []grpc.UnaryServerInterceptor, gopts ...grpc.ServerOption) *grpc.Server { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to make test work. These interceptors are used only in integrations tests. There is already grpc_testing.GrpcRecorder. I've added another one. See changes in intergration/cluster.go
server/etcdserver/txn/txn.go
Outdated
@@ -311,7 +322,7 @@ func newTxnResp(rt *pb.TxnRequest, txnPath []bool) (txnResp *pb.TxnResponse, txn | |||
return txnResp, txnCount | |||
} | |||
|
|||
func applyTxn(ctx context.Context, lg *zap.Logger, kv mvcc.KV, lessor lease.Lessor, txnWrite mvcc.TxnWrite, rt *pb.TxnRequest, txnPath []bool, tresp *pb.TxnResponse) (txns int) { | |||
func applyTxn(ctx context.Context, lg *zap.Logger, kv mvcc.KV, lessor lease.Lessor, txnWrite mvcc.TxnWrite, rt *pb.TxnRequest, txnPath []bool, tresp *pb.TxnResponse, errHander CriticalErrHandler) (txns int, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern is it's hacking the implementation too much. The clearer way is to resolve it out of the box, we can just add a panic-recover, just as I mentioned in issuecomment-1154738363
server/etcdserver/txn/txn.go
Outdated
if err != nil && isWrite { | ||
// When txn with write operations starts it has to be successful | ||
// We don't have a way to recover state in case of write failure | ||
lg.Panic("unexpected error during txnWrite", zap.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ptabor @ahrtr I've tried passing an error handler (see outdated version of this PR) but it seams a bit cumbersome. @ahrtr suggested recover. But it's only going to be for readonly txn not all requests, also seams cumbersome.
How about just panic when txn has write ops?
For ref see, original PR when panic was introduced https://github.com/etcd-io/etcd/pull/3534/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R161
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have small preference towards apply workflow being extremely strict and deterministic and any unexpected error causing premature exit from function (even in RO) code is something that should lead to investigation and fixing.
But I don't see any exact problem with the proposed approach.
Codecov Report
@@ Coverage Diff @@
## main #14149 +/- ##
==========================================
- Coverage 75.42% 75.26% -0.16%
==========================================
Files 456 456
Lines 36919 36926 +7
==========================================
- Hits 27846 27793 -53
- Misses 7340 7394 +54
- Partials 1733 1739 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. |
tests/integration/v3_grpc_test.go
Outdated
@@ -1952,3 +1953,77 @@ func waitForRestart(t *testing.T, kvc pb.KVClient) { | |||
t.Fatalf("timed out waiting for restart: %v", err) | |||
} | |||
} | |||
|
|||
func TestV3ReadonlyTxnCancelledContext(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered to add a common test which is executed on both e2e and integration environments? And also use the new interface ?
cc @serathius
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we can only add integration test here because we need to inject an interceptor, but we do not support that in command line (and also e2e).
But please consider to use the new interface in https://github.com/etcd-io/etcd/blob/main/tests/framework/integration.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem, new interface is not available in v3.5, so using it now would make backport much harder.
tests/integration/v3_grpc_test.go
Outdated
clus := integration.NewCluster(t, &integration.ClusterConfig{ | ||
Size: 1, | ||
// Context should be cancelled on the second check that happens inside rangeKeys | ||
GrpcInterceptors: []grpc.UnaryServerInterceptor{injectMockContextForTxn(newMockContext(2))}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 2
here is a little tricky.
I suggest to simplify the test case for better readability:
- Remove the Put (line 1968), and let's test the
readonly transaction
only; - always return a closed channel in
(ctx *mockCtx) Done();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that counting calls on context to trigger reaction is not maintainable.
Thinking about alternatives, how about intercepting zap logs, using zap.hooks:
https://stackoverflow.com/questions/64803634/how-to-add-a-hook-into-a-zap-logger
Still not perfect, but allows to e.g. disable context when specific transaction is debug-logged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahrtr Put is there to prevent short circuit in storeTxnRead.Range. We need some data to trigger case <-ctx.Done():
Always returning closed channel doesn't work because it fails early in EtcdServer.linearizableReadNotify
and never gets to txn.Txn
@ptabor how do you trigger context cancelled
inside storeTxnRead.Range
if you don't let the first check in EtcdServer.linearizableReadNotify
to pass? Maybe I'm missing something, but how will intercepting zap logs help? As far as I can tell only failure inside storeTxnRead.Range triggers panic, so that's what we want to model in the test.
Another approach I see is to add test only on txn.Txn level and not replicate panic as integration test. Maybe not ideal, but integrations tests don't have to cover every failure scenario.
tests/integration/v3_grpc_test.go
Outdated
txn := &pb.TxnRequest{Success: []*pb.RequestOp{txnget}} | ||
_, err = kvc.Txn(context.TODO(), txn) | ||
if err == nil || !strings.Contains(err.Error(), "range context cancelled: mock context error") { | ||
t.Fatal(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the err
has nothing to do with applyTxn
. Note that it's returned by the API instead of the applying workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrapped err in applyTxn
tests/integration/v3_grpc_test.go
Outdated
clus := integration.NewCluster(t, &integration.ClusterConfig{ | ||
Size: 1, | ||
// Context should be cancelled on the second check that happens inside rangeKeys | ||
GrpcInterceptors: []grpc.UnaryServerInterceptor{injectMockContextForTxn(newMockContext(2))}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that counting calls on context to trigger reaction is not maintainable.
Thinking about alternatives, how about intercepting zap logs, using zap.hooks:
https://stackoverflow.com/questions/64803634/how-to-add-a-hook-into-a-zap-logger
Still not perfect, but allows to e.g. disable context when specific transaction is debug-logged.
@ahrtr @ptabor updated PR with alternative version of the test that uses reflection to check the name of the caller function and cancel context for rangeKeys. The only thing that isn't done is migration to new client interface (comment). If this approach looks good, I'll try the new client for main and keep this version for 3.5 |
tests/integration/v3_grpc_test.go
Outdated
|
||
txnget := &pb.RequestOp{Request: &pb.RequestOp_RequestRange{RequestRange: &pb.RangeRequest{Key: []byte("abc")}}} | ||
txn := &pb.TxnRequest{Success: []*pb.RequestOp{txnget}} | ||
_, err = kvc.Txn(context.TODO(), txn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little concerned about the readability of the test case. Can you try the following proposal?
- pass a context with a deadline, i.e. 5s;
- register an unary interceptor, and just sleep a couple of seconds (5+1=6s) in the interceptor;
- The
kvc.Txn
should return an error with something like context timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also thinking probably we can just add a unit test to accurately test the different behaviors:
- For readonly transaction, it should NOT panic due to context timeout;
- For non-readonly trancation, it should panic if the context timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deadline approach doesn't work, because it's early terminated here (tried it out).
Unit tests are fine with me. I've mentioned this earlier.
Another approach I see is to add test only on txn.Txn level and not replicate panic as integration test. Maybe not ideal, but integrations tests don't have to cover every failure scenario.
Will work on PR to convert to unit tests and remove this test.
f41515b
to
2a504ef
Compare
@ahrtr this is ready for review with unit tests instead of integration |
server/etcdserver/txn/txn_test.go
Outdated
func cleanupStore(s mvcc.KV, b backend.Backend, path string) { | ||
s.Close() | ||
b.Close() | ||
os.Remove(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be unnecessary, because the golang build-in test lib can remove the directory returned by t.TempDir()
automatically.
server/etcdserver/txn/txn_test.go
Outdated
|
||
func cleanupStore(s mvcc.KV, b backend.Backend, path string) { | ||
s.Close() | ||
b.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use betesting.Close(t, b) instead.
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you @lavacat
cc @spzala @ptabor @serathius to double confirm. This is the core applying workflow, there must be at least two approvals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout. This will trigger panic.
Solution: provide different error handler for readonly serializable txn.
fixes #14110
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.