
panic: unexpected error during txn #14110

Closed · Tracked by #14138
aphyr opened this issue Jun 13, 2022 · 17 comments · Fixed by #14149

aphyr commented Jun 13, 2022

What happened?

In a Jepsen test run of five etcd 3.5.3 nodes, with process pauses (sending processes SIGSTOP and SIGCONT), compaction (performed via the admin API), and defragmentation (via etcdctl), one etcd process crashed with the following error message:

{"level":"panic","ts":"2022-06-13T14:11:27.870Z","caller":"etcdserver/apply.go:638","msg":"unexpected error during txn","error":"context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"info","ts":"2022-06-13T14:11:27.875Z","caller":"traceutil/trace.go:171","msg":"trace[430686560] transaction","detail":"{read_only:true; }","duration":"5.770188852s","start":"2022-06-13T14:11:22.105Z","end":"2022-06-13T14:11:27.875Z","steps":["trace[430686560] 'agreement among raft nodes before linearized reading'  (duration: 5.763577986s)"],"step_count":1}
panic: unexpected error during txn

goroutine 10896 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00011ec00, 0xc0035542c0, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00048c280, 0x1220287, 0x1b, 0xc0035542c0, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn(0xc00000d1e8, 0x13a1cb8, 0xc002fe7cb0, 0x13ab818, 0xc003cfe550, 0xc003d07030, 0xc003b2ab00, 0x1, 0x1, 0xc003cec730, ...)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638 +0x930
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn(0xc00000d1e8, 0x13a1cb8, 0xc002fe7cb0, 0xc003d07030, 0x2, 0x48, 0x493c26, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477 +0x38a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3()
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181 +0x70
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc000442a00, 0x13a1cb8, 0xc002fe7cb0, 0xc0030314a8, 0xc0030314d8, 0xc003e1c500, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659 +0xf8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc000442a00, 0x13a1cb8, 0xc002fe7cb0, 0xc003d07030, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182 +0x40e
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc000324fc0, 0x13a1cb8, 0xc002fe7b60, 0xc003d07030, 0xc003d07030, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc000325000, 0x13a1cb8, 0xc002fe7b60, 0xc003d07030, 0xc000325000, 0x13a8a68, 0xc0002b7e60)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x14, 0xc003b89d60, 0x2, 0xc0000f2818)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a718a8, 0xc003d746e0, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x1ab2d20, 0xc000437200, 0x8bda0c, 0x13881c0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a7bae0, 0x493c26, 0x62a7458a, 0x6450608, 0x3bbfb531619)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71 +0x15c
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x0, 0x203000, 0x13b1c70, 0xc003e10710)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a7bb00, 0x0, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc00021c800, 0x0, 0xc0000f2b30, 0x40e318)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a718a8, 0xc0000f2ba0, 0x573fa6, 0x1158b40, 0xc002fe7b60)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x10fa540, 0xc000325000, 0x13a1cb8, 0xc002fe7b60, 0xc003e18000, 0xc0003098c0, 0x13a1cb8, 0xc002fe7b60, 0xc003a9f640, 0x9)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200, 0xc000309a10, 0x1a6e268, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc002a90770, 0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:932 +0x1fd

This is the first time this has cropped up, so I'm not sure whether pauses, compaction, and defrag are all necessary to trigger the bug. The only nemesis operations on this particular node before the crash appear to be pause & resume, but I imagine it's possible that other nodes performing compaction or defragmentation might have caused them to exchange (or fail to exchange!) messages with this node, which caused it to crash.

Here's the full logs and data files from the cluster: 20220613T140950.000Z.zip.

What did you expect to happen?

I expected etcd not to crash.

How can we reproduce it (as minimally and precisely as possible)?

Check out jepsen-etcd 84d7d54698c387ed467dd5dfb8ca4bebc2ee46d5, and with a five-node cluster, run:

lein run test-all --username admin --nodes-file /home/admin/nodes --concurrency 4n --rate 500 --time-limit 300 --workload append --nemesis pause,admin --test-count 20

I'm not sure how long this will take to reproduce yet--still collecting evidence.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380/ --listen-client-urls http://192.168.122.101:2379/ --advertise-client-urls http://192.168.122.101:2379/ --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380/ --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check

Etcd debug information (please run the commands below; feel free to obfuscate the IP address or FQDN in the output)

These node IDs aren't going to match, because I collected them from a different test run, but it should give you a general idea of the cluster topology:

admin@ip-172-31-9-53:/opt/etcd$ ./etcdctl --endpoints http://172.31.9.53:2379  member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                    NAME                    |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6f0318b69646b10a | started |  ec2-54-234-45-183.compute-1.amazonaws.com | http://172.31.13.117:2380 | http://172.31.13.117:2379 |      false |
| 851bb95a115a0651 | started |   ec2-54-90-123-74.compute-1.amazonaws.com |   http://172.31.9.53:2380 |   http://172.31.9.53:2379 |      false |
| 8ce655fee64e8888 | started |    ec2-3-88-142-40.compute-1.amazonaws.com |  http://172.31.7.132:2380 |  http://172.31.7.132:2379 |      false |
| 9a228ee3e4899e1e | started | ec2-54-167-119-185.compute-1.amazonaws.com |   http://172.31.9.60:2380 |   http://172.31.9.60:2379 |      false |
| dfed32f696db96d2 | started |  ec2-54-204-125-79.compute-1.amazonaws.com | http://172.31.12.135:2380 | http://172.31.12.135:2379 |      false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+

admin@ip-172-31-9-53:/opt/etcd$ {"level":"warn","ts":"2022-06-13T14:48:00.119Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004fa1c0/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (context deadline exceeded)
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://n1:2379 | 851bb95a115a0651 |   3.5.3 |  164 kB |     false |      false |         2 |       2302 |               2299 |        |
| http://n2:2379 | 6f0318b69646b10a |   3.5.3 |  164 kB |     false |      false |         3 |       2306 |               2306 |        |
| http://n5:2379 | 8ce655fee64e8888 |   3.5.3 |  168 kB |     false |      false |         3 |       4966 |               4966 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
{"level":"warn","ts":"2022-06-13T14:47:38.147Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002fee00/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (cont

Note that since these tests involve process pauses, it's tough to reach in and grab a coherent cluster view--pretty much every attempt I made involved timeouts.

Relevant log output

No response

aphyr added the type/bug label Jun 13, 2022
aphyr (Author) commented Jun 13, 2022

I'm not sure if this is precisely the same bug or a different one, but I've managed to reproduce a crash with a slightly different error message purely from process pauses--no compaction or defrag operations involved. Here are the full logs.

{"level":"panic","ts":"2022-06-13T17:45:54.601Z","caller":"etcdserver/apply.go:638","msg":"unexpected error during txn","error":"context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"warn","ts":"2022-06-13T17:45:54.603Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"6.263437345s","expected-duration":"100ms","prefix":"read-only txn ","request":"compare:<> success:<request_range:<key:\"2783\" >  request_range:<key:\"2785\" > > failure:<>","response":""}
{"level":"info","ts":"2022-06-13T17:45:54.603Z","caller":"traceutil/trace.go:171","msg":"trace[2082798263] transaction","detail":"{read_only:true; }","duration":"6.26467016s","start":"2022-06-13T17:45:48.338Z","end":"2022-06-13T17:45:54.603Z","steps":["trace[2082798263] 'compare'  (duration: 6.261489853s)"],"step_count":1}
panic: unexpected error during txn

goroutine 32669 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0030dda40, 0xc0031dbd40, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00040c3c0, 0x1220287, 0x1b, 0xc0031dbd40, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0x13ab818, 0xc0031be650, 0xc002dc5f80, 0xc0028095f0, 0x1, 0x1, 0xc0032e0f50, ...)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638 +0x930
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x2, 0x48, 0x493c26, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477 +0x38a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3()
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181 +0x70
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc0003094a8, 0xc0003094d8, 0xc0031f2d70, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659 +0xf8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182 +0x40e
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc0003ef300, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc002dc5f80, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc0003ef400, 0x13a8a68, 0xc0002913e0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x14, 0xc0031f2b90, 0x2, 0xc000340818)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc00046da20, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1ab2d20, 0xc00061e600, 0x8bda0c, 0x13881c0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc003228080, 0x493c26, 0x62a777cc, 0x14323d4a, 0xf6fa2f301f7)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71 +0x15c
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x0, 0x203000, 0x13b1c70, 0xc0022fbc20)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc0032280a0, 0x0, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1a82cc0, 0x0, 0xc000340b30, 0x40e318)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc000340ba0, 0x573fa6, 0x1158b40, 0xc003c4e3f0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x10fa540, 0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc0021f2000, 0xc0002f15f0, 0x13a1cb8, 0xc003c4e3f0, 0xc003a187c8, 0x14)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0xc0002f1740, 0x1a6e268, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc003ca7820, 0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:932 +0x1fd	

lavacat commented Jun 14, 2022

I think this is a good catch and a relevant bug.
TL;DR: an etcd transaction shouldn't panic because of client context errors (timeouts, cancellations).

Here is how this happened:

  1. Etcd has panicked inside transactions since very early days. https://github.com/etcd-io/etcd/pull/3534/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R161
  2. Then we added an optimization to bypass raft for serializable transactions. https://github.com/etcd-io/etcd/pull/4956/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R91
  3. Then it was limited to only readOnly transactions. https://github.com/etcd-io/etcd/pull/6543/files#diff-583048be54c04f9555a60191ec8767757fc7f0c477d5bc7f2191af8a3c1bf920R173
  4. Next we started passing a context and using it to record tracing info. https://github.com/etcd-io/etcd/pull/11491/files#diff-583048be54c04f9555a60191ec8767757fc7f0c477d5bc7f2191af8a3c1bf920R172
  5. Lastly we added context checking inside the range implementation. https://github.com/etcd-io/etcd/pull/12459/files#diff-c4842260ff90edd130afd1d3ac19cb782c4a59c98483070a7a7f817e4d2f13fdR151

I think the minimal fix should be to use an empty context for tracing in txn (this is a quick change). Maybe we should have a dedicated type for the tracing context to prevent confusion.

But ideally we should match the behavior of readonly operations wrapped in serializable transactions with that of normal readonly operations. Correct me if I'm wrong here.
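
For context, here is roughly the kind of request that can end up on this path, sketched with clientv3 (hypothetical endpoint, keys, and timings; not the exact Jepsen workload): a read-only Txn whose context expires while the server is still applying it.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Short deadline: the context can expire while the server is still waiting
	// for the linearized read or applying the txn (in the Jepsen runs this
	// happened because peer processes were paused).
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Read-only txn: every branch contains only ranges, so the server evaluates
	// it via doSerialize with the request context instead of the raft apply loop.
	_, err = cli.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("some-key"), "=", "v")).
		Then(clientv3.OpGet("a"), clientv3.OpGet("b")).
		Else(clientv3.OpGet("c")).
		Commit()
	fmt.Println(err) // the client just sees a context error; the server should not panic
}
```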

ahrtr (Member) commented Jun 14, 2022

This is a valid issue.

Based on the current implementation, once a transaction starts, it must finish without any error; otherwise it would end up with partially committed data. The workaround (the current behavior) is to crash the process to prevent a partial commit from happening.

Probably we should use a BoltDB transaction to perform the etcd transaction, but that would need a big refactoring.

lavacat commented Jun 14, 2022

@ahrtr I think this issue only affects readonly serializable transactions that don't go through raftRequestOnce. For a normal transaction this won't happen because the context isn't passed down to the applier.

ahrtr (Member) commented Jun 14, 2022

Yes, this case happens for readonly (not necessarily serializable) transactions. But applyTxn is a generic implementation, and it doesn't know it's handling a readonly transaction.

Note that we can't use an empty context (context.TODO()) either; otherwise the readonly transaction will never time out.

We probably have two options, one short-term and one long-term:

  1. short-term fix: add a panic-recover to catch the panic for the readonly transaction case (see the sketch below). We need to evaluate the impact.
  2. long-term fix: refactor the existing transaction design/implementation to use a BoltDB transaction and roll it back on error. As I mentioned in the previous comment, this may need a big effort and accordingly has a big impact.
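
For the short-term idea, a minimal, self-contained sketch of the recover pattern (hypothetical wrapper and names, not the actual etcd code), assuming the read-only txn path panics when the request context is cancelled:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// recoverReadOnlyTxn is a hypothetical wrapper around a read-only txn apply:
// if the applier panics because the request context was cancelled, convert
// the panic back into an ordinary error instead of crashing the server.
func recoverReadOnlyTxn(ctx context.Context, apply func(context.Context) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("read-only txn aborted: %v (%w)", r, ctx.Err())
		}
	}()
	return apply(ctx)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client context being cancelled mid-request

	err := recoverReadOnlyTxn(ctx, func(ctx context.Context) error {
		if ctx.Err() != nil {
			panic("unexpected error during txn") // roughly what apply.go:638 does today
		}
		return nil
	})
	// The process stays up and the caller just sees an error.
	fmt.Println(err, errors.Is(err, context.Canceled))
}
```

The open question raised above is whether recovering here could also hide other, genuinely unexpected apply errors; that is the impact to evaluate.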

ahrtr (Member) commented Jun 14, 2022

Anyone, please feel free to deliver a PR for the short-term fix for now.

ptabor (Contributor) commented Jun 14, 2022

> Yes, this case happens for readonly (not necessarily serializable) transactions. But applyTxn is a generic implementation, and it doesn't know it's handling a readonly transaction.
>
> Note that we can't use an empty context (context.TODO()) either; otherwise the readonly transaction will never time out.
>
> We probably have two options, one short-term and one long-term:
>
>   1. short-term fix: add a panic-recover to catch the panic for the readonly transaction case. We need to evaluate the impact.

That sounds right to me, as long as it's truly RO code paths. I think we could do this as a gRPC interceptor.

>   2. long-term fix: refactor the existing transaction design/implementation to use a BoltDB transaction and roll it back on error. As I mentioned in the previous comment, this may need a big effort and accordingly has a big impact.

We do batching of the bolt transactions to minimize the number of b-tree nodes that get rewritten at the top of the tree. If we did true 1:1 transactions it would have a negative performance impact.

As Context is an interface (https://pkg.go.dev/context#Context), I would propose implementing a wrapper around it that overrides the

  • Done() <-chan struct{}
  • Err() error

methods to mask the underlying context's methods (see the sketch below). Apply should continue independently from the request's context... The only thing it can check periodically is the server's context.
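
For illustration, a minimal runnable sketch of such a wrapper (hypothetical type name, not etcd code): it delegates Value and Deadline to the request context but masks Done and Err, so the apply path never observes the client's cancellation.

```go
package main

import (
	"context"
	"fmt"
)

// uncancelableCtx wraps a request context but masks cancellation: Done and Err
// report "never cancelled", while Value and Deadline still come from the
// embedded context, so tracing data attached to the request stays reachable.
type uncancelableCtx struct{ context.Context }

func (uncancelableCtx) Done() <-chan struct{} { return nil }
func (uncancelableCtx) Err() error            { return nil }

func main() {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client going away mid-request

	applyCtx := uncancelableCtx{reqCtx}
	fmt.Println(reqCtx.Err())   // context.Canceled
	fmt.Println(applyCtx.Err()) // <nil>: the apply path would not observe the cancellation
}
```

As a side note, newer Go releases (1.21+) ship context.WithoutCancel, which does essentially this.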

lavacat commented Jun 14, 2022

Here is a different approach: create a separate apply function for readonly txn. It's basically a partial copy of the current applyTxn for range, but without recursion, because isTxnReadonly will return false for a nested readonly txn (maybe that's a bug).

I think this approach might be a bit more flexible for future changes to readonly txn, and there's no need to run recover.

If this looks good, I can add some tests.

#14113

mcginne commented Jun 20, 2022

We are also hitting this fairly regularly (several times per day) on both v3.5.2 and v3.5.4.

ahrtr (Member) commented Jun 20, 2022

> We are also hitting this fairly regularly (several times per day) on both v3.5.2 and v3.5.4.

Would you mind providing the complete log and the endpoint info (see the command below)?

etcdctl endpoint status -w json --cluster

Could you also explain how you reproduce this issue? Did you see it in a testing environment or a production environment?

What's the volume of requests per second?

mcginne commented Jun 21, 2022

@ahrtr endpoint info:

[{"Endpoint":"https://etcd-501-armada-stage5-south-sbhnpvg72b.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":5440534516375719315,"revision":5729930287,"raft_term":12923},"version":"3.5.4","dbSize":1037332480,"leader":8376786025734177010,"raftIndex":4073315546,"raftTerm":12923,"raftAppliedIndex":4073315546,"dbSizeInUse":963850240}},{"Endpoint":"https://etcd-501-armada-stage5-south-5zflxmmcz2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":8376786025734177010,"revision":5729930289,"raft_term":12923},"version":"3.5.4","dbSize":1037058048,"leader":8376786025734177010,"raftIndex":4073315551,"raftTerm":12923,"raftAppliedIndex":4073315551,"dbSizeInUse":963796992}},{"Endpoint":"https://etcd-501-armada-stage5-south-7n25f4hms2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11064341971799022067,"revision":5729930301,"raft_term":12923},"version":"3.5.4","dbSize":1040326656,"leader":8376786025734177010,"raftIndex":4073315567,"raftTerm":12923,"raftAppliedIndex":4073315567,"dbSizeInUse":963985408}},{"Endpoint":"https://etcd-501-armada-stage5-south-prfhrxqx5k.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11815188087669886612,"revision":5729930309,"raft_term":12923},"version":"3.5.4","dbSize":1037373440,"leader":8376786025734177010,"raftIndex":4073315585,"raftTerm":12923,"raftAppliedIndex":4073315585,"dbSizeInUse":964038656}},{"Endpoint":"https://etcd-501-armada-stage5-south-m5zlm8l8l6.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":12614509488604193654,"revision":5729930313,"raft_term":12923},"version":"3.5.4","dbSize":1040461824,"leader":8376786025734177010,"raftIndex":4073315590,"raftTerm":12923,"raftAppliedIndex":4073315590,"dbSizeInUse":964067328}}]

Log ->
etcd_panic.log.zip

The request rate is around 1000 req/sec in total; it's a very varied workload with many microservices making different requests against etcd. I'm afraid I don't know exactly what triggers it to occur. The workload runs constantly and should be fairly similar in nature at all times; I don't see any big spikes in requests or response times when we see the panic. We have a 5-member cluster and it seems to affect all 5 members.

This is our performance/scale test environment, so we are holding off moving to prod with v3.5 until this is resolved.

aphyr (Author) commented Jun 21, 2022

If it helps narrow things down, the workload I'm using that crashes etcd is kv reads, writes, and transactions over an exponentially-distributed pool of keys. Roughly 500-1000 ops/sec as well, though I haven't looked to see if it crashes with fewer.

ahrtr (Member) commented Jun 21, 2022

Thanks @aphyr, we already know the root cause in your case. I just want to double-check whether @mcginne ran into the same issue and whether it was caused by the same reason. I will take a look at the log provided by @mcginne today or tomorrow.

ahrtr (Member) commented Jun 22, 2022

Thanks @mcginne. Confirmed that it's the same issue; we will fix it soon in both main (3.6) and 3.5. Hopefully the fix will be included in 3.5.5.

lavacat commented Jun 22, 2022

FYI, I'll post a PR for main with the approach outlined here.
If somebody else is working on the fix, please say so here, so there is no duplication of effort.

I'll try to get this in tomorrow, June 22.

mcginne commented Jun 22, 2022

Thank you @ahrtr and @lavacat for confirming and the work on the fix - much appreciated.

lavacat pushed a commit to lavacat/etcd that referenced this issue Jun 23, 2022
Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout. This will trigger panic.

Solution: provide different error handler for readonly serializable txn.

fixes etcd-io#14110
lavacat pushed a commit to lavacat/etcd that referenced this issue Jun 27, 2022
Problem: We pass grpc context down to applier in readonly serializable txn.
This context can be cancelled for example due to timeout.
This will trigger panic inside applyTxn

Solution: Only panic for transactions with write operations

fixes etcd-io#14110

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
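
A minimal sketch of the idea in that commit message (hypothetical helper and names; the real change lives inside the applier): keep the panic only when the txn contains write operations, and return the error normally for read-only txns.

```go
package main

import (
	"context"
	"fmt"

	"go.uber.org/zap"
)

// checkTxnApplyErr is a hypothetical helper showing the shape of the fix:
// a failed read-only txn returns an error, while a failed txn with writes
// still panics, since aborting mid-apply could leave partially committed data.
func checkTxnApplyErr(lg *zap.Logger, err error, hasWrites bool) error {
	if err == nil {
		return nil
	}
	if hasWrites {
		lg.Panic("unexpected error during txn with writes", zap.Error(err))
	}
	lg.Warn("error during read-only txn", zap.Error(err))
	return err
}

func main() {
	lg := zap.NewExample()
	err := checkTxnApplyErr(lg, context.Canceled, false /* read-only */)
	fmt.Println(err) // context canceled, and the process keeps running
}
```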
lavacat pushed further commits to lavacat/etcd referencing this issue (the same fix, including a backport from main) on Jun 27, Jun 29, and Jul 1, 2022.
lavacat commented Jul 6, 2022

@serathius @ptabor reminder to review the PRs
main #14149
port to 3.5 #14178

lavacat pushed further commits to lavacat/etcd referencing this issue (with the same fix) on Jul 26, Aug 3, Aug 9, and Sep 1, 2022.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/etcd that referenced this issue Oct 7, 2022.
tjungblu pushed a commit to tjungblu/etcd that referenced this issue Jul 26, 2023.