panic: unexpected error during txn #14110

aphyr · 2022-06-13T14:48:55Z

What happened?

In a Jepsen test run of five etcd 3.5.3 nodes, with process pauses (sending processes SIGSTOP and SIGCONT), compaction (performed via the admin API), and defragmentation (via etcdctl), one etcd process crashed with the following error message:

{"level":"panic","ts":"2022-06-13T14:11:27.870Z","caller":"etcdserver/apply.go:638","msg":"unexpected error during txn","error":"context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"info","ts":"2022-06-13T14:11:27.875Z","caller":"traceutil/trace.go:171","msg":"trace[430686560] transaction","detail":"{read_only:true; }","duration":"5.770188852s","start":"2022-06-13T14:11:22.105Z","end":"2022-06-13T14:11:27.875Z","steps":["trace[430686560] 'agreement among raft nodes before linearized reading'  (duration: 5.763577986s)"],"step_count":1}
panic: unexpected error during txn

goroutine 10896 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00011ec00, 0xc0035542c0, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00048c280, 0x1220287, 0x1b, 0xc0035542c0, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn(0xc00000d1e8, 0x13a1cb8, 0xc002fe7cb0, 0x13ab818, 0xc003cfe550, 0xc003d07030, 0xc003b2ab00, 0x1, 0x1, 0xc003cec730, ...)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638 +0x930
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn(0xc00000d1e8, 0x13a1cb8, 0xc002fe7cb0, 0xc003d07030, 0x2, 0x48, 0x493c26, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477 +0x38a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3()
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181 +0x70
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc000442a00, 0x13a1cb8, 0xc002fe7cb0, 0xc0030314a8, 0xc0030314d8, 0xc003e1c500, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659 +0xf8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc000442a00, 0x13a1cb8, 0xc002fe7cb0, 0xc003d07030, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182 +0x40e
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc000324fc0, 0x13a1cb8, 0xc002fe7b60, 0xc003d07030, 0xc003d07030, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc000325000, 0x13a1cb8, 0xc002fe7b60, 0xc003d07030, 0xc000325000, 0x13a8a68, 0xc0002b7e60)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x14, 0xc003b89d60, 0x2, 0xc0000f2818)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a718a8, 0xc003d746e0, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x1ab2d20, 0xc000437200, 0x8bda0c, 0x13881c0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a7bae0, 0x493c26, 0x62a7458a, 0x6450608, 0x3bbfb531619)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71 +0x15c
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0x0, 0x203000, 0x13b1c70, 0xc003e10710)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a7bb00, 0x0, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc00021c800, 0x0, 0xc0000f2b30, 0x40e318)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13a1cb8, 0xc002fe7b60, 0x11b1020, 0xc003d07030, 0xc003a7bac0, 0xc003a718a8, 0xc0000f2ba0, 0x573fa6, 0x1158b40, 0xc002fe7b60)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x10fa540, 0xc000325000, 0x13a1cb8, 0xc002fe7b60, 0xc003e18000, 0xc0003098c0, 0x13a1cb8, 0xc002fe7b60, 0xc003a9f640, 0x9)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200, 0xc000309a10, 0x1a6e268, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc002a90770, 0xc0004e7340, 0x13ae2d8, 0xc00379ec00, 0xc003e17200)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:932 +0x1fd

This is the first time this has cropped up, so I'm not sure whether pauses, compaction, and defrag are all necessary to trigger the bug. The only nemesis operations on this particular node before the crash appear to be pause & resume, but I imagine it's possible that other nodes performing compaction or defragmentation might have caused them to exchange (or fail to exchange!) messages with this node which caused it to crash

Here's the full logs and data files from the cluster: 20220613T140950.000Z.zip.

What did you expect to happen?

I expected etcd not to crash.

How can we reproduce it (as minimally and precisely as possible)?

Check out jepsen-etcd 84d7d54698c387ed467dd5dfb8ca4bebc2ee46d5, and with a five-node cluster, run:

lein run test-all --username admin --nodes-file /home/admin/nodes --concurrency 4n --rate 500 --time-limit 300 --workload append --nemesis pause,admin --test-count 20

I'm not sure how long this will take to reproduce yet--still collecting evidence.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380/ --listen-client-urls http://192.168.122.101:2379/ --advertise-client-urls http://192.168.122.101:2379/ --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380/ --initial-cluster n1=[http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380](http://192.168.122.101:2380%2Cn2%3Dhttp//192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380) --snapshot-count 100 --experimental-initial-corrupt-check

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

These node IDs aren't going to match, because I collected them from a different test run, but it should give you a general idea of the cluster topology:

admin@ip-172-31-9-53:/opt/etcd$ ./etcdctl --endpoints http://172.31.9.53:2379  member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                    NAME                    |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6f0318b69646b10a | started |  ec2-54-234-45-183.compute-1.amazonaws.com | http://172.31.13.117:2380 | http://172.31.13.117:2379 |      false |
| 851bb95a115a0651 | started |   ec2-54-90-123-74.compute-1.amazonaws.com |   http://172.31.9.53:2380 |   http://172.31.9.53:2379 |      false |
| 8ce655fee64e8888 | started |    ec2-3-88-142-40.compute-1.amazonaws.com |  http://172.31.7.132:2380 |  http://172.31.7.132:2379 |      false |
| 9a228ee3e4899e1e | started | ec2-54-167-119-185.compute-1.amazonaws.com |   http://172.31.9.60:2380 |   http://172.31.9.60:2379 |      false |
| dfed32f696db96d2 | started |  ec2-54-204-125-79.compute-1.amazonaws.com | http://172.31.12.135:2380 | http://172.31.12.135:2379 |      false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+

admin@ip-172-31-9-53:/opt/etcd$ {"level":"warn","ts":"2022-06-13T14:48:00.119Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004fa1c0/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (context deadline exceeded)
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://n1:2379 | 851bb95a115a0651 |   3.5.3 |  164 kB |     false |      false |         2 |       2302 |               2299 |        |
| http://n2:2379 | 6f0318b69646b10a |   3.5.3 |  164 kB |     false |      false |         3 |       2306 |               2306 |        |
| http://n5:2379 | 8ce655fee64e8888 |   3.5.3 |  168 kB |     false |      false |         3 |       4966 |               4966 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
{"level":"warn","ts":"2022-06-13T14:47:38.147Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002fee00/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (cont

Note that since these tests involve process pauses, it's tough to reach in and grab a coherent cluster view--pretty much every attempt I made involved timeouts.

Relevant log output

No response

The text was updated successfully, but these errors were encountered:

aphyr · 2022-06-13T17:53:06Z

I'm not sure if this is precisely the same bug or a different one, but I've managed to reproduce a crash with a slightly different error message purely from process pauses--no compaction or defrag operations involved. Here's the full logs.

{"level":"panic","ts":"2022-06-13T17:45:54.601Z","caller":"etcdserver/apply.go:638","msg":"unexpected error during txn","error":"context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\t/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934"}
{"level":"warn","ts":"2022-06-13T17:45:54.603Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"6.263437345s","expected-duration":"100ms","prefix":"read-only txn ","request":"compare:<> success:<request_range:<key:\"2783\" >  request_range:<key:\"2785\" > > failure:<>","response":""}
{"level":"info","ts":"2022-06-13T17:45:54.603Z","caller":"traceutil/trace.go:171","msg":"trace[2082798263] transaction","detail":"{read_only:true; }","duration":"6.26467016s","start":"2022-06-13T17:45:48.338Z","end":"2022-06-13T17:45:54.603Z","steps":["trace[2082798263] 'compare'  (duration: 6.261489853s)"],"step_count":1}
panic: unexpected error during txn

goroutine 32669 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0030dda40, 0xc0031dbd40, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc00040c3c0, 0x1220287, 0x1b, 0xc0031dbd40, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).applyTxn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0x13ab818, 0xc0031be650, 0xc002dc5f80, 0xc0028095f0, 0x1, 0x1, 0xc0032e0f50, ...)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:638 +0x930
go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn(0xc0003dc9d8, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x2, 0x48, 0x493c26, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/apply.go:477 +0x38a
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3()
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:181 +0x70
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc0003094a8, 0xc0003094d8, 0xc0031f2d70, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:659 +0xf8
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc000464300, 0x13a1cb8, 0xc003c4e540, 0xc002dc5f80, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/v3_server.go:182 +0x40e
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc0003ef300, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc002dc5f80, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc002dc5f80, 0xc0003ef400, 0x13a8a68, 0xc0002913e0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x14, 0xc0031f2b90, 0x2, 0xc000340818)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc00046da20, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1ab2d20, 0xc00061e600, 0x8bda0c, 0x13881c0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc003228080, 0x493c26, 0x62a777cc, 0x14323d4a, 0xf6fa2f301f7)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:71 +0x15c
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x0, 0x203000, 0x13b1c70, 0xc0022fbc20)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc0032280a0, 0x0, 0x0, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0x1a82cc0, 0x0, 0xc000340b30, 0x40e318)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13a1cb8, 0xc003c4e3f0, 0x11b1020, 0xc002dc5f80, 0xc003228060, 0xc000627308, 0xc000340ba0, 0x573fa6, 0x1158b40, 0xc003c4e3f0)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x10fa540, 0xc0003ef400, 0x13a1cb8, 0xc003c4e3f0, 0xc0021f2000, 0xc0002f15f0, 0x13a1cb8, 0xc003c4e3f0, 0xc003a187c8, 0x14)
	/go/src/go.etcd.io/etcd/release/etcd/api/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0xc0002f1740, 0x1a6e268, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1286 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:1609 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc003ca7820, 0xc000517340, 0x13ae2d8, 0xc00345d200, 0xc002529320)
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:934 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.38.0/server.go:932 +0x1fd

lavacat · 2022-06-14T00:25:39Z

I think this is a good catch and a relevant bug.
TLTR: etcd transaction shouldn't panic based on client context errors (timeouts, cancelations)

Here is how this happened:

Etcd had panic in transactions from very early days. https://github.com/etcd-io/etcd/pull/3534/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R161
Then we've added an optimization to bypass raft for serializable transactions. https://github.com/etcd-io/etcd/pull/4956/files#diff-26b2effd11719aa79af125ed4e17234ef08970839d4f83fb4d802e04614c3cd4R91
Then it was limited to only readOnly transactions https://github.com/etcd-io/etcd/pull/6543/files#diff-583048be54c04f9555a60191ec8767757fc7f0c477d5bc7f2191af8a3c1bf920R173
Next we've started passing context and using it for recording tracing info. https://github.com/etcd-io/etcd/pull/11491/files#diff-583048be54c04f9555a60191ec8767757fc7f0c477d5bc7f2191af8a3c1bf920R172
Lastly we've added context checking inside range implementation https://github.com/etcd-io/etcd/pull/12459/files#diff-c4842260ff90edd130afd1d3ac19cb782c4a59c98483070a7a7f817e4d2f13fdR151

I think minimal fix should be to use empty context for tracing in txn (this is a quick change). Maybe we should have special class for tracing context to prevent confusion.

But Ideally we should match behavior of readonly operations wrapped in serializable transactions with normal readonly operations. Correct me if I'm wrong here.

ahrtr · 2022-06-14T03:05:51Z

This is a valid issue.

Based on current implementation, once a transaction starts, it must finishes without any error. Otherwise, it will end up with partially committed data. The workaround (the current implementation) is to crash the process to prevent it (partially commit) from happening.

Probably we should use the BoltDB transaction to perform the transaction. But it would need big refactoring.

lavacat · 2022-06-14T04:52:21Z

@ahrtr I think this issue is only for readonly serializable transactions that aren't going through raftRequestOnce. For normal transaction this won't happen because context isn't passed down to applier

ahrtr · 2022-06-14T05:39:34Z

Yes, this case happens for readonly (not necessary to be serializable) transaction. But the applyTxn is a generic implementation, and it doesn't know it's coming from readonly transaction.

Note that we can't use empty context (context.TODO()) either, otherwise the readonly transaction will not timeout.

Probably we have two options, one for short-term, and one for long-term,

short-term fix: add a panic-recover to catch the panic for the readonly transaction case. We need to evaluate the impact.
long-term fix: Refactor the existing transaction design/implemention to use BoltDB transaction, and rollback the transaction on error. Just as I mentioned in previous comment, this may need big effort, and accordingly has big impact.

ahrtr · 2022-06-14T07:52:22Z

Please anyone feel free to deliver a PR for the short-term fix for now.

ptabor · 2022-06-14T14:09:42Z

Yes, this case happens for readonly (not necessary to be serializable) transaction. But the applyTxn is a generic implementation, and it doesn't know it's coming from readonly transaction.

Note that we can't use empty context (context.TODO()) either, otherwise the readonly transaction will not timeout.

Probably we have two options, one for short-term, and one for long-term,

short-term fix: add a panic-recover to catch the panic for the readonly transaction case. We need to evaluate the impact.

That sounds right to me - as long as it's truly RO code-paths. I think we could do this as grpc interceptor.

long-term fix: Refactor the existing transaction design/implemention to use BoltDB transaction, and rollback the transaction on error. Just as I mentioned in previous comment, this may need big effort, and accordingly has big impact.

We do batching of the bolt transactions to minimize numbers of b-tree nodes that are being rewritten at the top of the tree. If we did true 1:1 transactions it would have negative performance impact.

As Context is an interface (https://pkg.go.dev/context#Context), I would propose implementing a wrapper around it that
overrides the:

Done() <-chan struct{}
Err() error
methods to mask the underlying context methods. Apply should continue independently from the request's context... The only thing it can check periodically is the server's context.

etcd-io#14110

lavacat · 2022-06-14T19:16:42Z

Here is a different approach - create a separate apply function for readonly txn. It's basically a partial copy of current applyTxn for range but without recursion, because isTxnReadonly will return false for nested readonly tnx (maybe it's a bug).

I think this approach might be a bit more flexible for future changes to readonly txn. And no need to run recover.

If this looks good, I can add some tests.

#14113

mcginne · 2022-06-20T13:30:19Z

We are also hitting this fairy regularly (Several times per day) on both v3.5.2 & v3.5.4.

ahrtr · 2022-06-20T21:21:03Z

We are also hitting this fairy regularly (Several times per day) on both v3.5.2 & v3.5.4.

Would you mind provide the complete log and the endpoint info (see command below)?

etcdctl endpoint status -w json --cluster

Could you also explain how do you produce this issue? Did you see this issue in a testing environment or a production environment?

What's the volume of the request per second?

mcginne · 2022-06-21T08:23:09Z

@ahrtr endpoint info:

[{"Endpoint":"https://etcd-501-armada-stage5-south-sbhnpvg72b.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":5440534516375719315,"revision":5729930287,"raft_term":12923},"version":"3.5.4","dbSize":1037332480,"leader":8376786025734177010,"raftIndex":4073315546,"raftTerm":12923,"raftAppliedIndex":4073315546,"dbSizeInUse":963850240}},{"Endpoint":"https://etcd-501-armada-stage5-south-5zflxmmcz2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":8376786025734177010,"revision":5729930289,"raft_term":12923},"version":"3.5.4","dbSize":1037058048,"leader":8376786025734177010,"raftIndex":4073315551,"raftTerm":12923,"raftAppliedIndex":4073315551,"dbSizeInUse":963796992}},{"Endpoint":"https://etcd-501-armada-stage5-south-7n25f4hms2.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11064341971799022067,"revision":5729930301,"raft_term":12923},"version":"3.5.4","dbSize":1040326656,"leader":8376786025734177010,"raftIndex":4073315567,"raftTerm":12923,"raftAppliedIndex":4073315567,"dbSizeInUse":963985408}},{"Endpoint":"https://etcd-501-armada-stage5-south-prfhrxqx5k.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":11815188087669886612,"revision":5729930309,"raft_term":12923},"version":"3.5.4","dbSize":1037373440,"leader":8376786025734177010,"raftIndex":4073315585,"raftTerm":12923,"raftAppliedIndex":4073315585,"dbSizeInUse":964038656}},{"Endpoint":"https://etcd-501-armada-stage5-south-m5zlm8l8l6.etcd-501-armada-stage5-south.armada.svc:2379","Status":{"header":{"cluster_id":8861204074038396446,"member_id":12614509488604193654,"revision":5729930313,"raft_term":12923},"version":"3.5.4","dbSize":1040461824,"leader":8376786025734177010,"raftIndex":4073315590,"raftTerm":12923,"raftAppliedIndex":4073315590,"dbSizeInUse":964067328}}]

Log ->
etcd_panic.log.zip

Request rate is in total around 1000 req/sec - it's a very varied workload with many Microservices making different requests against etcd. I'm afraid I don't know exactly what triggers it to occur. The workload runs constantly, and should be fairly similar in nature at all times - I don't see any big spikes in requests or response times or anything when we see the panic. We have 5 member cluster and it seems to affect all 5 members.

This is our performance/scale test environment, so we are holding off moving to prod with v3.5 until this is resolved.

aphyr · 2022-06-21T12:54:52Z

If it helps narrow things down, the workload I'm using that crashes etcd is kv reads, writes, and transactions over an exponentially-distributed pool of keys. Roughly 500-1000 ops/sec as well, though I haven't looked to see if it crashes with fewer.

ahrtr · 2022-06-21T20:49:24Z

Thanks @aphyr , we already know the root cause in your case. I just want to double confirm whether @mcginne ran into the same issue, and was caused by the same reason. Will take a look at the log provided by @mcginne today or tomorrow.

ahrtr · 2022-06-22T02:59:12Z

Thanks @mcginne . Confirmed that it's the same issue, and we will fix this issue soon in both main (3.6) and 3.5. Hopefully the fix will be included in 3.5.5.

lavacat · 2022-06-22T04:25:53Z

FYI, I'll post a PR for main with the approach outlined here
If somebody else is working on the fix, please state here, so there is no duplication of effort.

I'll try to get this in tomorrow June 22

mcginne · 2022-06-22T13:38:22Z

Thank you @ahrtr and @lavacat for confirming and the work on the fix - much appreciated.

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic. Solution: provide different error handler for readonly serializable txn. fixes etcd-io#14110

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 backported from main Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 backported from main Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

lavacat · 2022-07-06T17:03:55Z

@serathius @ptabor reminder to review the PRs
main #14149
port to 3.5 #14178

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149

Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

aphyr added the type/bug label Jun 13, 2022

ahrtr added the stage/tracked label Jun 14, 2022

ahrtr added the good first issue label Jun 14, 2022

serathius added the help wanted label Jun 14, 2022

aphyr mentioned this issue Jun 14, 2022

Introduce etcd linearizability tests #14045

Closed

33 tasks

lavacat pushed a commit to lavacat/etcd that referenced this issue Jun 14, 2022

etcdserver: special apply for readonly transactions

da41518

etcd-io#14110

lavacat mentioned this issue Jun 14, 2022

etcdserver: special apply for readonly transactions #14113

Closed

serathius mentioned this issue Jun 21, 2022

Plans for v3.5.5 release #14138

Closed

16 tasks

lavacat mentioned this issue Jun 23, 2022

server: don't panic in readonly serializable txn #14149

Merged

lavacat mentioned this issue Jun 29, 2022

[3.5] server: don't panic in readonly serializable txn #14178

Merged

ahrtr closed this as completed in #14149 Aug 13, 2022

ahrtr mentioned this issue Sep 16, 2022

Changelog: add item into 3.6 for the fix on panic in readonly serializable txn #14476

Closed

ptabor mentioned this issue Apr 13, 2023

Membership nomination for @lavacat (Bogdan Kanivets) #15712

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

panic: unexpected error during txn #14110

panic: unexpected error during txn #14110

aphyr commented Jun 13, 2022 •

edited

aphyr commented Jun 13, 2022 •

edited

lavacat commented Jun 14, 2022

ahrtr commented Jun 14, 2022

lavacat commented Jun 14, 2022

ahrtr commented Jun 14, 2022 •

edited

ahrtr commented Jun 14, 2022

ptabor commented Jun 14, 2022

lavacat commented Jun 14, 2022

mcginne commented Jun 20, 2022

ahrtr commented Jun 20, 2022

mcginne commented Jun 21, 2022

aphyr commented Jun 21, 2022

ahrtr commented Jun 21, 2022

ahrtr commented Jun 22, 2022

lavacat commented Jun 22, 2022 •

edited

mcginne commented Jun 22, 2022

lavacat commented Jul 6, 2022

panic: unexpected error during txn #14110

panic: unexpected error during txn #14110

Comments

aphyr commented Jun 13, 2022 • edited

What happened?

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Etcd version (please run commands below)

Etcd configuration (command line flags or environment variables)

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

Relevant log output

aphyr commented Jun 13, 2022 • edited

lavacat commented Jun 14, 2022

ahrtr commented Jun 14, 2022

lavacat commented Jun 14, 2022

ahrtr commented Jun 14, 2022 • edited

ahrtr commented Jun 14, 2022

ptabor commented Jun 14, 2022

lavacat commented Jun 14, 2022

mcginne commented Jun 20, 2022

ahrtr commented Jun 20, 2022

mcginne commented Jun 21, 2022

aphyr commented Jun 21, 2022

ahrtr commented Jun 21, 2022

ahrtr commented Jun 22, 2022

lavacat commented Jun 22, 2022 • edited

mcginne commented Jun 22, 2022

lavacat commented Jul 6, 2022

aphyr commented Jun 13, 2022 •

edited

aphyr commented Jun 13, 2022 •

edited

ahrtr commented Jun 14, 2022 •

edited

lavacat commented Jun 22, 2022 •

edited