panic: unexpected error during txn #14110
Comments
I'm not sure if this is precisely the same bug or a different one, but I've managed to reproduce a crash with a slightly different error message purely from process pauses--no compaction or defrag operations involved. Here are the full logs.
I think this is a good catch and a relevant bug. Here is how this happened:
I think the minimal fix should be to use an empty context for tracing in txn (this is a quick change). Maybe we should have a special class for the tracing context to prevent confusion. But ideally we should match the behavior of readonly operations wrapped in serializable transactions with normal readonly operations. Correct me if I'm wrong here.
This is a valid issue. Based on the current implementation, once a transaction starts, it must finish without any error. Otherwise, it will end up with partially committed data. The workaround (the current implementation) is to crash the process to prevent a partial commit from happening. Probably we should use the BoltDB transaction to perform the transaction, but that would need a big refactoring.
Yes, this case happens for readonly (not necessarily serializable) transactions. But applyTxn is a generic implementation, and it doesn't know that the request comes from a readonly transaction. Note that we can't use empty context ( Probably we have two options, one for the short term and one for the long term,
Anyone, please feel free to submit a PR for the short-term fix for now.
That sounds right to me - as long as it's truly RO code-paths. I think we could do this as a gRPC interceptor.
We batch the bolt transactions to minimize the number of B-tree nodes that are rewritten at the top of the tree. If we did true 1:1 transactions, it would have a negative performance impact. As Context is an interface (https://pkg.go.dev/context#Context), I would propose implementing a wrapper around it that
Here is a different approach - create a separate apply function for readonly txn. It's basically a partial copy of the current applyTxn for range, but without recursion, because isTxnReadonly will return false for a nested readonly txn (maybe it's a bug). I think this approach might be a bit more flexible for future changes to readonly txn. And there is no need to run recover. If this looks good, I can add some tests.
We are also hitting this fairly regularly (several times per day) on both v3.5.2 & v3.5.4.
Would you mind providing the complete log and the endpoint info (see command below)?
Could you also explain how you reproduce this issue? Did you see it in a testing environment or a production environment? What is the request volume per second?
@ahrtr endpoint info:
Log -> Request rate is around 1000 req/sec in total - it's a very varied workload with many microservices making different requests against etcd. I'm afraid I don't know exactly what triggers it. The workload runs constantly and should be fairly similar in nature at all times - I don't see any big spikes in requests or response times when we see the panic. We have a 5-member cluster and it seems to affect all 5 members. This is our performance/scale test environment, so we are holding off moving to prod with v3.5 until this is resolved.
If it helps narrow things down, the workload I'm using that crashes etcd is kv reads, writes, and transactions over an exponentially-distributed pool of keys. Roughly 500-1000 ops/sec as well, though I haven't looked to see if it crashes with fewer.
Thanks @mcginne . Confirmed that it's the same issue, and we will fix this issue soon in both main (3.6) and 3.5. Hopefully the fix will be included in 3.5.5.
FYI, I'll post a PR for main with the approach outlined here. I'll try to get this in tomorrow, June 22.
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic. Solution: provide different error handler for readonly serializable txn. fixes etcd-io#14110
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 backported from main Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
@serathius @ptabor reminder to review the PRs
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149
Problem: We pass grpc context down to applier in readonly serializable txn. This context can be cancelled for example due to timeout. This will trigger panic inside applyTxn Solution: Only panic for transactions with write operations fixes etcd-io#14110 main PR etcd-io#14149 Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
What happened?
In a Jepsen test run of five etcd 3.5.3 nodes, with process pauses (sending processes SIGSTOP and SIGCONT), compaction (performed via the admin API), and defragmentation (via etcdctl), one etcd process crashed with the following error message:
This is the first time this has cropped up, so I'm not sure whether pauses, compaction, and defrag are all necessary to trigger the bug. The only nemesis operations on this particular node before the crash appear to be pause & resume, but I imagine it's possible that other nodes performing compaction or defragmentation might have caused them to exchange (or fail to exchange!) messages with this node which caused it to crash.
Here are the full logs and data files from the cluster: 20220613T140950.000Z.zip.
What did you expect to happen?
I expected etcd not to crash.
How can we reproduce it (as minimally and precisely as possible)?
Check out jepsen-etcd 84d7d54698c387ed467dd5dfb8ca4bebc2ee46d5, and with a five-node cluster, run:
I'm not sure how long this will take to reproduce yet--still collecting evidence.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
These node IDs aren't going to match, because I collected them from a different test run, but it should give you a general idea of the cluster topology:
Note that since these tests involve process pauses, it's tough to reach in and grab a coherent cluster view--pretty much every attempt I made involved timeouts.
Relevant log output
No response