Sync producer erroring for repeated transactions with the same transaction ID #2859
Comments
@tomplarge thanks for raising this one. When we commit the transaction, the txnmgr/endtxn request is sent to the coordinator. The Java client just treats this error as retriable and tries again after a backoff (see here), so we should probably make sure we're doing the same.
Ah, I just spotted your … (lines 359 to 360 in 4ad3504).
@dnwe thanks for the response! I also ran the exactly-once example and noticed it had the same issue when debug logging was enabled. The exactly-once example suggests that retrying in application logic is recommended (https://github.com/IBM/sarama/blob/main/examples/exactly_once/main.go#L295). So are you saying that the way to handle this is just to be more generous with the retry backoff?

As an aside: in an optimal world, would it not make more sense to enforce synchronous behavior when committing a transaction? I'm curious what your thoughts are here.
@tomplarge yeah, I agree it shouldn't be async. Ideally we wouldn't get the response from the Commit until it had 100% completed server side, but that async behaviour seems to be a bit of a nit in the Kafka code ever since the transactional producer was introduced in 0.11.0, rather than a mistake Sarama is making, and something they were just happy to hide away in the client retries. See KAFKA-5477, where the "transaction in progress" retry backoff was special-cased to be shorter than the regular backoff, specifically because the client can seemingly always get the "in progress" error on the first attempt.
@dnwe this is great to know, thanks for pointing out that issue. I'll do some reconfiguring and add more flexible retry logic. The larger question I have is: at what point do we actually fail on the application side? For example, during a rebalance, I have noticed this error spewing from the fenced producers:
Description
There seems to be an issue with repeatedly creating transactions with the same transaction ID. I have a toy example that creates a `SyncProducer` with a stable transaction ID `"transaction-id"`, then repeatedly calls `BeginTxn()`, `SendMessages()`, and `CommitTxn()` in a loop until failure. I have debug logging enabled, and almost immediately I see the error:

```
txnmgr/add-partition-to-txn retrying after 20ms... (1 attempts remaining) (transaction manager: failed to send partitions to transaction: kafka server: The producer attempted to update a transaction while another concurrent operation on the same transaction was ongoing)
```

When I increase `config.Producer.Transaction.Retry.Backoff`, I see crashes happen less often, but I still see the same error from the transaction manager. When I increase `sleepTime` (which introduces a delay between starting transactions), the issue goes away entirely.

Is this an issue in Sarama's handling of transaction state? Or do I have a misunderstanding?
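The loop described above can be sketched roughly as follows. This is a reconstruction, not the reporter's actual program: the topic name, broker address, and message payload are placeholders, and `sleepTime` is taken from the description. The config fields (`Producer.Transaction.ID`, `Producer.Transaction.Retry.Backoff`, etc.) are real Sarama settings required for a transactional `SyncProducer`:

```go
package main

import (
	"log"
	"time"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Idempotent = true
	cfg.Producer.Transaction.ID = "transaction-id" // stable across iterations
	cfg.Producer.RequiredAcks = sarama.WaitForAll
	cfg.Producer.Return.Successes = true
	cfg.Net.MaxOpenRequests = 1
	// Raising this makes the CONCURRENT_TRANSACTIONS error rarer, but not gone.
	cfg.Producer.Transaction.Retry.Backoff = 100 * time.Millisecond

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	sleepTime := 0 * time.Millisecond // increasing this makes the issue disappear
	for {
		if err := producer.BeginTxn(); err != nil {
			log.Fatal(err)
		}
		msgs := []*sarama.ProducerMessage{
			{Topic: "test-topic", Value: sarama.StringEncoder("hello")},
		}
		if err := producer.SendMessages(msgs); err != nil {
			log.Fatal(err)
		}
		if err := producer.CommitTxn(); err != nil {
			log.Fatal(err) // fails almost immediately with the default backoff
		}
		time.Sleep(sleepTime)
	}
}
```

With `sleepTime` at zero, each `BeginTxn` races the broker-side completion of the previous commit, which is what produces the "concurrent operation on the same transaction" error discussed in the comments.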
Versions
Configuration
Logs
Additional Context
Here is the program: