[QUESTION] Node Failing During a Transaction (Consume-Process-Produce) #905

Open
jameskirch opened this issue Jul 10, 2023 · 0 comments

jameskirch commented Jul 10, 2023

We have followed the transactional consume-process-produce paradigm laid out at:

https://aiokafka.readthedocs.io/en/stable/examples/transaction_example.html

We seem to be seeing issues when a node is restarted (or becomes unreachable) while a transaction is in progress. This mainly happens on high-traffic producers when Amazon performs a rolling restart of its servers for maintenance, which inevitably leaves a node unreachable while a producer is mid-transaction.

Instead of failing over to another reachable node, 'send_offsets_to_transaction' seems to keep retrying the problem node until timeouts are eventually hit and everything crashes. Strangely, the same errors continue to fill the logs even after the node recovers:

Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')
Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')

...(the above repeats hundreds of times until a timeout is eventually hit)

What is the intended behavior when a node becomes unreachable in the middle of a producer transaction? Is it inevitable that the transaction will fail?

Is it possible to catch NodeNotReadyErrors so we can perhaps abort the transaction and start a new one, rather than having it get stuck in a loop and failing?
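
To make the question concrete, here is a minimal sketch of the kind of handling we have in mind. It is an assumption on our side, not something taken from the aiokafka docs: as far as we can tell the NodeNotReadyError is only logged and never reaches our code, so the sketch bounds the call with asyncio.wait_for instead of catching that error. The helper name commit_or_abort and the timeout_s value are hypothetical; the producer is expected to expose the send_offsets_to_transaction, commit_transaction and abort_transaction coroutines of AIOKafkaProducer.

```python
import asyncio


async def commit_or_abort(producer, offsets, group_id, timeout_s=30.0):
    """Send offsets and commit, aborting the transaction if either step stalls.

    `producer` is any object exposing the transactional coroutines
    send_offsets_to_transaction, commit_transaction and abort_transaction.
    `timeout_s` is an arbitrary budget chosen for illustration (assumption).
    Returns True if the transaction committed, False if it was aborted.
    """
    try:
        # Bound each step so a retry loop against an unreachable node
        # cannot block the caller indefinitely.
        await asyncio.wait_for(
            producer.send_offsets_to_transaction(offsets, group_id),
            timeout=timeout_s,
        )
        await asyncio.wait_for(producer.commit_transaction(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # Assumption: the producer can still abort after the timed-out call
        # has been cancelled. We have not verified this against aiokafka.
        await producer.abort_transaction()
        return False
```

The caller would begin a new transaction and reprocess the batch when this returns False. Whether abort_transaction itself can complete while the node is still unreachable is part of what we are asking.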

aiokafka.version == '0.8.0'
kafka.version == '2.0.2'
