We are seeing failures when a node is restarted (or becomes unreachable) while a transaction is still in progress. The main scenario is a high-traffic producer running while Amazon performs a rolling restart of their servers for maintenance. This inevitably leaves a node unreachable while a producer is mid-transaction.
Instead of failing over to another reachable node, 'send_offsets_to_transaction' appears to keep retrying the unreachable node until timeouts are eventually hit and everything crashes. Strangely, the same errors keep filling the logs even after the node recovers:
Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')
Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')
...(above repeats 100s of times until an eventual timeout)
What is the intended behavior when a node becomes unreachable in the middle of a producer transaction? Is it inevitable that the transaction will fail?
Is it possible to catch NodeNotReadyErrors so we can perhaps abort the transaction and start a new one, rather than having it get stuck in a loop and failing?
We have followed the transactional consume-process-produce paradigm laid out at:
https://aiokafka.readthedocs.io/en/stable/examples/transaction_example.html
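To make the question concrete, here is a minimal sketch of the kind of workaround we have in mind: put a deadline on the transactional work and abort if it is exceeded, instead of waiting on the retry loop. The helper name `run_transaction_with_deadline`, the `work` callback, and `timeout_s` are hypothetical; only `begin_transaction`, `commit_transaction`, and `abort_transaction` are taken from the producer API. It is an assumption that cancelling a stuck call leaves the producer usable and that the abort can complete while the node is unreachable, which is part of what we are asking.

```python
import asyncio


async def run_transaction_with_deadline(producer, work, timeout_s):
    """Run `work(producer)` inside a transaction, aborting on error or timeout.

    `producer` is any object with async begin_transaction(),
    commit_transaction() and abort_transaction() methods.
    `work` is a hypothetical callback that does the sends and the offset
    commit for one batch. Returns True if committed, False if aborted.
    """
    await producer.begin_transaction()
    try:
        # Bound the transactional work so a stuck retry loop cannot block forever.
        await asyncio.wait_for(work(producer), timeout=timeout_s)
    except Exception:
        # Covers asyncio.TimeoutError as well as errors raised by `work`.
        # Assumption: the abort itself may hang too, so it is bounded as well;
        # a timeout here propagates to the caller.
        await asyncio.wait_for(producer.abort_transaction(), timeout=timeout_s)
        return False
    # Commit is not bounded in this sketch; errors from it propagate.
    await producer.commit_transaction()
    return True
```

With a real producer, `work` would contain the `send` calls and the `send_offsets_to_transaction` call from the linked example, and the caller would retry the batch when the helper returns False.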
aiokafka.version == '0.8.0'
kafka.version == '2.0.2'