[QUESTION] Node Failing During a Transaction (Consume-Process-Produce) #905

jameskirch · 2023-07-10T23:46:10Z

We have followed the transactional consume-process-produce paradigm laid out at:

https://aiokafka.readthedocs.io/en/stable/examples/transaction_example.html

We seem to be seeing issues when a node is restarted (or becomes unreachable) while a transaction is in the middle of processing. The main scenario where this occurs is on high-traffic producers when Amazon conducts a rolling restart of their servers for maintenance etc. This inevitably leads to a node becoming unreachable whilst a producer is mid-transaction.

Instead of failing over to another reachable node, 'send_offsets_to_transaction' seems to keep retrying against the problem node until timeouts are eventually hit and everything crashes. Strangely, the same errors continue to spam the logs even after the node recovers:

Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')
Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')

...(above repeats 100s of times until an eventual timeout)

What is the intended behavior when a node becomes unreachable in the middle of a producer transaction? Is it inevitable that the transaction will fail?

Is it possible to catch NodeNotReadyErrors so we can perhaps abort the transaction and start a new one, rather than having it get stuck in a loop and failing?
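To make the question concrete, the kind of handling we have in mind is roughly the sketch below. It is not taken from the aiokafka docs. It relies only on the begin_transaction / commit_transaction / abort_transaction coroutines of the producer; the names run_transaction, work and timeout are hypothetical. Since the NodeNotReadyError appears to be logged rather than raised to our code, the sketch bounds the work with a timeout instead of catching that error directly. Whether abort_transaction() can actually complete while the node is still unreachable is an assumption we have not verified.

```python
import asyncio


async def run_transaction(producer, work, timeout):
    """Run `work(producer)` inside one transaction, bounded by `timeout` seconds.

    Returns True if the transaction committed, False if it was aborted.
    `producer` is any object with async begin/commit/abort methods
    (as AIOKafkaProducer has); `work` and `timeout` are hypothetical names.
    """
    await producer.begin_transaction()
    try:
        # `work` would send messages and call send_offsets_to_transaction().
        await asyncio.wait_for(work(producer), timeout)
    except Exception:
        # Timeout or any error raised by `work`: give up on this transaction.
        # Assumption: abort_transaction() can finish here, which may not hold
        # while the coordinator node is still unreachable.
        await producer.abort_transaction()
        return False
    await producer.commit_transaction()
    return True
```

The caller could then re-consume from the last committed offsets and start a fresh transaction whenever this returns False.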

aiokafka.version == '0.8.0'
kafka.version == '2.0.2'


jameskirch added the question label Jul 10, 2023
