Consumer hanging on Close() call when redeploying application #767
Comments
This is the fix: #757
@jliunyu Thanks! Unfortunately, it looks like we're still seeing this. I went ahead and pulled your change into my own fork and deployed it.
I ran a rolling restart and we saw the same pattern in the logs; I have included them here. We're going to try dumping the goroutines to see if we can more closely determine where we're hanging.
The PR is still pending merge, and the target release for the fix is v1.9.0.
@jliunyu We tested the PR in a fork and it still does not fix the issue. I was able to reproduce the issue with a small test case against my local broker: I created a topic and subscribed to it with a rebalance callback that calls IncrementalAssign/IncrementalUnassign and with "partition.assignment.strategy" set to cooperative-sticky (a sketch of a comparable setup follows below).
If I run this locally, it hangs in Close().
If I remove the IncrementalAssign/IncrementalUnassign calls and the "partition.assignment.strategy" setting, it does not hang.
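The original test case is not reproduced above. Below is a minimal sketch of a comparable reproduction, assuming the v1 `github.com/confluentinc/confluent-kafka-go/kafka` package; the broker address, topic, and group id are placeholders, not the reporter's actual values. Note that the rebalance callback passed to `SubscribeTopics` is only invoked from inside `Poll()`.

```go
package main

import (
	"fmt"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":             "localhost:9092", // placeholder
		"group.id":                      "close-hang-repro",
		"partition.assignment.strategy": "cooperative-sticky",
	})
	if err != nil {
		panic(err)
	}

	// Mirror the cooperative strategy in the rebalance callback.
	rebalanceCb := func(c *kafka.Consumer, ev kafka.Event) error {
		switch e := ev.(type) {
		case kafka.AssignedPartitions:
			return c.IncrementalAssign(e.Partitions)
		case kafka.RevokedPartitions:
			return c.IncrementalUnassign(e.Partitions)
		}
		return nil
	}

	if err := c.SubscribeTopics([]string{"test-topic"}, rebalanceCb); err != nil {
		panic(err)
	}

	// Poll long enough for the group to join and partitions to be assigned;
	// the rebalance callback is served from within Poll().
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		c.Poll(100)
	}

	fmt.Println("closing consumer...")
	// On the affected versions this call reportedly never returns.
	if err := c.Close(); err != nil {
		fmt.Println("close error:", err)
	}
	fmt.Println("closed")
}
```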
Thanks for your verification. You mentioned that if you remove IncrementalAssign/IncrementalUnassign and the "partition.assignment.strategy" configuration setting, it works; that means the default (eager) assignment strategy works for you. BTW, if you don't use the channel-based consumer, please use the rebalance callback to handle the rebalance; it will be called from the Poll() function.
We do indeed want to use cooperative-sticky, thanks
I spent a little more time looking at this and was able to come up with a candidate fix. I reordered the if/else and tested it locally, and it works successfully.
Good analysis, @kevinconaway!
Unfortunately I don't think I have the proficiency in C to write a proper test case to accompany the change. Would you mind if I deferred the change to you? Thanks.
@edenhill Is this something that you would be able to assist with, or is it faster for me to attempt to cobble together a PR? I can certainly change the if order, but adding a test might require some assistance if I don't get it right.
Thanks so much for addressing this, @edenhill. What is the timeline for a 1.9.0 librdkafka release? If it's not relatively soon, would we be able to backport this to 1.8.x?
If all goes well we should be able to release 1.9.0 next week. We generally try to avoid backports.
@edenhill Any update on when librdkafka 1.9.0 will be released?
@kevinconaway looks like 1.9.0 was just released: https://github.com/confluentinc/confluent-kafka-go/releases/tag/v1.9.0
Closing this after testing on the latest release (1.9.2), in which any form of (un)assign is permitted during close.
Calling `consumer.close()` was hanging indefinitely when the `optimizedRebalance` config was enabled. The issue is the same as confluentinc/confluent-kafka-go#767. The `rebalance` method needs to call `consumer.unassign()` when closing. It was not doing this because `this.consumer` was being set to `undefined` before disconnecting.
Description
We are running into issues when shutting down consumers in our environment: closing the consumer gets stuck and loops until our pod (we run in k8s) is forcibly terminated. This does not happen for every instance of our application, but on a restart or redeploy of all pods, 5-10% will end up hanging and being forcibly terminated. Details on how many pods are running and how we redeploy are in the "How to Reproduce" section.
We think it has something to do with the `unsubscribe()` call in `consumer.Close()`, but we are unsure. It could also be an implementation detail of how we are handling Assign/Revoke events. It is worth noting that we will be in the middle of a rebalance while making this call to `consumer.Close()`. We cannot use static membership due to an issue reported separately.

Config/Env Details:
Code Details
We instantiate our consumer with a rebalanceCB and a SubscribeTopics call:
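The original code block is not reproduced above. A minimal sketch of what such setup might look like, assuming the same v1 `kafka` package as in the earlier snippet and a hypothetical `Reader` wrapper type (the field names, broker address, group id, and extra config keys are assumptions, not the reporter's actual code):

```go
// Hypothetical wrapper around the consumer; field names are assumptions based on the prose.
type Reader struct {
	consumer      *kafka.Consumer
	offsetCommits chan kafka.TopicPartition // offset commits are funneled through this channel
	stopPolling   chan struct{}             // closed by stopReading() to stop the poll loop
}

func NewReader(topics []string) (*Reader, error) {
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":             "kafka:9092", // placeholder
		"group.id":                      "example-group",
		"partition.assignment.strategy": "cooperative-sticky",
		"enable.auto.commit":            false,
	})
	if err != nil {
		return nil, err
	}

	r := &Reader{
		consumer:      consumer,
		offsetCommits: make(chan kafka.TopicPartition),
		stopPolling:   make(chan struct{}),
	}
	// Subscribe with the rebalance callback shown in the next snippet.
	if err := consumer.SubscribeTopics(topics, r.rebalanceCB); err != nil {
		return nil, err
	}
	return r, nil
}
```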
Note: our rebalanceCB is a bit strange in that it is just a wrapper for calling our "handleEvent" function:
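Sketched from that description (the receiver and signatures are assumptions, not the original code):

```go
// The rebalance callback is only a thin wrapper that forwards the event to handleEvent.
func (r *Reader) rebalanceCB(c *kafka.Consumer, ev kafka.Event) error {
	return r.handleEvent(ev)
}
```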
Here is our handleEvent function (note: we have removed some log lines/if statements to keep the code concise/clear):
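The function itself is not shown above; below is a sketch of a handleEvent of the kind described, keeping only the incremental assign/unassign handling (the real version reportedly had additional logging and branching):

```go
func (r *Reader) handleEvent(ev kafka.Event) error {
	switch e := ev.(type) {
	case kafka.AssignedPartitions:
		// Cooperative rebalance: incrementally add the newly assigned partitions.
		return r.consumer.IncrementalAssign(e.Partitions)
	case kafka.RevokedPartitions:
		// Cooperative rebalance: incrementally remove the revoked partitions.
		return r.consumer.IncrementalUnassign(e.Partitions)
	default:
		return nil
	}
}
```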
We have two goroutines that handle polling and offset commits, respectively. The application shuts down by stopping the poll goroutine with stopReading(), then closing the offset-commits channel, before finally attempting to close the consumer:
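The shutdown code is not reproduced above; a sketch of the sequence as described, reusing the hypothetical `Reader` type from the earlier snippets:

```go
// stopReading signals the polling goroutine to exit (assumed implementation).
func (r *Reader) stopReading() {
	close(r.stopPolling)
}

// Shutdown mirrors the sequence described above: stop polling, stop committing,
// then close the consumer. Close() is the call that hangs for a subset of pods.
func (r *Reader) Shutdown() error {
	r.stopReading()
	close(r.offsetCommits)
	return r.consumer.Close()
}
```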
Additional Observations
According to our application logs, we stop reading and writing offset commits, but we hang when calling `r.consumer.Close()`. This can be found in the gist of client logs linked below, but I have included a snippet here:
After this point in the logs, we saw a few commonalities in the pods that get stuck. We consistently see:
and ultimately we will "hang" with this log line repeatedly reported:
How to reproduce
We can make this consistently occur in our setup, but I have not spun up a separate test.
We see this happen during a deploy with 20 consumer instances running. If we rotate 4 consumers every 30 seconds during a deploy, usually 1-2 out of 20 will end up stuck and forcibly terminated.
Please let me know if you need more information or have further questions.