Kafka commit problem due to session closure #150
Default in loader was changed to auto commit true (ab820c9).
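For context, in sarama this change corresponds roughly to the following config (a sketch assuming sarama's config fields; `newConsumerConfig` is an illustrative name, not the loader's actual code):

```go
import (
	"time"

	"github.com/IBM/sarama" // imported as github.com/Shopify/sarama at the time
)

func newConsumerConfig() *sarama.Config {
	config := sarama.NewConfig()
	// With auto commit on, offsets marked during ConsumeClaim are flushed
	// to the broker by a background loop instead of explicit Commit() calls.
	config.Consumer.Offsets.AutoCommit.Enable = true
	config.Consumer.Offsets.AutoCommit.Interval = 1 * time.Second // sarama's default
	return config
}
```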
Message commit scenarios were tested after upgrading the Redshift cluster.
Cannot reproduce. The weird thing is: how is the Redshift cluster related to the Kafka commit problem at all? The only reason I can think of is that the Kafka consumer session is somehow expiring due to slowness in the resource-starved Redshift cluster. Since the …
One more observation with `autoCommit: false, maxSize: 100`: the first ConsumeClaim did `Commit()`, but the commit did not happen.
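A minimal sketch of the manual-commit path being tested here (the `batcher` type, `maxSize`, and `process` are assumed names, not the loader's actual code; Setup/Cleanup are elided):

```go
// batcher is a hypothetical stand-in for the loader's handler.
type batcher struct {
	maxSize int
	process func([]*sarama.ConsumerMessage) // e.g. load the batch to Redshift
}

// ConsumeClaim with autoCommit: false marks offsets per batch and flushes
// them explicitly with Commit().
func (b *batcher) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	batch := make([]*sarama.ConsumerMessage, 0, b.maxSize)
	for msg := range claim.Messages() {
		batch = append(batch, msg)
		if len(batch) >= b.maxSize { // maxSize: 100 in the test above
			b.process(batch)
			session.MarkMessage(batch[len(batch)-1], "") // mark the last consumed offset
			session.Commit()                             // synchronous flush; no error return
			batch = batch[:0]
		}
	}
	return nil
}
```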
Reproduced.
The next consumer session starts and the same thing happens: this big batch is not getting committed. Issue reproduced with auto commit true. Finally at 04:26 the current consumer group started showing as 3367. So the problem is the delay in the commit reflecting in Kafka when the batch size is huge, and this happens only when auto commit is true. Solution: …
The issue was seen to happen even after deleting one of the three Kafka pods and retrying.
Tracing what happens in sarama's …
Then I found IBM/sarama#1310, which says you get the generation error when you try to commit from a closed session. This is the reason big batches were not working out. Increasing the Kafka consumer timeouts would help; they should also be configurable.
(Only maxProcessingTime was tried.)
New configurations were added because these values need tuning per use case, depending on the type of processing and the size of the batch. We needed this to process huge batch loads into the Redshift cluster: when the batch size was huge, timeouts were happening, leading to commits not going through. Details on why we are doing this are in this issue: #150 (comment)
At present the whole loader will shut down if the session expires for even one routine. We do this so that we learn of the problem fast enough, since we get CrashLoop alerts on continuous restarts. Later we can change this to restart the ConsumeClaim on session timeouts and track the error rate in Prometheus. #150 (comment)
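Roughly, these are the sarama timeout knobs involved (the field names are sarama's; the values are illustrative assumptions, not the defaults we shipped):

```go
import (
	"time"

	"github.com/IBM/sarama"
)

func consumerTimeouts() *sarama.Config {
	config := sarama.NewConfig()
	// Max time one message may take to process before the partition
	// consumer pauses fetching; the 100ms default is far too low for
	// batched Redshift loads.
	config.Consumer.MaxProcessingTime = 10 * time.Minute
	// The broker evicts a group member that stays silent for longer than
	// this, which closes the session mid-batch.
	config.Consumer.Group.Session.Timeout = 2 * time.Minute
	// Keep heartbeats at roughly a third of the session timeout or less.
	config.Consumer.Group.Heartbeat.Interval = 20 * time.Second
	return config
}
```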
Increasing sarama's … solves this issue.
The issue is still happening even after maxProcessingTime was increased. It now happens irrespective of batch size.
Deleted the Kafka pod and things started working, only to fail again.
Trying session retry (5cc9052) as mentioned in IBM/sarama#1685.
Things have been working well for us after re-establishing sessions. At present we think sessions were getting closed due to rebalancing, as every time it happens we see this error: `loop check partition number coroutine will stop to consume`. https://stackoverflow.com/questions/39730126/difference-between-session-timeout-ms-and-max-poll-interval-ms-for-kafka-0-10 Will keep an eye on exactly which case triggers the rebalance: is it due to slow loader consumers?
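For reference, a sketch of the session retry loop we adopted (per IBM/sarama#1685; `consumeLoop` is an assumed name, not the loader's actual function):

```go
import (
	"context"

	"github.com/IBM/sarama"
)

func consumeLoop(ctx context.Context, group sarama.ConsumerGroup, topics []string, handler sarama.ConsumerGroupHandler) error {
	for {
		// Consume blocks for the lifetime of one session and returns when
		// the session ends (for example on a rebalance); re-entering it
		// re-establishes the session instead of treating the return as fatal.
		if err := group.Consume(ctx, topics, handler); err != nil {
			return err // real errors still bubble up and crash the loader
		}
		if err := ctx.Err(); err != nil {
			return err // shutdown requested
		}
	}
}
```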
This was due to #160.
The loader consumer group takes a long time to load, depending on the size of the batch and on how loaded the Redshift cluster is.
What we have seen: when the cluster is resource-starved, 100+ topics are being loaded to Redshift concurrently, and the batch the loader operates on is big (lakhs, i.e., hundreds of thousands, of rows in one load), the commit to Kafka does not happen and the batches keep getting reprocessed.
What does not work
What works
The cause is still not known.
The `b.session.Commit()` call does not return an error even though it is a synchronous call, so we are not really sure at present why this happens: both marking the offset and the commit call go through, but the commit does not take effect. We are upgrading the Redshift cluster as a short-term fix.
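For reference, the relevant (abridged) part of sarama's `ConsumerGroupSession` interface shows why there is nothing to check at the call site:

```go
// Abridged from sarama's ConsumerGroupSession interface (comments ours).
type ConsumerGroupSession interface {
	// MarkOffset only records the offset locally; it is flushed by the
	// auto-commit loop or by an explicit Commit().
	MarkOffset(topic string, partition int32, offset int64, metadata string)
	// Commit flushes marked offsets to the broker and blocks until the
	// broker responds, but reports nothing on failure.
	Commit()
	// ... other methods elided
}
```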
Long term: discuss and debug to find the root cause.
Reddit thread on this: https://www.reddit.com/r/apachekafka/comments/lw7kdz/kafka_manual_commit_problem/
Sarama issues:
Kafka Commit is not working IBM/sarama#1894
Consumer group ignores AutoCommit flag when it exits IBM/sarama#1843
commit offset manually when using consumer group IBM/sarama#1570 (comment)