Consumer occasionally gets stuck when restarting consumers with key_shared subscription type #10284

Closed
baomingyu opened this issue Apr 20, 2021 · 6 comments
Labels: type/bug (The PR fixed a bug or issue reported a bug)

Comments

@baomingyu
Contributor

In the following scenario, a consumer gets stuck after a restart.
First step: two consumers, say consumer1 and consumer2, use the key_shared subscription type in the same group.
Second step: the broker receives consumer1's flow command with 1000 permits but has not yet received consumer2's flow command.
Third step: the broker starts dispatching messages, but the keys of the messages are all assigned to consumer2, so it does not send any messages to either consumer.
Fourth step: on every subsequent dispatch loop, getRestrictedMaxEntriesForConsumer always returns 0, so no messages are sent.
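
The first three steps can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Pulsar broker's dispatcher: `KeySharedStallSketch`, `ToyConsumer`, `assignKey`, and `dispatchOnce` are hypothetical names invented here, and keys are pinned to consumers by hand instead of being hashed onto ranges.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the reported stall. NOT Pulsar broker code; all names are hypothetical. */
final class KeySharedStallSketch {

    /** Hypothetical stand-in for a consumer's flow-control state. */
    static final class ToyConsumer {
        final String name;
        int permits;    // permits granted by flow commands
        int delivered;  // messages handed to this consumer

        ToyConsumer(String name, int permits) {
            this.name = name;
            this.permits = permits;
        }
    }

    private final Map<String, ToyConsumer> keyOwner = new HashMap<>();
    private final Deque<String> backlog = new ArrayDeque<>();

    /** Pins a key to a consumer (assumption: a real broker hashes keys instead). */
    void assignKey(String key, ToyConsumer owner) {
        keyOwner.put(key, owner);
    }

    void publish(String key) {
        backlog.addLast(key);
    }

    /** Runs one dispatch round and returns how many messages were delivered. */
    int dispatchOnce() {
        int sent = 0;
        while (!backlog.isEmpty()) {
            ToyConsumer owner = keyOwner.get(backlog.peekFirst());
            if (owner == null || owner.permits == 0) {
                break; // the key's owner cannot accept it, and no other consumer may take it
            }
            backlog.removeFirst();
            owner.permits--;
            owner.delivered++;
            sent++;
        }
        return sent;
    }

    /** Replays the scenario; returns {round 1, round 2, round after consumer2's flow}. */
    static int[] runScenario() {
        KeySharedStallSketch dispatcher = new KeySharedStallSketch();
        ToyConsumer consumer1 = new ToyConsumer("consumer1", 1000); // flow command received
        ToyConsumer consumer2 = new ToyConsumer("consumer2", 0);    // flow command not yet seen
        dispatcher.assignKey("key-a", consumer2);
        dispatcher.assignKey("key-b", consumer1); // no "key-b" messages are published
        for (int i = 0; i < 3; i++) {
            dispatcher.publish("key-a");
        }
        int first = dispatcher.dispatchOnce();  // nothing can be sent
        int second = dispatcher.dispatchOnce(); // still nothing
        consumer2.permits = 1000;               // consumer2's flow command arrives
        int third = dispatcher.dispatchOnce();
        return new int[] {first, second, third};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(runScenario())); // prints [0, 0, 3]
    }
}
```

In this toy model, delivery resumes as soon as consumer2 is granted permits. It does not model getRestrictedMaxEntriesForConsumer, so it shows only why a round can deliver nothing while consumer1 holds 1000 unused permits, not the lasting stall described in the fourth step.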

@james-bright-helix

We see this problem roughly 50% of the time when we do a rolling restart of our consumers, e.g., 5 at a time for a total of 20. Normally the only way to fix it is to unload the bundle housing the topic. We've been experiencing this on all versions since 2.6.3 and have seen several "stuck consumer" type issues marked as resolved, but issues with key_shared still remain. Is there anything we can do when we experience this issue to assist in getting it fixed?

@codelipenghui
Contributor

@james-bright-helix You can try out 2.8.1 or 2.7.3, both of which contain #10920

@james-bright-helix

@codelipenghui sorry my message wasn't clear. We're on 2.8.1 and tried every version since 2.6.3 but still suffering from stuck key_shared consumers. I was hoping the attached PR was going to fix our issue but it seems to not be progressing hence the offer to help provide more details.

@codelipenghui
Contributor

@james-bright-helix Do you have a way to reproduce the issue on 2.8.1?

@james-bright-helix

james-bright-helix commented Nov 9, 2021

> @james-bright-helix Do you have a way to reproduce the issue on 2.8.1?

@codelipenghui Not consistently, at least not in a way that isn't disruptive. We have to bounce our production app, and then it happens frequently. We see it very rarely in our non-production envs, which are much smaller. Is there any additional logging or are there metrics we can gather to share when it does happen?
One thing we noticed is that if you bounce only some of the consumers, e.g., 5 of 20 consumers, then the backlog is sometimes processed for a while before stopping again. Unloading the topic/namespace has been our only consistent way to recover.
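
For anyone wanting to capture state before unloading, here is a minimal sketch that builds two admin REST requests: a topic stats request and a topic unload request. The base URL, tenant, namespace, and topic name are placeholder assumptions, and `AdminRequestSketch` with its helpers is a hypothetical name. The sketch only constructs the requests and does not send them.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Builds Pulsar admin REST requests for a persistent topic; helper names are hypothetical. */
final class AdminRequestSketch {

    /** Composes /admin/v2/persistent/{tenant}/{namespace}/{topic}/{action}. */
    static URI topicUri(String baseUrl, String tenant, String namespace,
                        String topic, String action) {
        String encodedTopic = URLEncoder.encode(topic, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/admin/v2/persistent/" + tenant + "/"
                + namespace + "/" + encodedTopic + "/" + action);
    }

    /** GET .../stats: topic statistics, worth saving while the consumers are stuck. */
    static HttpRequest statsRequest(String baseUrl, String tenant,
                                    String namespace, String topic) {
        return HttpRequest.newBuilder(topicUri(baseUrl, tenant, namespace, topic, "stats"))
                .GET()
                .build();
    }

    /** PUT .../unload: unloads the topic, the recovery step described above. */
    static HttpRequest unloadRequest(String baseUrl, String tenant,
                                     String namespace, String topic) {
        return HttpRequest.newBuilder(topicUri(baseUrl, tenant, namespace, topic, "unload"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values (assumptions): adjust to the actual cluster and topic.
        String base = "http://localhost:8080";
        HttpRequest stats = statsRequest(base, "public", "default", "my-topic");
        HttpRequest unload = unloadRequest(base, "public", "default", "my-topic");
        System.out.println(stats.method() + " " + stats.uri());
        System.out.println(unload.method() + " " + unload.uri());
    }
}
```

To actually send a request, pass it to `java.net.http.HttpClient`, for example `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Saving the stats output before unloading preserves the per-consumer state from the moment of the stall.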

@james-bright-helix

james-bright-helix commented Nov 10, 2021

@codelipenghui I thought I should mention it's possibly related to #12208 as we are using reconsumeLaterAsync() on all topics (if we have an environmental error we want to retry after a delay) and on one topic we also use deliverAfter() although we don't usually see anything on the retry topics when it's stuck. As mentioned in that issue, it's not clear if these are expected to work with key_shared subscriptions.

@codelipenghui codelipenghui removed this from the 2.10.0 milestone Feb 14, 2022