[broker] Fix issue where Key_Shared consumers could get stuck #10920
super.removeConsumer(consumer);
// The consumer must be removed from the selector before calling the superclass removeConsumer method.
// In the superclass removeConsumer method, the pending acks that the consumer has are added to
// messagesToRedeliver. If the consumer has not been removed from the selector at this point,
If the consumer has not been removed from the selector at this point
Looks like a race condition between sending messages to the consumer and removing the consumer from the selector?
We use synchronized to protect readEntriesComplete and removeConsumer. How can this happen?
@codelipenghui Perhaps this happens if the messages to be redelivered are in the managed ledger cache. In this case, readMoreEntries() → readEntriesComplete() → sendMessagesToConsumers() are executed and completed synchronously within PersistentDispatcherMultipleConsumers#removeConsumer().
Lines 184 to 195 in 894d92b
consumer.getPendingAcks().forEach((ledgerId, entryId, batchSize, none) -> {
    if (addMessageToReplay(ledgerId, entryId)) {
        redeliveryTracker.addIfAbsent(PositionImpl.get(ledgerId, entryId));
    }
});
totalAvailablePermits -= consumer.getAvailablePermits();
if (log.isDebugEnabled()) {
    log.debug("[{}] Decreased totalAvailablePermits by {} in PersistentDispatcherMultipleConsumers. "
            + "New dispatcher permit count is {}", name, consumer.getAvailablePermits(),
            totalAvailablePermits);
}
readMoreEntries();
The synchronized protection does not help here because all of these methods are executed by the same thread, and a thread that already holds a Java monitor can re-enter it without blocking.
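To illustrate the point about same-thread execution, here is a minimal, self-contained sketch. The class ReentrantDispatcherDemo and its call log are hypothetical stand-ins and are not Pulsar code; only the method names mirror the ones discussed above. It assumes that the read completes inline on the calling thread, as it would on a cache hit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a dispatcher; this is not Pulsar code. */
public class ReentrantDispatcherDemo {
    private final List<String> calls = new ArrayList<>();

    // Entry point: acquires the monitor on `this`.
    public synchronized void removeConsumer() {
        calls.add("removeConsumer");
        readMoreEntries(); // same thread, monitor already held
    }

    private synchronized void readMoreEntries() {
        calls.add("readMoreEntries");
        // Assumption: the entries are cached, so the read callback
        // fires inline on the calling thread instead of on another thread.
        readEntriesComplete();
    }

    // Re-acquiring the monitor succeeds immediately because Java
    // intrinsic locks are reentrant for the owning thread.
    public synchronized void readEntriesComplete() {
        calls.add("readEntriesComplete");
        sendMessagesToConsumers();
    }

    private void sendMessagesToConsumers() {
        // This runs while removeConsumer() is still on the stack.
        calls.add("sendMessagesToConsumers holdsLock=" + Thread.holdsLock(this));
    }

    public List<String> calls() {
        return calls;
    }

    public static void main(String[] args) {
        ReentrantDispatcherDemo demo = new ReentrantDispatcherDemo();
        demo.removeConsumer();
        demo.calls().forEach(System.out::println);
    }
}
```

In this sketch the whole chain runs to completion before removeConsumer() returns, so the lock only excludes other threads and cannot order the steps inside a single call.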
Ok, I think the stack trace here would be like:
removeConsumer()
↓
super.removeConsumer()
↓
readMoreEntries() (this is what triggers the redelivery of the messages that were pending on the removed consumer)
↓
readEntriesComplete()
↓
sendMessagesToConsumers()
As mentioned, this can select the removed consumer, which is still in the selector list.
At this point, sendMessagesToConsumers() will fail and the message will stay in the pendingAcks set for that consumer, but since the consumer was already removed, the redelivery of this message will not happen.
I think this change is the correct one.
👍
(cherry picked from commit 8065d6c)
Motivation
Repeatedly opening and closing consumers on a Key_Shared subscription can occasionally stop dispatching to all consumers. The following are the stats of the topic when the problem occurred.
stats.json
The strange thing is that every consumer has an unackedMessages value of 0, but the subscription-level unackedMessages value is 1.
Modifications
The cause of this issue is the following part:
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentStickyKeyDispatcherMultipleConsumers.java
Lines 124 to 125 in 894d92b
When removeConsumer() of the superclass is called, the pending acks owned by that consumer are added to messagesToRedeliver.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java
Lines 184 to 188 in 894d92b
However, the consumer has not yet been removed from selector, so the broker attempts to send messages to the consumer that has already been closed. Those messages are removed from messagesToRedeliver, but they aren't actually sent to any consumer.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentStickyKeyDispatcherMultipleConsumers.java
Lines 204 to 210 in 894d92b
As a result, the mark-delete position does not move and all consumers will get stuck.
Therefore, in PersistentStickyKeyDispatcherMultipleConsumers#removeConsumer(), we need to remove the consumer from selector before calling removeConsumer() of the superclass.
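The effect of the ordering can be sketched with a small toy model. Everything below is a hypothetical simplification and not the Pulsar implementation: RemoveOrderSketch, its Consumer class, the list-based selector that always picks the first entry, and the lost list are assumptions made purely for illustration. The model only reproduces the behavior described above: the superclass step queues the pending acks and replays them inline, and a message routed to a closed consumer leaves messagesToRedeliver without reaching anyone.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the removal ordering; names and structure are hypothetical. */
public class RemoveOrderSketch {
    static class Consumer {
        final String name;
        boolean closed;
        final Set<Long> pendingAcks = new HashSet<>();

        Consumer(String name) {
            this.name = name;
        }
    }

    // Stand-in for the sticky-key selector (assumption: pick the first entry).
    final List<Consumer> selector = new ArrayList<>();
    final Deque<Long> messagesToRedeliver = new ArrayDeque<>();
    // Messages taken out of the replay queue but never delivered.
    final List<Long> lost = new ArrayList<>();

    // Models the superclass removeConsumer(): queue pending acks, then replay inline.
    void superRemoveConsumer(Consumer consumer) {
        messagesToRedeliver.addAll(consumer.pendingAcks);
        consumer.pendingAcks.clear();
        consumer.closed = true;
        sendMessagesToConsumers();
    }

    void sendMessagesToConsumers() {
        while (!messagesToRedeliver.isEmpty() && !selector.isEmpty()) {
            long message = messagesToRedeliver.poll();
            Consumer target = selector.get(0);
            if (target.closed) {
                lost.add(message); // dequeued, but nobody receives it
            } else {
                target.pendingAcks.add(message);
            }
        }
    }

    /** Returns how many messages were lost for the given ordering. */
    static int run(boolean removeFromSelectorFirst) {
        RemoveOrderSketch dispatcher = new RemoveOrderSketch();
        Consumer a = new Consumer("a");
        Consumer b = new Consumer("b");
        dispatcher.selector.add(a);
        dispatcher.selector.add(b);
        a.pendingAcks.add(1L);

        if (removeFromSelectorFirst) {
            dispatcher.selector.remove(a);      // ordering proposed in this change
            dispatcher.superRemoveConsumer(a);
        } else {
            dispatcher.superRemoveConsumer(a);  // previous ordering
            dispatcher.selector.remove(a);
        }
        return dispatcher.lost.size();
    }

    public static void main(String[] args) {
        System.out.println("lost with previous ordering: " + run(false));
        System.out.println("lost with selector removal first: " + run(true));
    }
}
```

In this model the previous ordering strands the replayed message on the closed consumer, while removing the consumer from the selector first lets the replay reach the remaining consumer.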