[fix][broker] Expect msgs after server initiated CloseProducer #19446

Member

@michaeljmarshall michaeljmarshall commented Feb 7, 2023

Motivation

The intention of #12780 was to close the whole connection when the client sends an unexpected Send command. The goal of that change was to make the broker more defensive to prevent incorrect implementations in the client (see #12779) from leading to out of order messages.

However, the change in #12780 was too broad: it also closes the connection in a case that is entirely expected. Specifically, when the server disconnects a producer due to a load balancing or unloading event, the broker sends the client a CloseProducer command. If the broker then receives any additional messages for that producer, it closes the whole connection. This is an expensive interruption for clients with many producers/consumers. Because the Pulsar Producer is expected to pipeline Send commands, the broker currently has no way to know whether the client sent the messages before or after receiving the CloseProducer command. Since the goal of #12780 was to increase stability, I think we should ignore messages received under these conditions.

I propose that we keep a map of recently closed producers, where each entry acts as a tombstone for its producer, and that we limit how long each tombstone is kept.

For reference, when connections are closed due to this weakness in the implementation, the broker logs:

log.warn("[{}] Received message, but the producer is not ready : {}. Closing the connection.",
remoteAddress, send.getProducerId());

I observed this log line more than 17,000 times in the past 7 days. As such, I plan to cherry pick this to active release branches.

Modifications

  • Add a HashMap to track recently closed producers. This map is only ever updated on the ServerCnx's event loop. The one downside is that the HashMap will box the long keys and values. However, it is likely faster than the ConcurrentLongHashMap since it does not have any synchronization.
  • Add logic to ignore Send commands if they arrive shortly after the broker sent a CloseProducer command.
  • Use the keep alive interval as the length of time the producer is considered "recently" closed. It is possible that some will want a new configuration here. I do not think we need one because we are really just waiting for the client to receive the CloseProducer request, and the keep alive interval should be sufficient for a full round trip from broker to client.
  • Remove the recently closed producer from the map if the client attempts to recreate the producer. In this condition, the client has already received the close command and is attempting to create a new producer.
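
The bookkeeping in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `RecentlyClosedProducersSketch`, the `SendAction` enum, and all method names are hypothetical, and the timer that fires after the keep-alive interval is left to the caller. It assumes, as the description says, that every call happens on a single thread (the connection's event loop), which is why a plain `HashMap` is used.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the tombstone map; not the actual ServerCnx code. */
class RecentlyClosedProducersSketch {
    /** What to do with a Send command that targets a producer that is not ready. */
    enum SendAction { IGNORE, CLOSE_CONNECTION }

    // producerId -> epoch of the producer the broker closed.
    // Not thread safe: assumed to be touched only from one event-loop thread.
    private final Map<Long, Long> recentlyClosed = new HashMap<>();

    /** The broker sent CloseProducer: leave a tombstone for that producer. */
    void onBrokerClosedProducer(long producerId, long epoch) {
        recentlyClosed.put(producerId, epoch);
    }

    /** A Send arrived for a producer that is missing or not ready. */
    SendAction onSendForUnknownProducer(long producerId) {
        return recentlyClosed.containsKey(producerId)
                ? SendAction.IGNORE            // likely pipelined before the client saw the close
                : SendAction.CLOSE_CONNECTION; // unexpected Send: stay defensive
    }

    /** The client asked to (re)create the producer, so it has seen the close. */
    void onProducerCommand(long producerId) {
        recentlyClosed.remove(producerId);
    }

    /** The "recently closed" window (for example, the keep-alive interval) elapsed. */
    void onWindowElapsed(long producerId, long epoch) {
        // Two-argument remove: only drops the entry if it still holds this epoch.
        recentlyClosed.remove(producerId, epoch);
    }

    public static void main(String[] args) {
        RecentlyClosedProducersSketch cnx = new RecentlyClosedProducersSketch();
        cnx.onBrokerClosedProducer(7L, 0L);
        System.out.println(cnx.onSendForUnknownProducer(7L));
        cnx.onWindowElapsed(7L, 0L);
        System.out.println(cnx.onSendForUnknownProducer(7L));
    }
}
```

Keeping the timer outside the class is only a convenience for the sketch; in a broker the expiry would presumably be scheduled on the same event loop so that the map never needs synchronization.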

Verifying this change

New tests are added.

Does this pull request potentially affect one of the following parts:

  • The binary protocol

This affects the protocol in a sense, but it does not change the protocol in any negative way.

Documentation

  • doc-not-needed

Matching PR in forked repository

PR in forked repository: michaeljmarshall#24

@lhotari lhotari requested a review from merlimat February 7, 2023 06:54
Member

@lhotari lhotari left a comment

Good work @michaeljmarshall!

Contributor

@eolivelli eolivelli left a comment

Good solution

@@ -1622,6 +1625,14 @@ protected void handleSend(CommandSend send, ByteBuf headersAndPayload) {
CompletableFuture<Producer> producerFuture = producers.get(send.getProducerId());

if (producerFuture == null || !producerFuture.isDone() || producerFuture.isCompletedExceptionally()) {
if (recentlyClosedProducers.containsKey(send.getProducerId())) {
Contributor

Should we compare the "epoch" here? Maybe it is unnecessary, but I am not an expert in this part of the protocol.

Member Author

The send command does not have the producer's epoch, so we don't have that information in scope.

It could be valuable to discuss ways to improve the protocol for the future, like asking if the send command should have the epoch or some other identifier, but I want a backwards compatible solution that will work by upgrading the broker.

Also, I considered comparing epochs when the Producer command is handled in another part of this PR, but I think it complicates the logic more than necessary, so we ignore the value there too.

The primary reason for keeping the epoch value is to make sure the scheduled task does not remove the wrong key.
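
To illustrate that last point, here is a small sketch of why an expiry task would want to check the epoch before removing an entry. It is an assumption-laden illustration rather than the PR's code: the class `EpochGuardedExpirySketch`, its methods, and the queue standing in for the event loop's timer are all hypothetical. The only real API relied on is `Map.remove(Object key, Object value)`, which removes an entry only if the key is currently mapped to the given value.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: a stale expiry task must not delete a newer tombstone. */
class EpochGuardedExpirySketch {

    /** Records a tombstone and queues its expiry task (the queue stands in for a timer). */
    static void recordClose(Map<Long, Long> recentlyClosed, Queue<Runnable> delayedTasks,
                            long producerId, long epoch, boolean guardWithEpoch) {
        recentlyClosed.put(producerId, epoch);
        delayedTasks.add(() -> {
            if (guardWithEpoch) {
                recentlyClosed.remove(producerId, epoch); // no-op if a newer epoch is stored
            } else {
                recentlyClosed.remove(producerId);        // removes whatever is stored
            }
        });
    }

    /** Returns true if the second close's tombstone survives the first close's expiry task. */
    static boolean tombstoneSurvivesStaleTask(boolean guardWithEpoch) {
        Map<Long, Long> recentlyClosed = new HashMap<>();
        Queue<Runnable> delayedTasks = new ArrayDeque<>();
        long producerId = 1L;

        // Broker closes the producer (epoch 0) and schedules the tombstone's expiry.
        recordClose(recentlyClosed, delayedTasks, producerId, 0L, guardWithEpoch);
        // Client recreates the producer, which drops the tombstone early.
        recentlyClosed.remove(producerId);
        // Broker closes the recreated producer (epoch 1) before the first timer fires.
        recordClose(recentlyClosed, delayedTasks, producerId, 1L, guardWithEpoch);

        // The expiry task from the first close fires now.
        delayedTasks.poll().run();
        return recentlyClosed.containsKey(producerId);
    }

    public static void main(String[] args) {
        System.out.println("guarded:   " + tombstoneSurvivesStaleTask(true));
        System.out.println("unguarded: " + tombstoneSurvivesStaleTask(false));
    }
}
```

In the unguarded variant the first close's timer wipes out the tombstone that belongs to the second close, so a pipelined Send for the second close would once again close the connection.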

@lhotari
Member

lhotari commented Feb 7, 2023

/pulsarbot rerun-failure-checks

Contributor

@nicoloboschi nicoloboschi left a comment

LGTM

@lhotari lhotari merged commit 524288c into apache:master Feb 7, 2023
michaeljmarshall added a commit that referenced this pull request Feb 7, 2023
michaeljmarshall added a commit that referenced this pull request Feb 7, 2023
(cherry picked from commit 524288c)
(cherry picked from commit 4cbe68e)
michaeljmarshall added a commit that referenced this pull request Feb 7, 2023
(cherry picked from commit 524288c)
(cherry picked from commit 4cbe68e)
michaeljmarshall added a commit that referenced this pull request Feb 7, 2023
(cherry picked from commit 524288c)
(cherry picked from commit 4cbe68e)
@michaeljmarshall michaeljmarshall added the cherry-picked/branch-2.8 (Archived: 2.8 is end of life) and cherry-picked/branch-2.9 (Archived: 2.9 is end of life) labels Feb 7, 2023
michaeljmarshall added a commit to datastax/pulsar that referenced this pull request Feb 8, 2023
…e#19446)

(cherry picked from commit 524288c)
(cherry picked from commit 4cbe68e)
(cherry picked from commit 283f773)
@michaeljmarshall michaeljmarshall deleted the keep-track-of-recently-closed-producers branch July 5, 2023 15:24
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Sep 5, 2023
@poorbarcode
Contributor

poorbarcode commented Sep 5, 2023

@michaeljmarshall

I think this is a great improvement.

But the change at https://github.com/apache/pulsar/pull/19446/files#diff-1e0e8195fb5ec5a6d79acbc7d859c025a9b711f94e6ab37c94439e99b3202e84R1627-R1635 leaves the client's send message future unable to complete.

At first, I tried to complete the send message future when the producer no longer exists in the broker, but I found that I also had to deal with the order of pending requests. So, as a quick fix, I pushed a new PR (#21134) to revert the current PR.

poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Sep 6, 2023
poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Sep 7, 2023