[ServerCnx] Only reply to client when code completes producerFuture #13949

michaeljmarshall · 2022-01-25T20:03:46Z

Motivation

We should only send the error response to the client when the code is able to complete the producerFuture. This logic is described here:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java

Lines 1286 to 1293 in 2285d02

    
    // If client timed out, the future would have been completed
    // by subsequent close. Send error back to
    // client, only if not completed already.
    if (producerFuture.completeExceptionally(exception)) {
        commandSender.sendErrorResponse(requestId,
                BrokerServiceException.getClientErrorCode(cause), cause.getMessage());
    }
    producers.remove(producerId, producerFuture);

Edit: in a previous version of this motivation section, I attributed the current behavior to #12874. That PR did not introduce this behavior, though.

Modifications

Move the response to the client into a conditional block that only runs when this section of the code is able to complete the future.
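
To illustrate the modification, here is a minimal, self-contained sketch of the pattern. It relies on `CompletableFuture.completeExceptionally` returning `true` only for the call that actually completes the future. The class `ReplyOnlyIfCompleted`, the method `failAndMaybeReply`, and the `sent` list (a stand-in for `commandSender`) are hypothetical names for this example and are not taken from `ServerCnx`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative stand-in, not the real ServerCnx.
class ReplyOnlyIfCompleted {
    // Records the request IDs that received an error response (plays the role of commandSender).
    static final List<Long> sent = new ArrayList<>();

    // Fails the future and replies only if this call is the one that completed it.
    static boolean failAndMaybeReply(CompletableFuture<String> producerFuture,
                                     long requestId, Throwable cause) {
        // completeExceptionally returns true only when this call moves the future to "done".
        if (producerFuture.completeExceptionally(cause)) {
            sent.add(requestId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Case 1: the future is still pending, so the failure is reported to the client.
        CompletableFuture<String> pending = new CompletableFuture<>();
        System.out.println(failAndMaybeReply(pending, 1L, new IllegalStateException("topic load failed")));

        // Case 2: something else (for example a close) already completed the future, so stay silent.
        CompletableFuture<String> closed = new CompletableFuture<>();
        closed.completeExceptionally(new IllegalStateException("closed by client"));
        System.out.println(failAndMaybeReply(closed, 2L, new IllegalStateException("topic load failed")));

        System.out.println(sent);
    }
}
```

In this sketch only request 1 ends up in `sent`; request 2 gets no reply because its future was already completed elsewhere.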

Verifying this change

This is a trivial change.

Does this pull request potentially affect one of the following parts:

It impacts how the broker interacts with the client. This change ensures that we have the correct behavior.

Documentation

no-need-doc

This change is completely internal.

mattisonchao

LGTM +1

Jason918

LGTM

Jason918 · 2022-01-26T02:16:15Z

> In #12874, we reply to the client in all cases. That is not our current design though.

This issue existed long before #12874, which only changed the synchronous, blocking implementation to an asynchronous one.

I wonder whether other code paths have the same issue.

michaeljmarshall · 2022-01-26T04:07:25Z

> > In #12874, we reply to the client in all cases. That is not our current design though.
>
> This issue existed long before #12874, which only changed the synchronous, blocking implementation to an asynchronous one.
>
> I wonder whether other code paths have the same issue.

@Jason918 - thank you for calling this out, sorry about that misattribution. I noticed the code today when looking at recent changes, but I failed to dig deep enough to know that it predated your commit. I updated the PR description.

I did inspect the rest of the class today, and I don't see the behavior anywhere else.

codelipenghui · 2022-01-26T04:37:00Z

We should apply this change carefully.

Consider a client that behaves as follows:

1. The client sends a command to create a producer.
2. The broker receives the command and starts processing the request, but does not complete it before the client-side operation timeout expires.
3. The client sends a new create-producer command with the same producer ID but a different request ID. (Of course, this usually doesn't happen.)
4. The broker receives the new create-producer command and uses the existing future https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1173 to process the new request.
5. The previous request then fails and completes the future with an exception.
6. The client never receives a response for the second request.

In my opinion, we should handle this case according to the following principles:

* The server side should always return a response to the client. One producer future might map to multiple request IDs, so we should avoid the broker missing a response to the client. Of course, if the connection is not available, there is no way to guarantee this.
* The client side should process only one response per request ID. If the request ID has already been completed by another response, the client should ignore subsequent responses.

I think this is the easier and clearer way to handle both the client side and the server side. The client currently follows this approach as well: https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L534, https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L719
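
The second principle (process at most one response per request ID) could be sketched as below. This is only an illustration under assumptions: `PendingRequests`, `register`, `onResponse`, and `onTimeout` are hypothetical names, and the real `ClientCnx` code linked above may be structured differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

// Hypothetical client-side tracker; names are illustrative, not ClientCnx's.
class PendingRequests {
    private final ConcurrentHashMap<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Starts tracking a request and returns the future its caller waits on.
    CompletableFuture<String> register(long requestId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(requestId, future);
        return future;
    }

    // Returns true if the response was delivered, false if it was ignored.
    boolean onResponse(long requestId, String payload) {
        // remove() hands the future to exactly one caller; later responses see null.
        CompletableFuture<String> future = pending.remove(requestId);
        if (future == null) {
            return false; // request already finished or timed out, so ignore the response
        }
        return future.complete(payload);
    }

    // Called when the client-side operation timeout fires for a request.
    boolean onTimeout(long requestId) {
        CompletableFuture<String> future = pending.remove(requestId);
        return future != null
                && future.completeExceptionally(new TimeoutException("request " + requestId + " timed out"));
    }

    public static void main(String[] args) {
        PendingRequests requests = new PendingRequests();
        CompletableFuture<String> first = requests.register(1L);
        System.out.println(requests.onResponse(1L, "ProducerSuccess")); // delivered
        System.out.println(requests.onResponse(1L, "Error"));           // duplicate, ignored
        System.out.println(first.join());
    }
}
```

Removing the entry from the map before completing the future is what makes a late or duplicate response harmless in this sketch: the second lookup simply finds nothing.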

codelipenghui · 2022-01-26T04:38:18Z

A related discussion before: #13245

michaeljmarshall · 2022-01-26T05:28:37Z

> Consider a client that behaves as follows:
>
> 1. The client sends a command to create a producer.
> 2. The broker receives the command and starts processing the request, but does not complete it before the client-side operation timeout expires.
> 3. The client sends a new create-producer command with the same producer ID but a different request ID. (Of course, this usually doesn't happen.)
> 4. The broker receives the new create-producer command and uses the existing `future` https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1173 to process the new request.
> 5. The previous request then fails and completes the future with an exception.
> 6. The client never receives a response for the second request.

This is a valid concern. However, the client is supposed to send a CloseProducer command when timing out the first request before sending a Producer command. When it sends the CloseProducer command, the above scenario is avoided. This was not done in the Java Client until I fixed it a month ago: #13161. Additionally, it was not explicitly called out in the protocol documentation until a month ago either: #12948.

The consequence of my proposed PR: if a client does not follow the protocol spec, it will not receive a response (step 6). The client will then time out waiting for a response, retry, and succeed (or fail).

I agree that we need to be careful implementing this. I wouldn't cherry-pick these changes to previous branches. However, as I have mentioned elsewhere, I think this is the right design because we should only respond to a client once.

> The server side should always return a response to the client. One producer future might map to multiple request IDs, so we should avoid the broker missing a response to the client. Of course, if the connection is not available, there is no way to guarantee this.

I don't think we need to design for this case because the client is not supposed to send two Producer commands without a CloseProducer command in between. Note that we don't store the requestId for a second Producer command, so there is no way for the broker to respond to the second request. Since the client removed the first request when it timed out, replying to it won't matter. The second client request will time out, too, and the client will need to resend the Producer command a third time.

> I think this is the easier and clearer way to handle both the client side and the server side. The client currently follows this approach as well: https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L534, https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L719

In both of these examples, the client will log warning messages if it receives responses for requests that are already completed. Those warnings won't be actionable. I think we should respond to client requests just once.

codelipenghui · 2022-01-26T07:12:44Z

> However, as I have mentioned elsewhere, I think this is the right design because we should only respond to a client once.

Yes, that is what I am talking about too. The only difference is how we treat a client that sends multiple requests with the same producer ID.

We are discussing two options:

1. State in the spec, for implementers: "If you create multiple requests using the same producer ID, you might never get a response."
2. Return an error to the client: "The current producer is being created, please close the old one first!"

I lean toward the second one.

michaeljmarshall · 2022-01-27T23:31:47Z

> Return an error to the client: "The current producer is being created, please close the old one first!"

This is already the current behavior. See:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java

Lines 1175 to 1201 in f48b53d

    
    if (existingProducerFuture != null) {
        if (existingProducerFuture.isDone() && !existingProducerFuture.isCompletedExceptionally()) {
            Producer producer = existingProducerFuture.getNow(null);
            log.info("[{}] Producer with the same id is already created:"
                    + " producerId={}, producer={}", remoteAddress, producerId, producer);
            commandSender.sendProducerSuccessResponse(requestId, producer.getProducerName(),
                    producer.getSchemaVersion());
            return null;
        } else {
            // There was an early request to create a producer with same producerId.
            // This can happen when client timeout is lower than the broker timeouts.
            // We need to wait until the previous producer creation request
            // either complete or fails.
            ServerError error = null;
            if (!existingProducerFuture.isDone()) {
                error = ServerError.ServiceNotReady;
            } else {
                error = getErrorCode(existingProducerFuture);
                // remove producer with producerId as it's already completed with exception
                producers.remove(producerId, existingProducerFuture);
            }
            log.warn("[{}][{}] Producer with id is already present on the connection, producerId={}",
                    remoteAddress, topicName, producerId);
            commandSender.sendErrorResponse(requestId, error, "Producer is already present on the connection");
            return null;
        }
    }

I agree that sending a failure is the right design, since the client is not following the protocol spec (it shouldn't try to create the same producer twice). Although, technically, if the producer is already created, we just respond that it was created successfully. I am not sure that I like this design, but that is a different discussion.
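
The branching in the `ServerCnx` excerpt above boils down to three states of the existing future. A condensed sketch, assuming a hypothetical `Reply` enum in place of the real `ServerError` values and response calls, might look like this:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical condensation of the duplicate-producer check; not the real ServerCnx code.
class ExistingProducerCheck {
    // Illustrative outcomes; the broker itself uses ServerError codes and commandSender calls.
    enum Reply { SUCCESS, SERVICE_NOT_READY, PREVIOUS_ERROR }

    static Reply classify(CompletableFuture<?> existing) {
        if (existing.isDone() && !existing.isCompletedExceptionally()) {
            return Reply.SUCCESS;           // producer already created: report success again
        }
        if (!existing.isDone()) {
            return Reply.SERVICE_NOT_READY; // earlier creation still in flight
        }
        return Reply.PREVIOUS_ERROR;        // earlier creation failed: surface that error
    }

    public static void main(String[] args) {
        CompletableFuture<String> inFlight = new CompletableFuture<>();
        CompletableFuture<String> created = CompletableFuture.completedFuture("producer-1");
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("topic load failed"));

        System.out.println(classify(inFlight));
        System.out.println(classify(created));
        System.out.println(classify(failed));
    }
}
```

In every branch the duplicate request receives an immediate reply, which is why the second option discussed above is already covered by the existing code.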

I described in detail why it is problematic if the client does not send the CloseProducer command before trying to create a new producer here (https://lists.apache.org/thread/x7886r5v1dtg4c4nbptdfn97ryw097wl):

> Specifically, if the client fails to send a CloseProducer command,
> it ends up getting into a sequence of retries where each new
> Producer command receives an immediate ErrorResponse because the
> ServerCnx already has a pending producer. By sending a
> CloseProducer command, the client gives the broker permission to
> stop keeping track of the original create producer request. It also
> means that if the topic eventually loads, the broker will respond to
> the right request id with a ProducerSuccessResponse command.

This is another reason why the broker shouldn't respond if the producer future is already completed: it gets completed when the client sends a CloseProducer command.
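
A small simulation of that interaction is sketched below. It assumes, as described above, that handling CloseProducer completes the pending future; `MiniBrokerCnx`, its method names, and the string-based `responses` log are hypothetical and only mimic the shape of the broker logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical miniature of a broker connection; not the real ServerCnx.
class MiniBrokerCnx {
    final ConcurrentHashMap<Long, CompletableFuture<String>> producers = new ConcurrentHashMap<>();
    final List<String> responses = new ArrayList<>(); // stand-in for what is written to the client

    // Producer command: start tracking a pending creation.
    CompletableFuture<String> handleProducer(long producerId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        producers.put(producerId, future);
        return future;
    }

    // CloseProducer command: completing the future here keeps later outcomes silent.
    void handleCloseProducer(long producerId, long closeRequestId) {
        CompletableFuture<String> future = producers.remove(producerId);
        if (future == null) {
            return; // unknown producer; the real broker's reply is out of scope for this sketch
        }
        future.completeExceptionally(new IllegalStateException("closed before creation completed"));
        responses.add("success:" + closeRequestId);
    }

    // Broker-side creation failure, for example because the topic failed to load.
    void onCreateFailed(long producerId, long requestId,
                        CompletableFuture<String> future, Throwable cause) {
        if (future.completeExceptionally(cause)) {
            responses.add("error:" + requestId); // reply only if this call completed the future
        }
        producers.remove(producerId, future);
    }

    public static void main(String[] args) {
        MiniBrokerCnx cnx = new MiniBrokerCnx();
        CompletableFuture<String> first = cnx.handleProducer(1L);  // request 10 timed out client-side
        cnx.handleCloseProducer(1L, 11L);                           // client follows the protocol
        cnx.onCreateFailed(1L, 10L, first, new IllegalStateException("topic load failed"));
        System.out.println(cnx.responses); // only the CloseProducer acknowledgement is recorded
    }
}
```

Because the close already completed the future, the late failure for request 10 produces no response in this sketch, so the client never sees a reply for a request it has abandoned.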

codelipenghui · 2022-01-28T01:52:19Z

@michaeljmarshall Thanks for the explanation, LGTM.

> Although, technically, if the producer is already created, we just respond that it was created successfully. I am not sure that I like this design, but that is a different discussion.

Yes, it should be a different discussion. I think we should return something like a ProducerConflict exception to the client. Otherwise, if the client implementation does not follow the spec, it will have more than one active producer instance with the same producer ID.

michaeljmarshall · 2022-01-28T06:14:16Z

> Yes, it should be a different discussion. I think we should return something like a ProducerConflict exception to the client. Otherwise, if the client implementation does not follow the spec, it will have more than one active producer instance with the same producer ID.

@codelipenghui - that's a great point, and I support exploring a change to the behavior. I just looked at the git history, and we've responded to this behavior with a ProducerSuccess since the initial import of the project in 2016. The only change would be to reject the second request. The initial producer would continue to be connected.

[ServerCnx] Only reply to client when completing producerFuture

e45a348

michaeljmarshall added the type/cleanup, area/broker, and doc-not-needed labels Jan 25, 2022

michaeljmarshall added this to the 2.10.0 milestone Jan 25, 2022

michaeljmarshall requested review from merlimat, Jason918, hangc0276, jiazhai, eolivelli, 315157973 and codelipenghui January 25, 2022 20:03

michaeljmarshall self-assigned this Jan 25, 2022

mattisonchao approved these changes Jan 26, 2022

hangc0276 approved these changes Jan 26, 2022

Jason918 approved these changes Jan 26, 2022

codelipenghui approved these changes Jan 28, 2022

codelipenghui merged commit 3c6aae3 into apache:master Jan 28, 2022

michaeljmarshall deleted the conditionally-reply-to-client branch January 28, 2022 06:08
