Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] When reconnecting to the server, the old connection is not closed #676

Open
wenbingshen opened this issue Dec 1, 2021 · 15 comments
Open

Comments

@wenbingshen
Copy link
Member

Expected behavior

When reconnecting to the server, the old connection should be closed.

Actual behavior

When reconnecting to the server, the old connection is not closed.

The following log progresses according to the timeline:

  1. First, there is an active channel and producer that is writing data to the broker
    image

  2. In the case that the previous connection has not been disconnected, the client sends an add producer operation for the same producerName
    image

  3. This addition operation comes from the same ip of the client, a different port connection, and the producerName is equal to the previous one
    image

  4. Remove the producer in ServerCnx from the pulsar server, Producer getting producer busy is removing existing producer from list pulsar#11804, this PR modified the equals method of the Producer, resulting in the inability to remove the producer from the topic's prodcuers map.
    image

  5. Then the old connection starts to be disconnected, and the producer status of the server on the connection is cleared
    image

image

  1. The new channel connection keeps retrying addProducer, and the producer is added successfully after closing the old connection.

Steps to reproduce

System configuration

Pulsar version: x.y

pulsar-client-go: 0.7.0
pulsar broker: 2.8.1

@wenbingshen
Copy link
Member Author

The client does not provide the producer name, it is generated by the pulsar broker and returned to the client. Old connection address: ab.cd.22.174:44352, new connection address: ab.cd.22.174:49750, the same produceName: pulsar_s1-8-346

When a new connection is established and request to add a producer to pulsar broker, the old connection address is still maintaining a relationship with the server, so the pulsar broker will not clear the producer cache related to the old connection

@wenbingshen
Copy link
Member Author

@wolfstudy @cckellogg @zymap PTAL, thanks.

@cckellogg
Copy link
Contributor

I'm trying to understand better what you are seeing. Is below what is happening?

1 - producer-1 with name my-producer is connected to a broker on topic-1
2 - app tries to create new producer producer-2 with the same name my-producer for topic-1
3 - the broker sends back an error saying producer with name my-producer is already connected to the topic-1

@wenbingshen
Copy link
Member Author

@cckellogg The second point is a bit wrong.
1 - producer-1 with name my-producer is connected to a broker on topic-1
2 - app tries to create new producer producer-1 with the same name my-producer for topic-1
3 - the broker sends back an error saying producer with name my-producer is already connected to the topic-1

Because the client does not provide producerName manually, if producer-2 is created at the second point, the pulsar server will return a completely different producerName, such as my-producer-2.

Because of this, at the first point, the client address is ab.cd.22.174:44352, and at the second point, the client address is ab.cd.22.174:49750 , and when the second point occurs, the first point’s that client address is still not disconnected with pulsar broker. Therefore, the pulsar broker cannot clear the related producer information on the server side, causing the new connection to repeatedly execute pb.BaseCommand_PRODUCER in the reconnectToBroker retry logic.

I don't understand the go language, but I roughly read the implementation of pulsar-client-go about connection. Although there is no evidence that the reconnectToBroker logic is entered when the old connection is not closed, it does happen.

Because you are more familiar with this, can you do some investigation on it? Thanks.

@wenbingshen
Copy link
Member Author

client log in time 2021-11-27T13:51:58

image

broker log with new client connection with ab.cd.22.174:49750 in time 2021-11-27T13:51:58

image

@wenbingshen
Copy link
Member Author

wenbingshen commented Dec 2, 2021

Regarding another producerName: pulsar_s1-8-501 has 5 timelines on the broker log side:

  1. pulsar_s1-8-501 with connection ab.cde.138.8:16631 is sending data normally
    image

  2. The broker receives the request to add a producer from the new connection ab.cde.138.8:48277 for pulsar_s1-8-501 , and because the channel id of the producer is inconsistent, the old producer cannot be overwritten.
    image
    image
    image
    image

  3. pulsar_s1-8-501 with old connection ab.cde.138.8:16631 continues to send data normally
    image

  4. Now the server has triggered channelInactive and started to close the connection ab.cde.138.8:16631 related state, including producer information.
    image
    image

  5. Finally, the pulsar broker again receives the new connection ab.cde.138.8:48277 for pulsar_s1-8-501 request to add Producer, and successfully adds the Producer.
    image

@cckellogg
Copy link
Contributor

Are you able to reproduce this consistently?

What are those steps?

Is the goal of the application to create two producers (producer-1 and producer-2) and have them both send messages to the same topic?

@wenbingshen
Copy link
Member Author

@cckellogg We have encountered it twice, each time the client can run normally for 4 to 5 days, but during the period, the client and the server cannot respond in time due to network factors or ping/pong, which leads to the need to disconnect. During the reconnection period, the error that the producer has already connected to topic often occurs.

this is the client logic
image

@wenbingshen
Copy link
Member Author

Are you able to reproduce this consistently?

What are those steps?

Is the goal of the application to create two producers (producer-1 and producer-2) and have them both send messages to the same topic?

Our program can always run normally for 4 to 5 days, but this kind of error always occurs when the network is disconnected and reconnected during the period. Therefore, pulsar-client-go must ensure that the old connection is closed first and then the new connection is established.

@thinker0
Copy link
Member

thinker0 commented Jun 20, 2022

This problem is occurring irregularly in pulsar-client-go.

2022-06-20T22:59:49.163+0900	ERROR	[pulsar]	pulsar/client.go:122	Reconnect failed: server error: ProducerBusy: org.apache.pulsar.broker.service.BrokerServiceException$NamingException: Producer with name 'app046-xxxxxx-93' is already connected to topic
2022-06-20T22:59:49.163+0900	ERROR	[pulsar]	pulsar/client.go:118	produce send failed: message send timeout: TimeoutError

@michaeljmarshall
Copy link
Member

@thinker0 - which version of Pulsar are you using? This issue references using brokers running 2.8.1. A fix for this issue (or a similar one) was released in 2.8.2 and above.

@thinker0
Copy link
Member

thinker0 commented Jun 29, 2022

which version of Pulsar are you using?
@michaeljmarshall

  • pulsar-broker-2.9.2
  • pulsar-discovery-2.8.3

@michaeljmarshall
Copy link
Member

@thinker0 - thanks, also which version of go client are you using?

@thinker0
Copy link
Member

	github.com/apache/pulsar-client-go v0.8.1-0.20220309072056-bb2fa811a0f5

@zzzming
Copy link
Contributor

zzzming commented Jan 26, 2024

We are still seeing this issue with Broker 2.10.5 and Go client lib 0.11.0.

The problem is more evident when there is network issues. I wonder if reconnectToBroker() can add more wait time when ProducerBusy error is detected. Broker needs time to clear the stale producer state/entry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants