
[BUG] Producer with name xxx is already connect to topic #13342

Closed
wenbingshen opened this issue Dec 15, 2021 · 16 comments

@wenbingshen
Member

Describe the bug
Master issue: #13061 apache/pulsar-client-go#676
Pulsar version: 2.8.1

After our investigation, this problem occurs when the ping/pong exchange between the client and the server gradually drifts apart. Eventually the client decides the connection is dead, but its close operation fails for network reasons and the underlying network connection is never actually torn down. As a result, the Pulsar broker is still waiting for its own ping/pong timeout while the client has already reconnected the same PartitionProducer over a new connection (with a different port) and issued AddProducer to the broker.

#11804 rewrote the Producer's equals method to also compare remoteAddress. As a result, when pulsar-client-go reconnects from a different port, the old producer cannot be removed, because the conditional remove verifies equality:

```java
if (producers.remove(producer.getProducerName(), producer)) {
```

#12846 then removed equals, so the comparison falls back to default object identity (reference equality); again, the old producer instance cannot be removed by the new one.
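To make the failure mode concrete, here is a minimal, self-contained sketch; FakeProducer is a hypothetical stand-in for Pulsar's Producer, illustrating why ConcurrentHashMap.remove(key, value) becomes a no-op once equals() also compares the remote address:

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

public class ConditionalRemoveDemo {
    // Hypothetical stand-in for the broker-side Producer, with the
    // #11804-style equals() that also compares the remote address.
    static class FakeProducer {
        final String name;
        final String remoteAddress;

        FakeProducer(String name, String remoteAddress) {
            this.name = name;
            this.remoteAddress = remoteAddress;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FakeProducer)) {
                return false;
            }
            FakeProducer other = (FakeProducer) o;
            return name.equals(other.name) && remoteAddress.equals(other.remoteAddress);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, remoteAddress);
        }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, FakeProducer> producers = new ConcurrentHashMap<>();
        producers.put("p1", new FakeProducer("p1", "10.0.0.5:50001"));

        // The reconnected client presents the same producer name but a new
        // source port, so equals() fails and the conditional remove is a no-op.
        FakeProducer reconnected = new FakeProducer("p1", "10.0.0.5:50002");
        System.out.println(producers.remove("p1", reconnected)); // prints: false
    }
}
```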

This situation resolves itself once the broker's ping/pong times out or the channel becomes abnormal: the broker closes the connection and cleans up the producer state, and the client's next AddProducer succeeds. But while the client reconnects and re-adds the producer during that window, the broker will always report an error: Producer with name is already connect to topic.

Therefore, I feel the current protocol cannot fully prove whether a producer client is overwriting itself. It may be necessary to add some fields so the client can prove: I am me.

To Reproduce
Steps to reproduce the behavior:

  1. Change the broker's keepAliveIntervalSeconds to 100
  2. Use a Pulsar client in any language, such as pulsar-client-go or the Java client
  3. Use the client to send data to the Pulsar server
  4. Use a firewall to cut the network between the client and the broker for about 60 seconds, then remove the firewall rule
  5. Check the broker log; you will see the error: Producer with name is already connect to topic


@wenbingshen added the type/bug label on Dec 15, 2021
@wenbingshen changed the title from "[BUG]" to "[BUG] Producer with name xxx is already connect to topic" on Dec 15, 2021

@BewareMyPower
Contributor

It looks like the same issue as #13289. /cc @aloyszhang

@BewareMyPower
Contributor

Could you confirm whether #13428 fixed the problem?

@wenbingshen
Member Author

> Could you confirm whether #13428 fixed the problem?

I don't think the problem is solved, because when the problem occurs, the channelInactive method is never entered; that is, after the producer reconnects, the old connection still does not trigger channelInactive.

I just built a test with the master branch (including #13428), and the reproduction steps described in the issue can still be reproduced stably.


@github-actions

The issue had no activity for 30 days, mark with Stale label.

@mattisonchao
Member

I'm working on it and some solutions need to be discussed. Once a reasonable solution is finalized, I'll fix it in time.

@eolivelli
Contributor

I see this problem quite often in some environments.

Do you have news @mattisonchao ?

@Technoboy-
Contributor

Ah, we found that keepAliveIntervalSeconds exists on both the client side and the server side; the default value is 30s.
In your reproduction steps, the server side is set to 100s.
The broker only cleans up the producer info when the channel becomes inactive, so that takes about 100s here.
But the client side will close the channel after 30s, and the firewall may prevent the close from ever reaching the broker, so the client simply reconnects. In that case, the server side still holds the old producer info and throws the above exception.

@Technoboy-
Contributor

For this issue, it would be better for the client to configure this value and for the broker to use the client's value.
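For reference, the interval is already configurable on the Java client via ClientBuilder#keepAliveInterval; a minimal sketch, assuming a local broker (the service URL is a placeholder):

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.PulsarClient;

public class KeepAliveConfig {
    public static void main(String[] args) throws Exception {
        // Keep the client's keep-alive probe interval aligned with the broker's
        // keepAliveIntervalSeconds (both default to 30s) so that neither side
        // declares the connection dead long before the other one notices.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .keepAliveInterval(30, TimeUnit.SECONDS)
                .build();
        client.close();
    }
}
```

Today the client and broker intervals are set independently, which is exactly what allows the mismatch described above.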

@lhotari
Member

lhotari commented Jun 2, 2022

There were some bugs that caused the connection to stay open for long durations:

  - #15382 closes the connection if a ping or pong message cannot be sent.
  - #15366 fixes a proxy connection leak when the inbound connection closes while connecting is in progress.

Both of the above fixes address issues where the symptom is "Producer with name xxx is already connect to topic".

@lhotari
Member

lhotari commented Jun 2, 2022

> To Reproduce
> Steps to reproduce the behavior:
>
>   1. Change the broker's keepAliveIntervalSeconds to 100
>   2. Use a Pulsar client in any language, such as pulsar-client-go or the Java client
>   3. Use the client to send data to the Pulsar server
>   4. Use a firewall to cut the network between the client and the broker for about 60 seconds, then remove the firewall rule
>   5. Check the broker log; you will see the error: Producer with name is already connect to topic

@wenbingshen I think that what you are describing is not a bug. It might be a way to reproduce "Producer with name xxx is already connect to topic", but if those steps are followed, I think it's completely expected behavior.

@wenbingshen
Member Author

@lhotari Great, you get what I'm trying to say! I just finished the work at hand and would like to reply to @Technoboy- about keepAliveIntervalSeconds=100: changing keepAliveIntervalSeconds in the reproduction steps is just a way to reproduce the problem easily, so that everyone can understand the problem I want to describe. In our production environment we never change the keepAliveIntervalSeconds configuration; it has always been the default 30s, but even so, we often see "Producer with name xxx is already connect to topic".

@wenbingshen
Member Author

wenbingshen commented Jun 2, 2022

> There were some bugs that caused the connection to stay open for long durations. #15382 closes the connection if a ping or pong message cannot be sent. #15366 fixes a proxy connection leak when the inbound connection closes while connecting is in progress. Both of the above fixes address issues where the symptom is "Producer with name xxx is already connect to topic".

These PRs look great, but I'm not sure whether they resolve #13061; that may need to be validated in a production environment.

> I think that what you are describing is not a bug. It might be a way to reproduce "Producer with name xxx is already connect to topic", but if those steps are followed, I think it's completely expected behavior.

If the reproduction steps end up demonstrating expected behavior, it may not be easy to reproduce the underlying issue.
Perhaps we should consider whether this expected behavior is correct, and whether there is a way to replace or optimize it.

> I see this problem quite often in some environments.

@eolivelli We often encounter this problem in some environments too.

@michaeljmarshall
Member

Just want to post the explicit requirements that must be met for a producer to reconnect. Here is the producer logic:

```java
public boolean isSuccessorTo(Producer other) {
    return Objects.equals(producerName, other.producerName)
            && Objects.equals(topic, other.topic)
            && producerId == other.producerId
            && Objects.equals(cnx, other.cnx)
            && other.getEpoch() < epoch;
}
```

and here is where it's called:

```java
private void tryOverwriteOldProducer(Producer oldProducer, Producer newProducer)
        throws BrokerServiceException {
    if (newProducer.isSuccessorTo(oldProducer) && !isUserProvidedProducerName(oldProducer)
            && !isUserProvidedProducerName(newProducer)) {
        oldProducer.close(false);
        if (!producers.replace(newProducer.getProducerName(), oldProducer, newProducer)) {
            // Met concurrent update, throw exception here so that client can try reconnect later.
            throw new BrokerServiceException.NamingException("Producer with name '" + newProducer.getProducerName()
                    + "' replace concurrency error");
        } else {
            handleProducerRemoved(oldProducer);
        }
    } else {
        throw new BrokerServiceException.NamingException(
                "Producer with name '" + newProducer.getProducerName() + "' is already connected to topic");
    }
}
```

> Therefore, I feel the current protocol cannot fully prove whether a producer client is overwriting itself. It may be necessary to add some fields so the client can prove: I am me.

This is an interesting question. The current design assumes that a producer is "a successor" to a former instance of a producer when it has the same connection. Given the way that producers and the ServerCnx are integrated, I think we'd need to refactor the code to remove the connection from the isSuccessorTo logic. Perhaps we could achieve producer identity with some kind of UUID passed to the client during producer creation.
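To illustrate the idea, here is a hypothetical variation of the isSuccessorTo method shown above; producerIdentityToken is not a real field in Pulsar's protocol, just an assumption about what such an identity proof could look like:

```java
// Hypothetical sketch only: producerIdentityToken is NOT a real Pulsar field.
// The client would mint a token (e.g. a UUID) when the producer is first
// created and re-send it on every reconnect, letting the broker recognize
// "the same producer" without comparing the (now different) connection.
public boolean isSuccessorTo(Producer other) {
    return Objects.equals(producerName, other.producerName)
            && Objects.equals(topic, other.topic)
            && producerId == other.producerId
            && Objects.equals(producerIdentityToken, other.producerIdentityToken)
            && other.getEpoch() < epoch;
}
```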

> The broker will always report an error: Producer with name is already connect to topic.

This raises another question for me. Should the producer use another name when it attempts to reconnect after certain failures, like a failed keep-alive timeout? If the producer tried to connect to a broker and got a new producer name from the broker, it'd circumvent the issue here. It wouldn't work for producer names overridden on the client side, and it could also lead to edge cases around potential duplicates in the produced messages (though those would likely already happen when a keep-alive fails).
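For context, whether a producer name is broker-assigned or user-provided is decided entirely by the client builder; a minimal sketch with the Java client (service URL and topic are placeholders):

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class ProducerNameExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Broker-assigned name: the broker generates a unique name, and the
        // tryOverwriteOldProducer() path above may replace a stale entry.
        Producer<byte[]> generated = client.newProducer()
                .topic("persistent://public/default/my-topic")
                .create();

        // User-provided name: isUserProvidedProducerName() is true, so the
        // broker refuses to overwrite an existing producer with this name.
        Producer<byte[]> named = client.newProducer()
                .topic("persistent://public/default/my-topic")
                .producerName("my-fixed-producer")
                .create();

        named.close();
        generated.close();
        client.close();
    }
}
```

Only the broker-assigned case can take the overwrite path in the code above; a user-provided name always hits the NamingException branch while a stale entry exists.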

@michaeljmarshall
Member

michaeljmarshall commented Jun 2, 2022

My fix for #12846 was in 2.8.2, and this issue was opened against 2.8.1. Also note that the Go client started using epoch in the producer connection logic in apache/pulsar-client-go#582. That went into 0.7.0, which is the version used in apache/pulsar-client-go#676.

I think we need to reproduce this issue for the go client against a newer broker version before we dig into this further.

@michaeljmarshall
Member

Closing this issue and marking it as fixed for now, since there is a likely fix and the issue hasn't been reproduced against newer, fixed versions of Pulsar.
