Fix the topic in fenced state and can not recover. #11737

Merged
3 commits merged on Aug 25, 2021

Commits on Aug 21, 2021

  1. Fix the topic in fenced state and can not recover.

    Here is the log from when the issue happens. The producer keeps reconnecting to the broker, but the topic's fenced state remains true.
    ```
    19:01:42.351 [pulsar-io-4-1] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.24.34.151:48052][persistent://public/default/test-8] Creating producer. producerId=8
    19:01:42.352 [Thread-174681] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.24.34.151:48052] persistent://public/default/test-8 configured with schema false
    19:01:42.352 [Thread-174681] WARN  org.apache.pulsar.broker.service.AbstractTopic - [persistent://public/default/test-8] Attempting to add producer to a fenced topic
    19:01:42.352 [Thread-174681] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.24.34.151:48052] Failed to add producer to topic persistent://public/default/test-8: producerId=8, org.apache.pulsar.broker.service.BrokerServiceException$TopicFencedException: Topic is temporarily unavailable
    ```
    
    After checking the heap dump of the broker, `pendingWriteOps` is 5; this is why the topic cannot recover from the fenced state.
    
    The topic changes back to unfenced only when `pendingWriteOps` reaches 0; details can be found at [PersistentTopic.decrementPendingWriteOpsAndCheck()](https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L463).
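
    A minimal sketch of that condition (simplified, with abbreviated field names; the linked `PersistentTopic` source is the authoritative implementation):

    ```java
    import java.util.concurrent.atomic.AtomicLong;

    // Simplified stand-in for PersistentTopic's fencing logic: the topic can only
    // leave the fenced state once every pending write op has completed or failed
    // and pendingWriteOps has dropped back to 0. An op that is silently dropped
    // keeps the topic fenced forever.
    class FencingSketch {
        final AtomicLong pendingWriteOps = new AtomicLong();
        volatile boolean isFenced;

        void decrementPendingWriteOpsAndCheck() {
            if (pendingWriteOps.decrementAndGet() == 0 && isFenced) {
                synchronized (this) {
                    if (isFenced) {
                        isFenced = false; // un-fence: the topic accepts producers again
                    }
                }
            }
        }
    }
    ```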
    
    But after checking the ML state of the topic, `pendingAddEntries` is 0, which does not match the topic's `pendingWriteOps`.
    The root cause is that we poll add entry ops from `pendingAddEntries` in multiple threads: one is the ZK callback thread when the ledger creation completes (https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1406, https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L1669),
    and the other is the ML worker thread when the add entry op completes (https://github.com/apache/pulsar/blob/794aa20d9f2a4e668cc36465362d22e042e6e536/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpAddEntry.java#L181).
    
    After the ongoing add entry op completes, the corresponding op may already have been polled by the `clearPendingAddEntries` method. The completion path then polls another op, but because that op is not equal to the current one, it never gets a chance to be failed, so `pendingWriteOps` never drops back to 0.
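
    An illustrative sketch of the race (self-contained, hypothetical names rather than the real `ManagedLedgerImpl`/`OpAddEntry` classes): two threads poll the same pending-ops queue, and if the clear path has already taken the current op, the completion path receives a different op that, before this fix, was never completed or failed.

    ```java
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical stand-ins for ManagedLedgerImpl.pendingAddEntries and OpAddEntry.
    class PendingAddRaceSketch {
        interface Op { void fail(Exception reason); }

        final Queue<Op> pendingAddEntries = new ConcurrentLinkedQueue<>();

        // ML worker thread: the current op's write has completed.
        void onAddComplete(Op currentOp) {
            Op firstInQueue = pendingAddEntries.poll();
            if (firstInQueue != currentOp) {
                // Pre-fix behaviour: the mismatched op is neither completed nor failed,
                // so its callback never runs and the topic's pendingWriteOps stays > 0.
            }
        }

        // ZK callback thread: e.g. ledger creation failed, so drain and fail all pending ops.
        void clearPendingAddEntries(Exception reason) {
            Op op;
            while ((op = pendingAddEntries.poll()) != null) {
                op.fail(reason);
            }
        }
    }
    ```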
    
    I have attached the complete logs for the topic.
    
    The fix is to complete the add entry op with a `ManagedLedgerException` if the polled op is not equal to the current op.
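
    Continuing the sketch above, the completion path changes so that a mismatched op is failed explicitly; the exception type below is a stand-in for the `ManagedLedgerException` used in the real change.

    ```java
    // ML worker thread: the current op's write has completed.
    void onAddComplete(Op currentOp) {
        Op firstInQueue = pendingAddEntries.poll();
        if (firstInQueue == null) {
            return; // queue already drained by clearPendingAddEntries
        }
        if (firstInQueue != currentOp) {
            // The fix: fail the polled op so its callback still runs and
            // pendingWriteOps can drop back to 0, letting the topic un-fence.
            firstInQueue.fail(new IllegalStateException(
                    "Unexpected add entry op when completing the current op"));
        }
    }
    ```
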
    codelipenghui committed Aug 21, 2021 (commit a4fad97)

Commits on Aug 23, 2021

  1. Release buffer.

    codelipenghui committed Aug 23, 2021 (commit b979df4)

Commits on Aug 24, 2021

  1. Revert

    codelipenghui committed Aug 24, 2021 (commit 18b3fa4)