
[Bug] parseMessageMetadata error when broker entry metadata enable with high loading #22601

Open
2 of 3 tasks
semistone opened this issue Apr 26, 2024 · 9 comments
Labels
type/bug (The PR fixed a bug or issue reported a bug)

Comments


semistone commented Apr 26, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

  • 3.2.2
  • 3.1.2

Minimal reproduce step

Publish events at about 6k QPS and 100 Mbit/s with broker entry metadata enabled,
using BatcherBuilder.KEY_BASED mode,
and send messages from a highly concurrent/parallel producer process.
It only happens for near-real-time consumers (almost zero backlog).
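
For reference, a minimal sketch (not our actual producer code; the topic name and key space are placeholders) of a producer configured as described above, with BatcherBuilder.KEY_BASED batching and keyed messages sent concurrently via sendAsync(). Broker entry metadata itself is enabled on the broker side (brokerEntryMetadataInterceptors in broker.conf), not in the client:

import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.Schema;

class KeyBasedProducerSetup {
    // Producer configured as in the reproduce step: KEY_BASED batching with keyed
    // messages; in our setup many threads call publish() in parallel.
    static Producer<byte[]> create(PulsarClient client, String topic) throws PulsarClientException {
        return client.newProducer(Schema.BYTES)
                .topic(topic)                               // placeholder topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .create();
    }

    // Fire-and-forget send; the KEY_BASED batcher groups messages by key.
    static void publish(Producer<byte[]> producer, String key, byte[] payload) {
        producer.newMessage().key(key).value(payload).sendAsync();
    }
}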

What did you expect to see?

No lost events.

What did you see instead?

The broker shows an error in its log:
Failed to peek sticky key from the message metadata

It looks like a thread-safety issue, because it happens randomly.
Out of 1M events it only happens a few times, but the consumer loses those events.

Anything else?

The error is similar to
#10967, but I think it is a different issue.

The data in BookKeeper is correct:
I can download the event from BookKeeper and parse it successfully,
or consume the same event a few minutes later and it is consumed successfully.
But all subscriptions get the same error on the same event when consuming in real time (zero backlog).

I have traced the source code. It happens in
PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer
-> Commands.peekAndCopyMessageMetadata
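
For context, a rough sketch (not the actual Commands.peekAndCopyMessageMetadata implementation) of the kind of peek that fails here: the dispatcher inspects the entry's ByteBuf at its current reader index, where a valid entry should start with the 2-byte magic number 0x0e01 (magicCrc32c) ahead of the checksum and MessageMetadata, without consuming any bytes:

import io.netty.buffer.ByteBuf;

class MetadataPeek {
    private static final short MAGIC_CRC32C = 0x0e01;

    // Returns true if the buffer looks like a valid entry at its reader index.
    // getShort(index) is an absolute getter, so the reader index is left untouched,
    // which is what "peek" means here. With the garbage bytes shown below, a check
    // like this fails and the broker logs
    // "Failed to peek sticky key from the message metadata".
    static boolean startsWithMagic(ByteBuf payload) {
        if (payload.readableBytes() < Short.BYTES) {
            return false;
        }
        return payload.getShort(payload.readerIndex()) == MAGIC_CRC32C;
    }
}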

I also printed the ByteBuf contents,
and I can clearly see the data is not the same as what is stored in BookKeeper.

In a normal event, the hex dump usually starts with 010e (magicCrc32c):

0000000      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267

In one of our error events, the ByteBuf has about 48 bytes of garbage, then continues with the normal data:

0000000      0000    a610    0000    0000    0200    7239    0000    0000 <== garbage starts here
0000020      0200    1339    0000    0000    ea17    a8b0    8b8e    fa5e
0000040      2af0    2675    f645    1623    d17e    dc34    526d    ef44 <== garbage ends here
0000060      010e    9529    5fbc    0000    0a03    3a0a    6e69    7267 <== normal data starts here

This is just one example; sometimes the first few bytes are correct and the corruption appears a few bytes later.

I am still trying to debug when and how the ByteBuf returns incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can 100% reproduce it in our producer code.

Does anyone have any idea what could be causing this issue, and any suggestions on which library or class may have potential issues? Additionally, any suggestions on how to debug this issue or if I need to print any specific information to help identify the root cause would be appreciated. Thank you.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
semistone added the type/bug label on Apr 26, 2024
lhotari (Member) commented Apr 26, 2024

@semistone Just wondering if this could be related to apache/bookkeeper#4196?
There are also other recent ByteBuf retain/release fixes, such as #22393.
In BookKeeper, apache/bookkeeper#4289 is pending release and apache/bookkeeper#4293 is pending review.

semistone (Author) commented

We are still trying to compare what is different between our producer and the perf tool.
We will follow up once we have a conclusion.

semistone (Author) commented May 2, 2024

@lhotari
We ran many tests.
The current broker setting is:

maxMessageSize=5242880

and the producer settings (small batches and a large max bytes):

batchingMaxMessages: 500
batchingMaxBytes: 3145728
batchingMaxPublishDelayMicros: 500

Payload sizes:
98% < 3 KB
2% between 10 and 20 KB

With these settings the error shows up and publish throughput is poor.

But if we change the settings to

batchingMaxMessages: 1000
batchingMaxBytes: 3145728
batchingMaxPublishDelayMicros: 1000

and filter out all payloads bigger than 15 KB,
then the error disappears.

So we decided to create
one batch producer to publish payloads < 15000 bytes
and one chunking producer to publish payloads >= 15000 bytes.
That worked, and performance is also better than in the previous test.

We still don't know why,
but at least we have a workaround now. A sketch of the split-producer setup is below.
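
For reference, a minimal sketch of the workaround (assuming the Pulsar Java client; topic, threshold, and batching values are taken from our tests above): one batching producer for small payloads and one chunking producer for large ones, since chunking and batching cannot be enabled on the same producer:

import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.Schema;

class SplitPublisher {
    private static final int CHUNK_THRESHOLD = 15_000;   // bytes; our own cutoff

    private final Producer<byte[]> batchProducer;
    private final Producer<byte[]> chunkProducer;

    SplitPublisher(PulsarClient client, String topic) throws PulsarClientException {
        // Batching producer for payloads < 15000 bytes
        this.batchProducer = client.newProducer(Schema.BYTES)
                .topic(topic)
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .batchingMaxMessages(1000)
                .batchingMaxBytes(3145728)
                .batchingMaxPublishDelay(1000, TimeUnit.MICROSECONDS)
                .create();

        // Chunking producer for payloads >= 15000 bytes
        // (chunking and batching cannot be enabled together)
        this.chunkProducer = client.newProducer(Schema.BYTES)
                .topic(topic)
                .enableBatching(false)
                .enableChunking(true)
                .create();
    }

    void publish(String key, byte[] payload) {
        Producer<byte[]> producer =
                payload.length < CHUNK_THRESHOLD ? batchProducer : chunkProducer;
        producer.newMessage().key(key).value(payload).sendAsync();
    }
}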

I don't know which batch producer configuration could fix these errors.
If you have any suggestions, we will try them.

We also publish from multi-threaded programs.
It seems the issue is not directly related to load but to payload size,
though it may be harder to reproduce when the publish rate is low.

We also tried to reproduce it with the perf tool, but it did not always happen.

Thanks

semistone (Author) commented May 8, 2024

I tried upgrading to BookKeeper 4.17.0,
but I still have the same issue :(

[pulsar@cockroach308 lib]$ ls |grep bookkeeper
org.apache.bookkeeper-bookkeeper-benchmark-4.17.0.jar
org.apache.bookkeeper-bookkeeper-common-4.17.0.jar
org.apache.bookkeeper-bookkeeper-common-allocator-4.17.0.jar
org.apache.bookkeeper-bookkeeper-perf-4.17.0.jar
org.apache.bookkeeper-bookkeeper-proto-4.17.0.jar
org.apache.bookkeeper-bookkeeper-server-4.17.0.jar
org.apache.bookkeeper-bookkeeper-slogger-api-4.17.0.jar
org.apache.bookkeeper-bookkeeper-slogger-slf4j-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-framework-4.17.0.jar
org.apache.bookkeeper-bookkeeper-tools-ledger-4.17.0.jar
org.apache.bookkeeper-circe-checksum-4.17.0.jar
org.apache.bookkeeper-cpu-affinity-4.17.0.jar
org.apache.bookkeeper.http-http-server-4.17.0.jar
org.apache.bookkeeper.http-vertx-http-server-4.17.0.jar
org.apache.bookkeeper-native-io-4.17.0.jar
org.apache.bookkeeper-statelib-4.17.0.jar
org.apache.bookkeeper.stats-bookkeeper-stats-api-4.17.0.jar
org.apache.bookkeeper.stats-codahale-metrics-provider-4.17.0.jar
org.apache.bookkeeper.stats-otel-metrics-provider-4.17.0.jar
org.apache.bookkeeper.stats-prometheus-metrics-provider-4.17.0.jar
org.apache.bookkeeper-stream-storage-cli-4.17.0.jar
org.apache.bookkeeper-stream-storage-java-client-4.17.0.jar
org.apache.bookkeeper-stream-storage-server-4.17.0.jar
org.apache.bookkeeper-stream-storage-service-api-4.17.0.jar
org.apache.bookkeeper-stream-storage-service-impl-4.17.0.jar
org.apache.bookkeeper.tests-stream-storage-tests-common-4.17.0.jar
org.apache.pulsar-pulsar-package-bookkeeper-storage-3.2.2.jar

lhotari (Member) commented May 8, 2024

We also tried to reproduce it with the perf tool, but it did not always happen.

@semistone Please share a way to reproduce it. It's not a problem if it's not always consistent; fixing this issue will be a lot easier if there's at least some way to reproduce it.

lhotari (Member) commented May 8, 2024

I tried upgrading to BookKeeper 4.17.0,
but I still have the same issue :(

@semistone Thanks for testing this.

semistone (Author) commented

We also tried to reproduce it with the perf tool, but it did not always happen.

@semistone Please share a way to reproduce it. It's not a problem if it's not always consistent; fixing this issue will be a lot easier if there's at least some way to reproduce it.

I will try to reproduce it with the perf tool.

lhotari (Member) commented May 8, 2024

@semistone since you have some way to reproduce this in your own tests, would you be able to test if this can be reproduced with dispatcherDispatchMessagesInSubscriptionThread=false?

pulsar/conf/broker.conf

Lines 435 to 436 in 80d4675

# Dispatch messages and execute broker side filters in a per-subscription thread
dispatcherDispatchMessagesInSubscriptionThread=true

It impacts this code:

// dispatch messages to a separate thread, but still in order for this subscription
// sendMessagesToConsumers is responsible for running broker-side filters
// that may be quite expensive
if (serviceConfig.isDispatcherDispatchMessagesInSubscriptionThread()) {
    // setting sendInProgress here, because sendMessagesToConsumers will be executed
    // in a separate thread, and we want to prevent more reads
    acquireSendInProgress();
    dispatchMessagesThread.execute(() -> {
        if (sendMessagesToConsumers(readType, entries, false)) {
            updatePendingBytesToDispatch(-size);
            readMoreEntries();
        } else {
            updatePendingBytesToDispatch(-size);
        }
    });
} else {
    if (sendMessagesToConsumers(readType, entries, true)) {
        updatePendingBytesToDispatch(-size);
        readMoreEntriesAsync();
    } else {
        updatePendingBytesToDispatch(-size);
    }
}

semistone (Author) commented

I can almost reproduce it with the perf tool:
when very few payloads are > 30 KB and the others are 3 KB,
the error happens with messageKeyGenerationMode=random;
without messageKeyGenerationMode, the error disappears.

I guess batch mode has some restriction on payload size.
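
In case it helps others reproduce it, a hedged Java sketch that approximates this scenario (random keys like messageKeyGenerationMode=random, mostly ~3 KB payloads with an occasional payload above 30 KB; the service URL, topic, and the 1% ratio are placeholders):

import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class MixedPayloadRepro {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")        // placeholder broker URL
                .build();
        Producer<byte[]> producer = client.newProducer(Schema.BYTES)
                .topic("persistent://public/default/repro")   // placeholder topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .create();

        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < 1_000_000; i++) {
            // roughly 1% of messages are larger than 30 KB, the rest about 3 KB
            int size = rnd.nextInt(100) == 0 ? 32 * 1024 : 3 * 1024;
            producer.newMessage()
                    .key(UUID.randomUUID().toString())        // random key per message
                    .value(new byte[size])
                    .sendAsync();
        }
        producer.flush();
        producer.close();
        client.close();
    }
}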

Let me confirm again tomorrow to make sure I didn't make any stupid mistake during my test.
