[Bug] parseMessageMetadata error when broker entry metadata is enabled under high load #22601
Comments
@semistone Just wondering if this could be related to apache/bookkeeper#4196?
We are still trying to compare what the difference is between our producer and the perf tool.
@lhotari
With our producer settings (small batch messages and a big max-bytes limit) and our payload, it shows that error and publish throughput isn't good. We also filter out all data bigger than 15K bytes. We still don't know why, and we don't know which batch producer configuration could fix these errors. We also publish from multi-threaded programs. We tried to reproduce it with the perf tool, but it didn't always happen. Thanks.
I tried upgrading to BookKeeper 4.17.0.
@semistone Please share a way to reproduce it. It's not a problem if it's not always consistent. Fixing this issue will be a lot easier if there's at least some way to reproduce it.
@semistone Thanks for testing this.
I will try to reproduce it with the perf tool.
@semistone Since you have some way to reproduce this in your own tests, would you be able to test whether this can be reproduced with Lines 435 to 436 in 80d4675?
It impacts this code: Lines 659 to 681 in 188355b
I could almost reproduce it with the perf tool. I guess that in batch mode the payload size has some restriction. Let me confirm again tomorrow to make sure I didn't make any stupid mistake during my test.
Search before asking
Read release policy
Version
Minimal reproduce step
Publish events at about 6k QPS and 100 Mbit/s
with broker entry metadata enabled,
in BatcherBuilder.KEY_BASED mode,
and send messages from highly concurrent/parallel producer processes.
It happens only with a near-real-time consumer (almost zero backlog).
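For reference, a minimal sketch of a producer configured along the lines described above. The topic name, service URL, batch sizes, and payload size here are illustrative assumptions, not the reporter's actual values; `BatcherBuilder.KEY_BASED` and the batching knobs are standard Pulsar Java client `ProducerBuilder` settings:

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.BatcherBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class ReproProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumption: local broker
                .build();

        // Key-based batching with small batches but a large max-bytes limit,
        // matching the "small batch message and big max bytes" description.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/repro") // hypothetical topic
                .enableBatching(true)
                .batcherBuilder(BatcherBuilder.KEY_BASED)
                .batchingMaxMessages(10)              // assumed "small batch"
                .batchingMaxBytes(5 * 1024 * 1024)    // assumed "big max bytes"
                .batchingMaxPublishDelay(1, TimeUnit.MILLISECONDS)
                .create();

        // The reporter publishes from many concurrent threads; issuing
        // sendAsync from a tight loop (or several threads sharing the same
        // producer) approximates that load pattern.
        byte[] payload = new byte[10 * 1024];
        for (int i = 0; i < 1000; i++) {
            producer.newMessage().key("key-" + (i % 16)).value(payload).sendAsync();
        }
        producer.flush();
        client.close();
    }
}
```

This is a configuration sketch that requires a running broker; it is not a confirmed reproducer.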
What did you expect to see?
No lost events.
What did you see instead?
We see an error log in the broker showing
Failed to peek sticky key from the message metadata
It looks like a thread-safety issue, because it happens randomly.
In 1M events it happens only a few times, but the consumer will lose a few events.
Anything else?
The error is similar to
#10967, but I think it's a different issue.
The data in BookKeeper is correct:
I can download the event from BookKeeper and parse it successfully,
or consume the same event a few minutes later and it is consumed successfully.
But all subscriptions get the same error on the same event with a real-time consumer (zero backlog).
I have traced the source code.
It happens in
PersistentDispatcherMultipleConsumers.readEntriesComplete -> AbstractBaseDispatcher.filterEntriesForConsumer
-> Commands.peekAndCopyMessageMetadata
I also printed the ByteBuf contents,
and I could clearly see the data isn't the same as what is stored in BookKeeper.
In a normal event, the hex dump usually starts with 010e (magicCrc32c).
In one of our error events, the ByteBuf had about 48 bytes of strange data, then continued with normal data.
This is just one example; sometimes the first few bytes are correct and something is wrong a few bytes later.
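One way to narrow down where the corruption appears is to check, before parsing, whether an entry begins with the expected 2-byte magic. The sketch below is hypothetical helper code (the class and method names are not from Pulsar), uses stdlib java.nio.ByteBuffer in place of Netty's ByteBuf, and assumes the entry header starts with the 0x0e01 value of Pulsar's Commands.magicCrc32c constant:

```java
import java.nio.ByteBuffer;

public class EntryHeaderCheck {
    // Pulsar's Commands.magicCrc32c value; serialized big-endian as 0x0e 0x01.
    static final short MAGIC_CRC32C = 0x0e01;

    /**
     * Returns true if the buffer, read from its current position, starts with
     * the expected 2-byte magic. Uses an absolute get so the buffer's position
     * is not advanced, mirroring how a dispatcher-side "peek" must leave the
     * reader index untouched.
     */
    static boolean startsWithMagic(ByteBuffer buf) {
        if (buf.remaining() < 2) {
            return false;
        }
        return buf.getShort(buf.position()) == MAGIC_CRC32C;
    }

    public static void main(String[] args) {
        ByteBuffer good = ByteBuffer.wrap(new byte[] {0x0e, 0x01, 0x12, 0x34});
        ByteBuffer bad  = ByteBuffer.wrap(new byte[] {0x00, 0x00, 0x0e, 0x01});
        System.out.println(startsWithMagic(good)); // prints true
        System.out.println(startsWithMagic(bad));  // prints false
    }
}
```

Logging a short hex prefix of any entry that fails such a check (along with the ledger/entry id) could help correlate the corrupted reads with specific entries.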
I am still trying to debug when and how the ByteBuf returns incorrect data, and why it only happens during stress testing. It is still not easy to reproduce using the perf tool, but we can 100% reproduce it in our producer code.
Does anyone have any idea what could be causing this issue, and any suggestions on which library or class may have potential issues? Additionally, any suggestions on how to debug this issue or if I need to print any specific information to help identify the root cause would be appreciated. Thank you.
Are you willing to submit a PR?