[2.8.1] FGC and throw NPE #14268

Closed
lordcheng10 opened this issue Feb 14, 2022 · 12 comments
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@lordcheng10
Contributor

lordcheng10 commented Feb 14, 2022

Describe the bug
The symptoms of our production issue are as follows:
1. NPE:
03:22:50.442 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
at org.apache.bookkeeper.mledger.impl.OpAddEntry.addComplete(OpAddEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.8.1.2.jar:2.8.1.2]
at
2. Frequent full GC. I dumped the heap and found a very large number of pendingAddOp objects (more than seven million), even though the write traffic had dropped to almost nothing:
[screenshot: heap dump showing the pendingAddOp objects]

The one that occupies the most heap memory is pendingAcks:
[screenshot: heap histogram dominated by pendingAcks]

3. Looking at the logs, some thread pool tasks take too long to execute, even reaching the hour level:
03:23:47.838 [bookkeeper-ml-scheduler-OrderedScheduler-54-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433@2fc9a98e:class org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433 took too long 12721684 micros to execute.
03:24:53.508 [bookkeeper-ml-scheduler-OrderedScheduler-32-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable@3f387130:class org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable took too long 16594876 micros to execute.
03:26:02.079 [bookkeeper-ml-scheduler-OrderedScheduler-35-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$92/1773008684@4ddf5185:class org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$92/1773008684 took too long 17111879 micros to execute.

4. ZooKeeper session timeout:
[screenshot: ZooKeeper session timeout log]

I noticed there is this PR that may be related to this issue: #12993

lordcheng10 added the type/bug label on Feb 14, 2022
@lordcheng10
Contributor Author

@eolivelli @codelipenghui @michaeljmarshall PTAL, thanks!

@lordcheng10
Contributor Author

lordcheng10 commented Feb 14, 2022

Similar to this issue: #9234
03:23:47.838 [bookkeeper-ml-scheduler-OrderedScheduler-54-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433@2fc9a98e:class org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433 took too long 12721684 micros to execute.

@lhotari
Member

lhotari commented Feb 15, 2022

Unexpected throwable caught
java.lang.NullPointerException: null
at org.apache.bookkeeper.mledger.impl.OpAddEntry.addComplete(OpAddEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.8.1.2.jar:2.8.1.2]

This looks like a thread safety issue. There's #12606 which improves the situation slightly. The solution in #11387 would help address such issues in a better way, although it still has some gaps and possible issues.
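To illustrate the pattern (a hypothetical sketch with made-up names, not the actual OpAddEntry code): the add-complete callback dereferences a field that another thread may already have cleared, e.g. when the op object is recycled.

// Hypothetical illustration of the race only; class, field and method names are made up.
class PendingAddOp {
    volatile ManagedLedger ml;            // cleared when the op is recycled

    // Invoked by the BookKeeper client callback thread when the write completes.
    void addComplete(long entryId) {
        // If another thread recycled this op (ml = null) after the write finished
        // but before this callback ran, the dereference below throws an NPE.
        ml.notifyEntryPersisted(entryId);
    }

    // Invoked on a different thread, e.g. during failure/timeout cleanup.
    void recycle() {
        ml = null;                        // races with an in-flight addComplete
    }
}

interface ManagedLedger {
    void notifyEntryPersisted(long entryId);
}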

@lordcheng10
Contributor Author

lordcheng10 commented Feb 15, 2022

Unexpected throwable caught
java.lang.NullPointerException: null
at org.apache.bookkeeper.mledger.impl.OpAddEntry.addComplete(OpAddEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.8.1.2.jar:2.8.1.2]

This looks like a thread safety issue. There's #12606 which improves the situation slightly. The solution in #11387 would help address such issues in a better way, although it still has some gaps and possible issues.

@lhotari
The most confusing thing is why the full GC is triggered, and why the thread pool execution time reaches the hour level:
04:35:46.571 [bookkeeper-ml-scheduler-OrderedScheduler-18-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.util.SafeRun$1@89aa735:class org.apache.bookkeeper.mledger.util.SafeRun$1 took too long 13450016 micros to execute.

I even looked at one of these tasks and found that it only reads and writes ZooKeeper, and the ZooKeeper timeout is only 30 seconds, so I can't understand why it reaches the hour level:
[screenshot: the task in question]

@lhotari
Member

lhotari commented Feb 15, 2022

I even looked at one of these tasks and found that it only reads and writes ZooKeeper, and the ZooKeeper timeout is only 30 seconds, so I can't understand why it reaches the hour level:

What do you mean by "reaches the hour level"? 13450016 microseconds is 13.45 seconds.
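For reference, converting the values quoted from the logs:

13450016 µs ÷ 1,000,000 ≈ 13.45 s
17111879 µs ÷ 1,000,000 ≈ 17.11 s (the largest value among the warnings above)

so the slowest reported runnables took on the order of ten-plus seconds, not hours.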

@lhotari
Member

lhotari commented Feb 15, 2022

The most confusing thing is why the full GC is triggered

By FGC, do you mean "Full GC"? The GC log tells you why each Full GC was triggered. GC logging was added by #7498. Have you checked the GC log entries?
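For reference, GC-cause logging is enabled with standard JVM options along these lines (an illustrative sketch; the exact flags added by #7498 may differ, and the log path is a placeholder):

JDK 8:   -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/path/to/gc.log
JDK 11+: -Xlog:gc*,safepoint:file=/path/to/gc.log:time,uptime:filecount=10,filesize=20M

With either setting, every Full GC entry records its cause, e.g. "Full GC (Allocation Failure)".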

@lhotari
Member

lhotari commented Feb 15, 2022

I noticed there is this PR that may be related to this issue: #12993

Yes, #12993 could help address the NPE. Have you tested the behavior with Pulsar 2.8.2, which includes #12993?

@lordcheng10
Contributor Author

What do you mean by "reaches the hour level"? 13450016 microseconds is 13.45 seconds.

Aha, I was wrong, it was 13.45 seconds.

@lordcheng10
Contributor Author

I noticed there is this PR that may be related to this issue: #12993

Yes, #12993 could help address the NPE. Have you tested the behavior with Pulsar 2.8.2, which includes #12993?

I haven't tested 2.8.2; I only found the NPE in 2.8.1.

@lordcheng10
Contributor Author

lordcheng10 commented Feb 15, 2022

The most confusing thing is why the full GC is triggered

By FGC, do you mean "Full GC"? The GC log tells you why each Full GC was triggered. GC logging was added by #7498. Have you checked the GC log entries?

The GC log of the node in question was not saved. I found some Full GC entries in the logs of other production nodes, but those nodes are not doing Full GCs continuously:
2022-02-15T12:45:47.604+0800: 423715.809: [Full GC (Allocation Failure) 23G->10G(24G), 11.4667426 secs]
[Eden: 0.0B(12.0G)->0.0B(12.0G) Survivors: 0.0B->0.0B Heap: 23.9G(24.0G)->10.6G(24.0G)], [Metaspace: 65542K->65542K(1114112K)]
[Times: user=20.56 sys=0.00, real=11.47 secs]
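Reading that entry: the cause was an Allocation Failure (an allocation could not be satisfied, so the JVM fell back to a stop-the-world full collection); heap occupancy dropped from 23.9G to 10.6G of the 24G capacity; and the pause lasted about 11.47 seconds of wall-clock time (real=11.47 secs), i.e. a pause measured in seconds, not hours.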

The JVM monitoring of the problem broker is as follows:
[screenshot: JVM memory monitoring of the problem broker]

It seems there is a memory leak: old generation usage keeps increasing, but the node's read and write traffic has not increased. @lhotari
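If this needs to be confirmed on another node, the standard JDK tools can show what is filling the old generation (a sketch assuming a HotSpot-based JVM; <broker-pid> and the dump path are placeholders):

jstat -gcutil <broker-pid> 5000
    # samples GC utilization every 5 s; a steadily growing O (old gen) column with a rising FGC count points at a leak
jmap -histo:live <broker-pid> | head -30
    # live-object class histogram; the pendingAddOp/OpAddEntry-related classes should dominate if the heap dump above is representative
jmap -dump:live,format=b,file=/tmp/broker-heap.hprof <broker-pid>
    # full heap dump for offline analysis in Eclipse MAT or VisualVM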

@lordcheng10
Contributor Author

lordcheng10 commented Feb 15, 2022

I uploaded the dump file to Baidu Cloud Disk, which can be downloaded here:
https://pan.baidu.com/s/1qNMQFS5QtpkwcKpOlK5HiA
Extraction code: cvgr @lhotari

@codelipenghui
Contributor

Closed by #14515.
