Skip to content
This repository has been archived by the owner on Apr 1, 2024. It is now read-only.

ISSUE-12716: Broker OOM and memory leak, possibly due to ledger loss #3247

Closed
sijie opened this issue Nov 10, 2021 · 0 comments
Closed

ISSUE-12716: Broker OOM and memory leak, possibly due to ledger loss #3247

sijie opened this issue Nov 10, 2021 · 0 comments

Comments

@sijie
Copy link
Member

sijie commented Nov 10, 2021

Original Issue: apache#12716


Hi,

Describe the bug

Recently encountered a memory leak in Pulsar Broker. I've tried to identify the issue with logs like below.

I've enabled io.netty.leakDetectionLevel up to paranoid and io.netty.leakDetection.targetRecords=40, didn't find any leak messages appear.

We have 4 pulsar clusters:

  • version 2.7.3
  • all running in Kubernetes 1.20. Only one cluster have encountered such issue.

The only difference of the cluster in question is that I've enabled Geo-Replication on all namespaces. So I disable it, but the memory leak remains.

14:44:01.497 [pulsar-io-23-2] ERROR org.apache.pulsar.PulsarBrokerStarter - -- Shutting down - Received OOM exception: failed to allocate 16777216 byte(s) of direct memory (used: 1073741824, max: 1073741824)
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 1073741824, max: 1073741824)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:776) ~[io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731) ~[io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:204) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:188) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:128) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:378) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187) ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:164) [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:158) [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187) [io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178) [io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53) [io.netty-netty-transport-native-unix-common-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
	at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114) [io.netty-netty-transport-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.channel.epoll.EpollRecvByteAllocatorHandle.allocate(EpollRecvByteAllocatorHandle.java:75) [io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:780) [io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425) [io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) [io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]

Screenshots

It is only a testing cluster with not much memory configured. But I believe the screenshots can show the leak.

Screen Shot 2021-11-10 at 9 35 54 AM

Screen Shot 2021-11-10 at 9 36 14 AM

Screen Shot 2021-11-10 at 9 35 41 AM

Screen Shot 2021-11-10 at 9 37 34 AM

Additional context

I might take a wild guess that the memory leak may come from Bookie side data loss.
I've found some error logs in it:

11:51:51.568 [BookieReadThreadPool-OrderedExecutor-3-0] INFO  org.apache.bookkeeper.proto.ReadEntryProcessorV3 - No ledger found while reading entry: 0 from ledger: 242603
11:51:52.828 [bookie-io-1-1] INFO  org.apache.bookkeeper.proto.BookieRequestHandler - Channels disconnected: [id: 0x5d119911, L:/10.119.139.211:3181 ! R:/10.119.131.152:34376]
11:52:10.560 [bookie-io-1-2] INFO  org.apache.bookkeeper.proto.AuthHandler - Authentication success on server side
11:52:10.560 [bookie-io-1-2] INFO  org.apache.bookkeeper.proto.BookieRequestHandler - Channel connected  [id: 0xb66b4346, L:/10.119.139.211:3181 - R:/10.119.164.211:54744]
11:52:11.149 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 242603
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 242603 not found
	at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
11:52:11.150 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while trying to read last entry: 242603
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 242603 not found
	at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.bookie.Bookie.readEntry(Bookie.java:1441) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:86) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
11:52:11.163 [BookieReadThreadPool-OrderedExecutor-3-0] INFO  org.apache.bookkeeper.proto.ReadEntryProcessorV3 - No ledger found while reading entry: 0 from ledger: 242603

This is because we've encountered some Kubernetes Node issue long times ago and some of the ledgers are unintentionally deleted.

And I can confirm that all the bookies are working just fine without any memory leak issue (except the above exceptions)

Tell me if there's any clue or anything I can do to diagnostic the memory leak.

Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant