Multiple OOM encountered on benchmark cluster #8509

Closed
korthout opened this issue Jan 3, 2022 · 39 comments
Labels
area/reliability · kind/bug · scope/broker · severity/high · version:1.3.9 · version:8.1.0-alpha2 · version:8.1.0

Comments

korthout (Member) commented Jan 3, 2022

Describe the bug

The benchmark cluster for branch release-1.3.0 experienced multiple Out Of Memory (OOM) errors.

This is a potential regression, although the issue has likely existed for longer. Note that the resources for the benchmark project were reduced recently. See #8268

Occurrences

zeebe-2 @ 2021-12-27 ~11:21:45

Only a small dip in processing throughput
[screenshot]

GC briefly spiked and then dropped
[screenshot]

Simultaneously, JVM memory usage increased from a maximum of ~200MB to spikes above 500MB, and direct buffer pool memory usage doubled in this short window from ~400MB to ~860MB.
[screenshot]

During this time, RocksDB memory usage was similar to before: ~500MB per partition.
[screenshot]

Install requests were frequently sent 🤔
[screenshot]

It had just transitioned to INACTIVE and closed the database when it started to transition to FOLLOWER.
Soon after it opened the database, it stopped.

2021-12-27 11:21:22.684 CET "Transition to INACTIVE on term 12 completed" 
2021-12-27 11:21:22.734 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'." 
2021-12-27 11:21:22.784 CET "Committed new snapshot 289647968-12-833412599-833412170" 
2021-12-27 11:21:22.785 CET "Deleting previous snapshot 289590585-12-833268476-833246498" 
2021-12-27 11:21:22.787 CET "Scheduling log compaction up to index 289647968" 
2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}" 
2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '289645353') and replace with received snapshot (index '289647968')" 
2021-12-27 11:21:22.816 CET "Transition to FOLLOWER on term 12 requested." 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ExporterDirector" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing SnapshotDirector" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing StreamProcessor" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing QueryService" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ZeebeDb" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStream" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStorage" 
2021-12-27 11:21:22.817 CET "Preparing transition from INACTIVE on term 12 completed" 
2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 starting" 
2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 - transitioning LogStorage" 
2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning LogStream" 
2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-2-LogStream-2{status=HEALTHY}]" 
2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning ZeebeDb" 
2021-12-27 11:21:22.818 CET "Partition-2 recovered, marking it as healthy" 
2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-2-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]" 
2021-12-27 11:21:22.818 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}" 
2021-12-27 11:21:22.915 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'." 
2021-12-27 11:21:22.915 CET "Transition to FOLLOWER on term 12 - transitioning QueryService" 
2021-12-27 11:21:22.916 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@649caa67)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2afb9843, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@6c21e905, configuration: Configuration(false)]" 
2021-12-27 11:21:22.916 CET "Transition to FOLLOWER on term 12 - transitioning StreamProcessor"  
2021-12-27 11:21:22.965 CET "request [POST http://elasticsearch-master:9200/_bulk] returned 1 warnings: [299 Elasticsearch-7.16.2-2b937c44140b6559905130a8650c64dbd0879cfb "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.16/security-minimal-setup.html to enable security."]"  
2021-12-27 11:21:25.069 CET  ++ hostname -f

zeebe-2 @ 2021-12-28 ~09:19:45, followed by zeebe-1 @ 2021-12-28 ~09:25:15

Just before the OOM, the starter and worker restarted, which might explain the loss of processing throughput.
[screenshot]

Zeebe-2 restarted at ~09:19:45, so the OOM should have happened just before that.
[screenshot]

Zeebe 2
If we filter on that pod alone, we see that it was actually processing as leader for a short time just before the OOM.
[screenshot]

GC is much quieter here before the OOM. JVM memory usage is about 600MB and direct buffer pool memory has again just increased to ~860MB (just like before). RocksDB is still stable at ~500MB per partition (no screenshot added).
[screenshot]

Zeebe 2 did not produce any interesting logs, as far as I could tell.

Zeebe 1
Zeebe-1 also does some processing as leader shortly before its OOM, ~5 min after zeebe-2 crashed.
[screenshot]

Zeebe-1 looks a lot like zeebe-2 when we look at the memory decomposition. Note the increase in direct buffer pool memory just before the OOM, like in the other cases.
[screenshot]

The partitions fully recovered, but about 1m30s after a snapshot was committed, an actor appeared blocked, which means the health tick was no longer updated. Directly after this, the pod died.

2021-12-28 09:22:16.229 CET "Partition-2 recovered, marking it as healthy"
2021-12-28 09:22:16.229 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:22:16.667 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=HEALTHY}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
2021-12-28 09:22:16.668 CET "Partition-1 recovered, marking it as healthy"
2021-12-28 09:22:16.668 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Broker-1-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:22:21.703 CET "Taking temporary snapshot into /usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654."
2021-12-28 09:22:21.907 CET "Finished taking temporary snapshot, need to wait until last written event position 1034640293 is committed, current commit position is 1034640235. After that snapshot will be committed."
2021-12-28 09:22:21.933 CET "Current commit position 1034640293 >= 1034640293, committing snapshot FileBasedTransientSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654, checksum=890364779, metadata=FileBasedSnapshotMetadata{index=359974668, term=18, processedPosition=1034639769, exporterPosition=1034638654}}."
2021-12-28 09:22:21.941 CET "Committed new snapshot 359974668-18-1034639769-1034638654"
2021-12-28 09:22:21.942 CET "Deleting previous snapshot 359646996-17-1033697878-1033694056"
2021-12-28 09:22:21.947 CET "Scheduling log compaction up to index 359974668"
2021-12-28 09:22:21.951 CET "raft-partition-partition-3 - Deleting log up from 359633252 up to 359947572 (removing 21 segments)"
2021-12-28 09:22:32.628 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.408 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.848 CET "Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
2021-12-28 09:23:41.852 CET "Partition-1 failed, marking it as unhealthy: Broker-1{status=HEALTHY}"
2021-12-28 09:23:41.852 CET "Detected 'UNHEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.861 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, Broker-1-Exporter-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-1-LogStream-2{status=HEALTHY}, Broker-1-SnapshotDirector-2{status=HEALTHY}, Broker-1-StreamProcessor-2{status=HEALTHY}]"
2021-12-28 09:23:41.861 CET "Partition-2 recovered, marking it as healthy"
2021-12-28 09:23:41.861 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
2021-12-28 09:24:11.884 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-StreamProcessor-3{status=UNHEALTHY, issue='actor appears blocked'}, ZeebePartition-3{status=HEALTHY}, Broker-1-Exporter-3{status=HEALTHY}, raft-partition-partition-3{status=HEALTHY}, Broker-1-LogStream-3{status=HEALTHY}, Broker-1-SnapshotDirector-3{status=HEALTHY}]"
2021-12-28 09:24:39.044 CET ++ hostname -f

zeebe-2 @ 2021-12-28 ~22:50:00

Again only a small dip in processing throughput (nice and quick failover 🚀).
[screenshot]

Zeebe-2 was leader and processing before OOM
[screenshot]

Interestingly, the logs just before the restart of zeebe-2 at this time are practically identical to the logs of zeebe-2 from the first OOM (the day before, on the 27th).

Zeebe-2 had just transitioned to INACTIVE and closed the database. It was transitioning to FOLLOWER again, and just after it opened the database it was transitioning the StreamProcessor, which is the same transition it OOM-ed at the day before.

2021-12-28 22:49:37.461 CET "Transition to INACTIVE on term 16 completed"
2021-12-28 22:49:37.537 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
2021-12-28 22:49:37.624 CET "Committed new snapshot 397383357-16-1142860979-1142860461"
2021-12-28 22:49:37.625 CET "Deleting previous snapshot 397252741-16-1142453760-1142669843"
2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
2021-12-28 22:49:37.631 CET "Scheduling log compaction up to index 397383357"
2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Delete existing log (lastIndex '397286333') and replace with received snapshot (index '397383357')"
2021-12-28 22:49:37.670 CET "Transition to FOLLOWER on term 16 requested."
2021-12-28 22:49:37.670 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ExporterDirector"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing SnapshotDirector"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing StreamProcessor"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing QueryService"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ZeebeDb"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStream"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStorage"
2021-12-28 22:49:37.671 CET "Preparing transition from INACTIVE on term 16 completed"
2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 starting"
2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 - transitioning LogStorage"
2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning LogStream"
2021-12-28 22:49:37.672 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-2-LogStream-1{status=HEALTHY}]"
2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning ZeebeDb"
2021-12-28 22:49:37.672 CET "Partition-1 recovered, marking it as healthy"
2021-12-28 22:49:37.673 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Broker-2-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 22:49:37.673 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
2021-12-28 22:49:37.837 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
2021-12-28 22:49:37.838 CET "Transition to FOLLOWER on term 16 - transitioning QueryService"
2021-12-28 22:49:37.840 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@2465e772)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2495802c, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@263ffbf0, configuration: Configuration(false)]"
2021-12-28 22:49:37.841 CET "Transition to FOLLOWER on term 16 - transitioning StreamProcessor"
2021-12-28 22:49:39.701 CET ++ hostname -f

Looking at the logs from before that time, the partition keeps transitioning between FOLLOWER and INACTIVE and back for a long period (at least multiple hours). It's in a loop:

2021-12-28 16:36:14.030 CET partition-3 "Transition to LEADER on term 21 requested."
2021-12-28 16:36:14.127 CET partition-3 "Transition to LEADER on term 21 completed"
2021-12-28 16:36:19.476 CET partition-2 "Transition to LEADER on term 19 requested."
2021-12-28 16:36:19.590 CET partition-2 "Transition to LEADER on term 19 completed"
2021-12-28 16:44:13.078 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:44:13.084 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:44:13.301 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:44:13.514 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 16:54:13.701 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:54:13.705 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:54:13.987 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:54:14.206 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 16:59:14.028 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:59:14.032 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:59:14.294 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:59:14.541 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:04:14.683 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:04:14.687 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:04:15.346 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:04:15.545 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:09:15.002 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:09:15.006 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:09:15.233 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:09:15.492 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:14:15.248 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:14:15.253 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:14:15.631 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:14:15.891 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:19:15.953 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:19:15.956 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:19:16.219 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:19:16.428 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:24:15.936 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:24:15.940 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:24:16.216 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:24:16.425 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:29:17.013 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:29:17.016 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:29:17.265 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:29:17.482 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:34:17.042 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:34:17.046 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:34:17.380 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:34:17.585 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:39:17.481 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:39:17.484 CET partition-1 "Transition to INACTIVE on term 16 completed"
.... and so on, until 22:50:00

This also happened the day before: https://cloudlogging.app.goo.gl/7qpb4Rammh11eqYh6

Hypothesis
Looking at the above cases, it seems that a partition gets stuck in a transition loop between FOLLOWER and INACTIVE. Perhaps we have a memory leak in transitions.

@korthout korthout added the kind/bug, scope/broker, blocker/info, and Impact: Regression labels Jan 3, 2022
@npepinpe npepinpe added this to Planned in Zeebe Jan 4, 2022
npepinpe (Member) commented Jan 4, 2022

@romansmirnov @deepthidevaki - could this be related to #7992 ?

korthout (Member, Author) commented Jan 4, 2022

Happened again 2022-01-04 ~07:58:05 for zeebe-1, see https://cloudlogging.app.goo.gl/uYTT7vjcAUdKyzb9A for the moment of restart.

When you look at the "Transition to" logs, it's clear that all three brokers continuously transition between FOLLOWER and INACTIVE. Is that expected behavior?

EDIT: Briefly discussed with @Zelldon; he mentioned that this happens for install requests (i.e. receiving a snapshot). So these logs indicate that the followers receive new snapshots every 5 minutes. That's not optimal, but also not wrong.

korthout (Member, Author) commented Jan 4, 2022

Please note that this transitioning happens as well on our long running benchmarks for 1.3.0-alpha1, 1.3.0-alpha2 and another release-1.3.0 build, but not on the long-running-v1-for-minor-updates benchmark (which is currently running 1.2.9).

This indicates that something was introduced between 1.2.x and 1.3.0-alpha1 that makes followers lag behind. Note that the reduction of benchmark resources was done after 1.3.0-alpha2. See related logs. I've also checked that this log entry exists in the 1.2.x versions and would be logged if this transition happened.

@npepinpe That would be a regression in 1.3.0, is that blocking the release in your opinion?

EDIT: Deeper investigation shows that older versions are also sending install requests at a similar rate, but they just don't use the same transition logic and so don't log this Transition to line. @oleschoenburg told me that there is a configuration setting (default at 100) which determines whether to replicate log entries or the snapshot. Due to the throughput on the benchmarks, followers are generally lagging behind by a few thousand records, i.e. a leader produces about 2500 records per second when it's doing 200 simple PI/s. It seems this is simply misconfigured for our benchmarks. Likely this should be set to <number of seconds allowed to lag> * <number of records produced per partition per second>. I won't consider this a regression and I'll continue with the release. Sorry for the confusion and for conflating this issue with an unrelated problem.
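For a rough sense of scale (a back-of-the-envelope calculation, assuming the threshold is counted in records): at ~2500 records/s per partition, allowing for example 60 seconds of follower lag would correspond to a threshold around 2500 * 60 = 150000 records, orders of magnitude above the default of 100.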

deepthidevaki (Contributor) commented:

@romansmirnov @deepthidevaki - could this be related to #7992 ?

Not sure. Is a heap dump available?

korthout (Member, Author) commented Jan 4, 2022

No heap dumps were created

deepthidevaki (Contributor) commented:

No heap dumps were created

Was it then out of direct memory? Then it might be a different problem.

npepinpe (Member) commented Jan 4, 2022

IIRC it was a Kubernetes OOM, so it's not the JVM that crashed but the scheduler that killed the container.

korthout (Member, Author) commented Jan 7, 2022

These OOMs also occur on the benchmarks medic-cw-51-975b33b9e9-benchmark and medic-cw-01-7841f75abe-benchmark. So, IMO we no longer need the release-1-3-0 benchmark for this investigation, and I'd like to delete it. We would normally have deleted this benchmark as part of the post-release process, but I skipped that because we still needed to do this investigation.

Note that the release-1-3-0 benchmark also exists in the long-running-cluster, but that one does not suffer from OOMs.

If anyone is against deleting the release-1-3-0 benchmark from the zeebe-cluster, please respond before or during the medic handover. Otherwise, I'll delete it after the handover.

korthout (Member, Author) commented Jan 7, 2022

Benchmark release-1-3-0 is deleted.

@romansmirnov romansmirnov self-assigned this Jan 18, 2022
@romansmirnov romansmirnov moved this from Planned to In progress in Zeebe Jan 18, 2022
romansmirnov (Member) commented Jan 21, 2022

There is a memory leak! When the broker transitions between roles, it stops running services and starts new ones (depending on the old and new role). This includes stopping and starting the LogStream, whose dispatcher allocates a direct ByteBuffer.

However, when the ByteBuffer is dereferenced, the direct memory is not freed immediately. Meaning, unreachable DirectByteBuffer instances may not have been collected yet, so the direct memory is not released, and it is not deterministic when the allocated direct memory is actually freed.
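As a minimal illustration of the difference between dropping the reference and releasing eagerly (a sketch, not the actual dispatcher code; it assumes Agrona's BufferUtil.free, which Zeebe's buffer stack already depends on):

import java.nio.ByteBuffer;
import org.agrona.BufferUtil;

public final class DirectBufferRelease {

  public static void main(final String[] args) {
    // Variant 1: drop the reference and rely on the GC. The 32MB of native
    // memory stays reserved until a later GC cycle happens to collect the
    // DirectByteBuffer object, which is non-deterministic and may take a long
    // time under low heap pressure.
    ByteBuffer buffer = ByteBuffer.allocateDirect(32 * 1024 * 1024);
    buffer = null;

    // Variant 2: release eagerly. BufferUtil.free invokes the buffer's
    // cleaner and returns the native memory immediately. After this call the
    // buffer must never be touched again, otherwise the JVM can crash.
    final ByteBuffer other = ByteBuffer.allocateDirect(32 * 1024 * 1024);
    BufferUtil.free(other);
  }
}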

Basically, the role changes happen whenever there is a role change in the Raft layer, or when a new snapshot must be installed. This results in increased allocated direct memory:

[screenshot]

This is also visible in the Process Memory Usage

[screenshot]

The growing allocated direct memory correlates with the number of installed snapshots on a broker:

[screenshot]

Unfortunately, I was not able to reproduce the OOM, but I am quite confident that releasing allocated direct memory is one part of this issue. Meaning, if there are a lot of role changes, the number of unreachable direct buffers grows, but they are not garbage collected and the direct memory is not freed. In this example, there are 10 direct buffers where only 3 would be expected:

[screenshot]

When explicitly releasing the allocated memory, the allocated direct memory stays at a constant size (~400MB):

[screenshot]

Also, the Process Memory Usage is quite constant and does not grow with every installed snapshot:

[screenshot]

Also, the number of Direct Buffers in the heap dump is constant

[screenshot]

romansmirnov (Member) commented:

Just for reference: in the heap dump it is possible to execute an OQL (Object Query Language) query. The following query helps to troubleshoot memory leaks in direct memory:

SELECT x, x.capacity, x.position, x.limit FROM java.nio.DirectByteBuffer x WHERE ((x.capacity > (1024 * 1024)) and (x.cleaner != null))

It shows all DirectByteBuffer instances larger than 1MB that own their memory (cleaner != null), together with their capacity, current position, and limit.
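Besides heap dumps, the same numbers can also be sampled at runtime via the JDK's BufferPoolMXBean (presumably this is where the "direct buffer pool" panels in the screenshots come from; a minimal sketch):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public final class BufferPoolStats {

  public static void main(final String[] args) {
    // The JVM exposes one pool for direct buffers and one for mapped buffers.
    for (final BufferPoolMXBean pool :
        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
      System.out.printf(
          "%s: count=%d, used=%dMB, capacity=%dMB%n",
          pool.getName(),
          pool.getCount(),
          pool.getMemoryUsed() / (1024 * 1024),
          pool.getTotalCapacity() / (1024 * 1024));
    }
  }
}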

npepinpe (Member) commented:

Related, interesting bit from Netty's documentation about their own ByteBuf allocator:

When a new java.nio.ByteBuffer is allocated, its content is filled with zeroes. This "zeroing" consumes CPU cycles and memory bandwidth. Normally, the buffer is then immediately filled from some data source, so the zeroing did no good.

To be reclaimed, java.nio.ByteBuffer relies on the JVM garbage collector. It works OK for heap buffers, but not direct buffers. By design, direct buffers are expected to live a long time. Thus, allocating many short-lived direct NIO buffers often causes an OutOfMemoryError. Also, deallocating a direct buffer explicitly using the (hidden, proprietary) API isn't very fast.

Zelldon (Member) commented Jan 21, 2022

Do you think it might make more sense to reuse the dispatcher and just reset/clear it on transitions?

npepinpe (Member) commented Jan 21, 2022

Possibly - I don't know how resilient our dispatcher currently is, but if we assume we just reset the in-memory properties and zero the buffer, then that's probably still faster and less memory-intensive than freeing the buffer, allocating a new one, and zeroing it (which the JVM does).

At the same time, we know the pitfalls that come with reusing resources, so we'd have to make sure the reset/clear works correctly 😄
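To make the idea concrete, a rough sketch of what "reuse and reset" could look like (assuming Agrona's UnsafeBuffer; illustrative only, not the dispatcher's actual API):

import java.nio.ByteBuffer;
import org.agrona.concurrent.UnsafeBuffer;

final class ReusableDispatcherBuffer {

  private final ByteBuffer byteBuffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
  private final UnsafeBuffer buffer = new UnsafeBuffer(byteBuffer);

  // Called on a role transition instead of freeing and reallocating the buffer.
  void reset() {
    // Zero the whole buffer so no stale records can be read later.
    buffer.setMemory(0, buffer.capacity(), (byte) 0);
    // ...plus resetting publisher/subscriber positions, which is exactly the
    // part that would have to be proven correct before reuse is safe.
  }
}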

romansmirnov (Member) commented:

@npepinpe, thanks for your input.

I also read Netty's comment about zeroing and dug a bit into it. But I don't see this as an issue in our case, because when doing the transitions between roles, Zeebe is not on the critical path (or data path). Of course, the transitions should happen quickly and not take ages, especially when Zeebe transitions to the LEADER role, so that it can start processing quickly in a failover scenario. But as long as the cluster runs stably, direct memory allocation shouldn't be a problem, because Zeebe does not allocate direct memory while processing as the LEADER of a partition. The creation/closing of the dispatcher happens only during transitions.

That's why I would like to keep the scope on solving the "memory leak" in the direct memory by releasing the direct memory when closing the dispatcher (and keeping the performance topic out of scope for now). That way, a Zeebe broker node can "survive" many role transitions in a short timeframe caused by

  1. Leadership changes caused by disruptive brokers in the cluster
  2. Receiving many snapshots as a FOLLOWER

When releasing the direct memory, one issue arises: other components (like the StreamProcessor or Command Request Handler) must not try to write to (or read from) the direct memory afterwards; otherwise, they would access an illegal address and the JVM would crash. My current approach would be to ensure that all relevant components are closed/notified about the closing of the dispatcher before the actual close happens. Alternatively, the dispatcher is only opened when transitioning to the LEADER state.
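A sketch of that ordering (the writer interface and method names here are hypothetical placeholders, not Zeebe's actual types; it only illustrates "close all writers first, free the buffer last"):

import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.agrona.BufferUtil;

final class DispatcherCloseSequence {

  interface DispatcherWriter {
    CompletableFuture<Void> closeAsync();
  }

  static CompletableFuture<Void> closeAndFree(
      final List<DispatcherWriter> writers, final ByteBuffer dispatcherBuffer) {
    // 1. Ask every writing component (StreamProcessor, command handlers, ...)
    //    to stop using the dispatcher.
    final CompletableFuture<?>[] closed =
        writers.stream().map(DispatcherWriter::closeAsync).toArray(CompletableFuture[]::new);

    // 2. Only after all of them have completed, release the direct memory.
    //    Freeing earlier risks a native crash (SIGSEGV/SIGBUS) on a late write.
    return CompletableFuture.allOf(closed)
        .thenRun(() -> BufferUtil.free(dispatcherBuffer));
  }
}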

@npepinpe, please let me know if you want to discuss this issue.

npepinpe (Member) commented Jan 24, 2022

That's fine, in my opinion we're still at a stage where correctness trumps performance most of the time. And as you mentioned, I doubt the performance gain/loss is noticeable anyway.

Regarding the second point, since there's no way to recover from or handle a SIGSEGV or SIGBUS, I'd like to have the strongest possible guarantees that we don't try to read from/write to freed memory. Are we confident that we can guarantee all components are closed before freeing the buffer? Can we offer stronger guarantees than that? Possibly not, but it doesn't hurt to spend a bit of time exploring our options, because ensuring all relevant components are closed/notified is hard to do, and especially hard to future-proof, in general (although maybe I misunderstood your proposal). OTOH, could we potentially delegate the task of writing to the buffer to the dispatcher? E.g. claim a segment of memory, then pass a BufferWriter to the dispatcher, such that writing also happens on the dispatcher's actor? This is probably less performant, but it would give a single point of control over the buffer, which may result in fewer errors.
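Roughly what the "write via the dispatcher's actor" idea could look like (all names and interfaces here are hypothetical, only to illustrate the single point of control; a plain Executor stands in for the dispatcher's actor):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import org.agrona.MutableDirectBuffer;
import org.agrona.concurrent.UnsafeBuffer;

final class ActorOwnedDispatcher {

  interface BufferWriter {
    int getLength();

    void write(MutableDirectBuffer buffer, int offset);
  }

  private final Executor dispatcherActor;
  private final UnsafeBuffer buffer;
  private int writeOffset;

  ActorOwnedDispatcher(final Executor dispatcherActor, final UnsafeBuffer buffer) {
    this.dispatcherActor = dispatcherActor;
    this.buffer = buffer;
  }

  // The only code path that touches the underlying memory runs on the
  // dispatcher's actor, so closing/freeing can be serialized with all writes.
  CompletableFuture<Integer> write(final BufferWriter writer) {
    return CompletableFuture.supplyAsync(
        () -> {
          final int offset = writeOffset;
          writer.write(buffer, offset);
          writeOffset += writer.getLength();
          return offset;
        },
        dispatcherActor);
  }
}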

💭 At the same time, a SIGSEGV or SIGBUS on the dispatcher will not cause any permanent issues (e.g. data corruption/loss), so I suppose it's not the worst thing that can happen, compared to reusing the same dispatcher and potentially writing the wrong things and causing corruption.

If you want, we can discuss this or brainstorm a solution - I'm free tomorrow afternoon. If that's blocking you, you can always grab someone else from the team, like Ole or Chris.

romansmirnov (Member) commented:

Just a quick summary: there are multiple "components" that write to the dispatcher, and only one "component" that reads from it.

Who reads from the dispatcher?

  • LogStorageAppender: Reads the blocks written to the dispatcher and appends them to the log storage.

Who writes to the dispatcher?

  • StreamProcessor: When the stream processor processes events from the logstream, it may write follow-up events, etc. Those events are written to the dispatcher (and are then read by the log storage appender).
  • SubscriptionApiCommandMessageHandler: Subscribes to messages and writes received messages to the leader's dispatcher (if there is a leader for that partition).
  • LeaderManagementRequestHandler: To apply deployments across the partitions, the handler writes to the leader's dispatcher to replicate the deployment eventually.
  • CommandApiServiceImpl: Handles all incoming commands and writes them to the leader's dispatcher (so that these commands are appended to the log storage, replicated, and processed).

When a transition to another role is initiated on the broker layer (triggered by a Raft role change, installing a snapshot, etc.), a PartitionTransition is started that includes different steps to close the current services and restart them. This includes the StreamProcessor and the LogStorageAppender (which exists in the context of the LogStream):

  • The StreamProcessor is closed prior to the log stream (and hence before the dispatcher). Meaning, the stream processor won't write anything to the dispatcher anymore.
...
Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing StreamProcessor
...
Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStream"
...

Only when the transition has succeeded are all the other "write components" notified via the PartitionListener, for example by calling #onBecomingFollower(). In this case, if a transition from LEADER to FOLLOWER happens, all three components (SubscriptionApiCommandMessageHandler, LeaderManagementRequestHandler, CommandApiServiceImpl) remove their writer from their internal partition-to-stream-writer mapping (i.e., leadingStreams).
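In code, that notification path boils down to something like the following (simplified, with illustrative names rather than the exact Zeebe classes):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class LeaderWriterRegistry<W> {

  private final ConcurrentMap<Integer, W> leadingStreams = new ConcurrentHashMap<>();

  // Called via the PartitionListener when the broker becomes leader of a partition.
  void onBecomingLeader(final int partitionId, final W writer) {
    leadingStreams.put(partitionId, writer);
  }

  // Called via the PartitionListener when the broker becomes follower; after
  // this, requests for the partition are rejected instead of being written to
  // a dispatcher that may be closing.
  void onBecomingFollower(final int partitionId) {
    leadingStreams.remove(partitionId);
  }

  W writerForPartition(final int partitionId) {
    return leadingStreams.get(partitionId); // null => not leader, reject the request
  }
}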

@romansmirnov romansmirnov moved this from In progress to Ready in Zeebe Feb 1, 2022
@npepinpe npepinpe moved this from Ready to Planned in Zeebe Feb 3, 2022
@npepinpe npepinpe removed the blocker/info label Feb 3, 2022
npepinpe (Member) commented:

I think freeing the dispatcher's buffer eagerly was just part of the issue, i.e. it would help handle multiple consecutive transitions, which might cause a burst of allocated dispatchers that aren't freed immediately. I'm not sure it was the main cause. That said, I don't think there's any harm in doing it if we can guarantee it's safe to do. We would have to look into why we didn't merge this PR - #8632.

npepinpe (Member) commented:

Re-reading the issue, we can scope it to just ensuring we're freeing the dispatcher's byte buffer eagerly to avoid bursts of transitions causing too much memory to be allocated.

We will tackle ensuring resources are closed with the upcoming KR separately.

@npepinpe npepinpe assigned npepinpe and deepthidevaki and unassigned npepinpe May 9, 2022
deepthidevaki (Contributor) commented:

In our benchmarks, the frequent role transitions are in followers. When a follower receives a snapshot, it closes its current follower services and installs new ones, so the transitions are follower -> follower. Other transitions are usually triggered by restarts. Leader -> follower rarely happens (usually when the leader is restarted), and follower -> leader mostly happens only once during the lifetime of a pod. Frequent leader -> follower -> leader transitions happen when there are network partitions, which is not very common in our benchmarks or in a production setup.

One cause for having so many dispatcher buffers is that the follower StreamProcessor also creates a logstream writer, which opens the dispatcher. In the follower role, the StreamProcessor never writes to the logstream, so there is no need to create a writer and, as a result, no need to open a dispatcher.

If we fix the StreamProcessor in the follower role to not open the dispatcher, this would prevent the case where a lot of dispatcher buffers are open. This is only a partial fix, as it does not address the root cause (the buffer not being freed eagerly), but it would be easy to implement and would prevent the most common case that we observe in our benchmarks as well as in a production setup.

Zelldon (Member) commented May 12, 2022

Did you verify that?

One cause for having so many dispatcher buffers is that the follower StreamProcessor also creates a logstream writer, which opens the dispatcher. In the follower role, the StreamProcessor never writes to the logstream, so there is no need to create a writer and, as a result, no need to open a dispatcher.

Because when we implemented replay on followers we implemented a noop writer, so I would expect that we have no real writer.

deepthidevaki (Contributor) commented:

@Zelldon We still create the writer in StreamProcessor, even if we are not using it.

protected void onActorStarting() {
  actor.runOnCompletionBlockingCurrentPhase(
      logStream.newLogStreamBatchWriter(), this::onRetrievingWriter);
}
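For comparison, a rough sketch of the direction the fix took (see PR #9367 referenced further down, which creates the writer only after replay has completed; the method names around the replay callback are illustrative):

protected void onActorStarting() {
  // no writer requested here anymore: a follower only replays and never
  // writes to the log stream, so no dispatcher (and no direct buffer) is needed
}

// illustrative callback, invoked once replay has completed
private void onReplayCompleted() {
  if (shouldProcess()) { // only when the processor switches to processing (leader) mode
    actor.runOnCompletionBlockingCurrentPhase(
        logStream.newLogStreamBatchWriter(), this::onRetrievingWriter);
  }
}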

Zelldon (Member) commented May 12, 2022

Nice catch @deepthidevaki 🕵️‍♀️

deepthidevaki (Contributor) commented:

I tested a quick fix that does not create the writer in the follower StreamProcessor. Here are the observations from the benchmark.

With the fix:

zeebe-1 is a follower for all partitions and frequently receives InstallRequests, triggering role transitions.

[screenshot]

Base version (main branch)

zeebe-2 is a follower for all partitions and frequently receives InstallRequests, triggering role transitions.

[screenshot]

Direct memory usage with the fix is much lower compared to the main branch.

Zelldon (Member) commented May 12, 2022

👍

But still, there is something else that is increasing the memory 🤔 Did I understand that right?

deepthidevaki (Contributor) commented:

+1

But still, there is something else that is increasing the memory 🤔 Did I understand that right?

You mean the increase in Process memory?

I think it can be attributed to two things: 1. RocksDB memory, 2. mapped byte buffers.
For mapped buffers, I would assume it is a temporary thing and the OS would swap the memory as needed. Looking at the heap dump, there are many direct buffers linked to mapped buffers from the journal.
I'm not sure whether RocksDB is leaking memory.

We can check if OOM occurs again after this fix.

zeebe-bors-camunda bot added a commit that referenced this issue May 13, 2022
9367: Do not open dispatchers in follower role r=deepthidevaki a=deepthidevaki

## Description

In the follower role, the StreamProcessor runs only in replay mode. When the writer is created, the dispatcher is also opened, which allocates a direct buffer. This is unnecessary, as the writer is never used. The allocated buffer consumes memory and can create memory pressure on the system. To fix this, we create the writer only after the replay is completed.

## Related issues

related #8509 



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
deepthidevaki (Contributor) commented:

This is not a complete fix. The main cause, that DirectBuffers are not freed immediately, is still there. There is no real memory leak from the dispatchers, as far as I know, so the DirectBuffers will eventually be freed anyway. The above fix just removes the need to aggressively free DirectBuffers. The question then is: should I look into how to free the dispatcher buffers?

zeebe-bors-camunda bot added a commit that referenced this issue May 13, 2022
9376: [Backport  stable/8.0] Do not open dispatcher on follower role r=deepthidevaki a=deepthidevaki

## Description

Backport of #9367 

related #8509 


Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
npepinpe (Member) commented:

I think we should release direct memory as soon as possible, considering there's no easy way to recover when we run out of it other than crashing (correct me if I'm wrong). I think the main worry before was: how do we guarantee we won't access a freed buffer? It seems like this doesn't crash as I expected, though I'm not sure whether that is always guaranteed 🤷

zeebe-bors-camunda bot added a commit that referenced this issue May 16, 2022
9377: [Backport stable/1.3] Do not open dispatcher on follower role r=deepthidevaki a=deepthidevaki

## Description

Backport of #9367

related #8509

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
deepthidevaki (Contributor) commented:

OOM still happens

[screenshots]

Zeebe-2 became leader before the OOM. It uses only < 300 MB of direct memory, which is what is required for 3 leader partitions. So direct memory allocation is not the reason for this OOM.

Zelldon (Member) commented May 30, 2022

@deepthidevaki but the RocksDB memory usage is not part of it, right? We had this issue before, right?

deepthidevaki (Contributor) commented:

@Zelldon I suspect it is either RocksDB and/or the mapped buffers for the journal files. We should look into it. The RocksDB metrics don't show high memory usage, but from my previous experience, RocksDB uses much more memory than it reports.

@deepthidevaki deepthidevaki removed their assignment May 31, 2022
npepinpe (Member) commented Jun 1, 2022

As the OOM occurred on a long-running cluster after 2 weeks of constant load, and the cluster recovered quickly afterwards, we decided to postpone working on this for now. I would personally propose closing this, as by the time we look into it again it will most likely have changed quite a bit. Happy to be challenged on this though, let me know 👍

@Zelldon Zelldon added the version:8.1.0 label Oct 4, 2022