Multiple OOM encountered on benchmark cluster #8509

Closed
korthout opened this issue Jan 3, 2022 · 39 comments
Labels
area/reliability · kind/bug · scope/broker · severity/high · version:1.3.9 · version:8.1.0-alpha2 · version:8.1.0

Comments

korthout (Member) commented Jan 3, 2022

Describe the bug

The benchmark cluster for branch release-1.3.0 experienced multiple Out Of Memory (OOM) errors.

This is a potential regression, although the issue has likely existed for longer. Note that the resources for the benchmark project were reduced recently. See #8268

Occurrences

zeebe-2 @ 2021-12-27 ~11:21:45

Only a small dip in processing throughput
[screenshot]

GC briefly spiked and then dropped
[screenshot]

Simultaneously, JVM memory usage increased from a maximum of ~200MB to spikes above 500MB, and direct buffer pool memory usage doubled in this short window from ~400MB to ~860MB.
[screenshot]

During this time, RocksDB memory usage was similar to before: ~500MB per partition.
[screenshot]

Install requests were frequently sent 🤔
[screenshot]

It had just transitioned to INACTIVE and closed the database when it started to transition to FOLLOWER.
Soon after it opened the database, it stopped.

2021-12-27 11:21:22.684 CET "Transition to INACTIVE on term 12 completed" 
2021-12-27 11:21:22.734 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'." 
2021-12-27 11:21:22.784 CET "Committed new snapshot 289647968-12-833412599-833412170" 
2021-12-27 11:21:22.785 CET "Deleting previous snapshot 289590585-12-833268476-833246498" 
2021-12-27 11:21:22.787 CET "Scheduling log compaction up to index 289647968" 
2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}" 
2021-12-27 11:21:22.787 CET "RaftServer{raft-partition-partition-2}{role=FOLLOWER} - Delete existing log (lastIndex '289645353') and replace with received snapshot (index '289647968')" 
2021-12-27 11:21:22.816 CET "Transition to FOLLOWER on term 12 requested." 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ExporterDirector" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing SnapshotDirector" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing StreamProcessor" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing QueryService" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing ZeebeDb" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStream" 
2021-12-27 11:21:22.817 CET "Prepare transition from INACTIVE on term 12 to FOLLOWER - preparing LogStorage" 
2021-12-27 11:21:22.817 CET "Preparing transition from INACTIVE on term 12 completed" 
2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 starting" 
2021-12-27 11:21:22.817 CET "Transition to FOLLOWER on term 12 - transitioning LogStorage" 
2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning LogStream" 
2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-2-LogStream-2{status=HEALTHY}]" 
2021-12-27 11:21:22.818 CET "Transition to FOLLOWER on term 12 - transitioning ZeebeDb" 
2021-12-27 11:21:22.818 CET "Partition-2 recovered, marking it as healthy" 
2021-12-27 11:21:22.818 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-2-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]" 
2021-12-27 11:21:22.818 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/2/snapshots/289647968-12-833412599-833412170.checksum, checksum=2283870558, metadata=FileBasedSnapshotMetadata{index=289647968, term=12, processedPosition=833412599, exporterPosition=833412170}}" 
2021-12-27 11:21:22.915 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/2/runtime'." 
2021-12-27 11:21:22.915 CET "Transition to FOLLOWER on term 12 - transitioning QueryService" 
2021-12-27 11:21:22.916 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@649caa67)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2afb9843, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@6c21e905, configuration: Configuration(false)]" 
2021-12-27 11:21:22.916 CET "Transition to FOLLOWER on term 12 - transitioning StreamProcessor"  
2021-12-27 11:21:22.965 CET "request [POST http://elasticsearch-master:9200/_bulk] returned 1 warnings: [299 Elasticsearch-7.16.2-2b937c44140b6559905130a8650c64dbd0879cfb "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.16/security-minimal-setup.html to enable security."]"  
2021-12-27 11:21:25.069 CET  ++ hostname -f

zeebe-2 @ 2021-12-28 ~09:19:45, followed by zeebe-1 @ 2021-12-28 ~09:25:15

Just before the OOM, the starter and worker restarted, which might explain the loss of processing throughput.
[screenshot]

Zeebe-2 restarted at ~09:19:45, so the OOM should have happened just before that.
[screenshot]

Zeebe 2
If we filter on that pod alone, we see that it was actually processing as leader for a short time just before the OOM.
[screenshot]

GC is much quieter here before the OOM. JVM memory usage is about 600MB and direct buffer pool memory has again just increased to ~860MB (just like before). RocksDB is still stable at ~500MB per partition (no screenshot added).
[screenshot]

Zeebe 2 did not produce any interesting logs, as far as I could tell.

Zeebe 1
Zeebe-1 also does some processing as leader shortly before its OOM, ~5 min after zeebe-2 crashed.
[screenshot]

Zeebe-1 looks a lot like zeebe-2 when we look at the memory decomposition. Note the increase in direct buffer pool memory just before the OOM, like in the other cases.
[screenshot]

The partitions fully recovered, but about 1m30s after a snapshot was committed, an actor appeared blocked, which means the health tick was no longer updated. Directly after this, the pod died.

2021-12-28 09:22:16.229 CET "Partition-2 recovered, marking it as healthy"
2021-12-28 09:22:16.229 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:22:16.667 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=HEALTHY}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
2021-12-28 09:22:16.668 CET "Partition-1 recovered, marking it as healthy"
2021-12-28 09:22:16.668 CET "Detected 'HEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Broker-1-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:22:21.703 CET "Taking temporary snapshot into /usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654."
2021-12-28 09:22:21.907 CET "Finished taking temporary snapshot, need to wait until last written event position 1034640293 is committed, current commit position is 1034640235. After that snapshot will be committed."
2021-12-28 09:22:21.933 CET "Current commit position 1034640293 >= 1034640293, committing snapshot FileBasedTransientSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/3/pending/359974668-18-1034639769-1034638654, checksum=890364779, metadata=FileBasedSnapshotMetadata{index=359974668, term=18, processedPosition=1034639769, exporterPosition=1034638654}}."
2021-12-28 09:22:21.941 CET "Committed new snapshot 359974668-18-1034639769-1034638654"
2021-12-28 09:22:21.942 CET "Deleting previous snapshot 359646996-17-1033697878-1033694056"
2021-12-28 09:22:21.947 CET "Scheduling log compaction up to index 359974668"
2021-12-28 09:22:21.951 CET "raft-partition-partition-3 - Deleting log up from 359633252 up to 359947572 (removing 21 segments)"
2021-12-28 09:22:32.628 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.408 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.848 CET "Detected 'UNHEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, Broker-1-Exporter-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-1-LogStream-1{status=HEALTHY}, Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}, Broker-1-SnapshotDirector-1{status=HEALTHY}]"
2021-12-28 09:23:41.852 CET "Partition-1 failed, marking it as unhealthy: Broker-1{status=HEALTHY}"
2021-12-28 09:23:41.852 CET "Detected 'UNHEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
2021-12-28 09:23:41.861 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-2{status=HEALTHY}, Broker-1-Exporter-2{status=HEALTHY}, raft-partition-partition-2{status=HEALTHY}, Broker-1-LogStream-2{status=HEALTHY}, Broker-1-SnapshotDirector-2{status=HEALTHY}, Broker-1-StreamProcessor-2{status=HEALTHY}]"
2021-12-28 09:23:41.861 CET "Partition-2 recovered, marking it as healthy"
2021-12-28 09:23:41.861 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-ZeebePartition-2{status=HEALTHY}, Partition-1{status=UNHEALTHY, issue=Broker-1-StreamProcessor-1{status=UNHEALTHY, issue='actor appears blocked'}}, Partition-3{status=HEALTHY}]"
2021-12-28 09:24:11.884 CET "Detected 'UNHEALTHY' components. The current health status of components: [Broker-1-StreamProcessor-3{status=UNHEALTHY, issue='actor appears blocked'}, ZeebePartition-3{status=HEALTHY}, Broker-1-Exporter-3{status=HEALTHY}, raft-partition-partition-3{status=HEALTHY}, Broker-1-LogStream-3{status=HEALTHY}, Broker-1-SnapshotDirector-3{status=HEALTHY}]"
2021-12-28 09:24:39.044 CET ++ hostname -f

zeebe-2 @ 2021-12-28 ~22:50:00

Again only a small dip in processing throughput (nice and quick failover 🚀).
[screenshot]

Zeebe-2 was leader and processing before OOM
[screenshot]

Interestingly, the logs just before the restart of zeebe-2 at this time are practically identical to the logs of zeebe-2 from the first OOM (the day before, on the 27th).

Zeebe-2 had just transitioned to INACTIVE and closed the database. It was transitioning to FOLLOWER again, and just after it opened the database it was transitioning the StreamProcessor, which is the same transition it OOM-ed at the day before.

2021-12-28 22:49:37.461 CET "Transition to INACTIVE on term 16 completed"
2021-12-28 22:49:37.537 CET "Closed database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
2021-12-28 22:49:37.624 CET "Committed new snapshot 397383357-16-1142860979-1142860461"
2021-12-28 22:49:37.625 CET "Deleting previous snapshot 397252741-16-1142453760-1142669843"
2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Committed snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
2021-12-28 22:49:37.631 CET "Scheduling log compaction up to index 397383357"
2021-12-28 22:49:37.631 CET "RaftServer{raft-partition-partition-1}{role=FOLLOWER} - Delete existing log (lastIndex '397286333') and replace with received snapshot (index '397383357')"
2021-12-28 22:49:37.670 CET "Transition to FOLLOWER on term 16 requested."
2021-12-28 22:49:37.670 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ExporterDirector"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing SnapshotDirector"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing StreamProcessor"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing QueryService"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing ZeebeDb"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStream"
2021-12-28 22:49:37.671 CET "Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStorage"
2021-12-28 22:49:37.671 CET "Preparing transition from INACTIVE on term 16 completed"
2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 starting"
2021-12-28 22:49:37.671 CET "Transition to FOLLOWER on term 16 - transitioning LogStorage"
2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning LogStream"
2021-12-28 22:49:37.672 CET "Detected 'HEALTHY' components. The current health status of components: [ZeebePartition-1{status=HEALTHY}, raft-partition-partition-1{status=HEALTHY}, Broker-2-LogStream-1{status=HEALTHY}]"
2021-12-28 22:49:37.672 CET "Transition to FOLLOWER on term 16 - transitioning ZeebeDb"
2021-12-28 22:49:37.672 CET "Partition-1 recovered, marking it as healthy"
2021-12-28 22:49:37.673 CET "Detected 'HEALTHY' components. The current health status of components: [Partition-2{status=HEALTHY}, Broker-2-ZeebePartition-1{status=HEALTHY}, Partition-3{status=HEALTHY}]"
2021-12-28 22:49:37.673 CET "Recovering state from available snapshot: FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/1/snapshots/397383357-16-1142860979-1142860461.checksum, checksum=3576496807, metadata=FileBasedSnapshotMetadata{index=397383357, term=16, processedPosition=1142860979, exporterPosition=1142860461}}"
2021-12-28 22:49:37.837 CET "Opened database from '/usr/local/zeebe/data/raft-partition/partitions/1/runtime'."
2021-12-28 22:49:37.838 CET "Transition to FOLLOWER on term 16 - transitioning QueryService"
2021-12-28 22:49:37.840 CET "Engine created. [value-mapper: CompositeValueMapper(List(io.camunda.zeebe.el.impl.feel.MessagePackValueMapper@2465e772)), function-provider: io.camunda.zeebe.el.impl.feel.FeelFunctionProvider@2495802c, clock: io.camunda.zeebe.el.impl.ZeebeFeelEngineClock@263ffbf0, configuration: Configuration(false)]"
2021-12-28 22:49:37.841 CET "Transition to FOLLOWER on term 16 - transitioning StreamProcessor"
2021-12-28 22:49:39.701 CET ++ hostname -f

Looking at the logs from before that time, the partition keeps transitioning between FOLLOWER and INACTIVE and back for a long period (at least multiple hours). It's in a loop:

2021-12-28 16:36:14.030 CET partition-3 "Transition to LEADER on term 21 requested."
2021-12-28 16:36:14.127 CET partition-3 "Transition to LEADER on term 21 completed"
2021-12-28 16:36:19.476 CET partition-2 "Transition to LEADER on term 19 requested."
2021-12-28 16:36:19.590 CET partition-2 "Transition to LEADER on term 19 completed"
2021-12-28 16:44:13.078 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:44:13.084 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:44:13.301 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:44:13.514 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 16:54:13.701 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:54:13.705 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:54:13.987 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:54:14.206 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 16:59:14.028 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 16:59:14.032 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 16:59:14.294 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 16:59:14.541 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:04:14.683 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:04:14.687 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:04:15.346 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:04:15.545 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:09:15.002 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:09:15.006 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:09:15.233 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:09:15.492 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:14:15.248 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:14:15.253 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:14:15.631 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:14:15.891 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:19:15.953 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:19:15.956 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:19:16.219 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:19:16.428 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:24:15.936 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:24:15.940 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:24:16.216 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:24:16.425 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:29:17.013 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:29:17.016 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:29:17.265 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:29:17.482 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:34:17.042 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:34:17.046 CET partition-1 "Transition to INACTIVE on term 16 completed"
2021-12-28 17:34:17.380 CET partition-1 "Transition to FOLLOWER on term 16 requested."
2021-12-28 17:34:17.585 CET partition-1 "Transition to FOLLOWER on term 16 completed"
2021-12-28 17:39:17.481 CET partition-1 "Transition to INACTIVE on term 16 requested."
2021-12-28 17:39:17.484 CET partition-1 "Transition to INACTIVE on term 16 completed"
.... and so on, until 22:50:00

This also happened the day before: https://cloudlogging.app.goo.gl/7qpb4Rammh11eqYh6

Hypothesis
Looking at the above cases, it seems that a partition gets stuck in a transition loop between FOLLOWER and INACTIVE. Perhaps we have a memory leak in transitions.

@korthout korthout added the kind/bug, scope/broker, blocker/info, and Impact: Regression labels Jan 3, 2022
@npepinpe npepinpe added this to Planned in Zeebe Jan 4, 2022
npepinpe (Member) commented Jan 4, 2022

@romansmirnov @deepthidevaki - could this be related to #7992 ?

korthout (Member, Author) commented Jan 4, 2022

Happened again 2022-01-04 ~07:58:05 for zeebe-1, see https://cloudlogging.app.goo.gl/uYTT7vjcAUdKyzb9A for the moment of restart.

When you look at the "Transition to" logs, it's clear that all three brokers continuously transition between FOLLOWER and INACTIVE. Is that expected behavior?

EDIT: Briefly discussed with @Zelldon; he mentioned that this happens for install requests (i.e. receiving a snapshot). So these logs indicate that the followers receive new snapshots every 5 minutes. That's not optimal, but also not wrong.

korthout (Member, Author) commented Jan 4, 2022

Please note that this transitioning happens as well on our long running benchmarks for 1.3.0-alpha1, 1.3.0-alpha2 and another release-1.3.0 build, but not on the long-running-v1-for-minor-updates benchmark (which is currently running 1.2.9).

This indicates that something was introduced between 1.2.x and 1.3.0-alpha1 that makes followers lag behind. Note that the reduction of benchmark resources was done after 1.3.0-alpha2. See related logs. I've also checked that this log entry exists in the 1.2.x versions and would be logged if this transition happened.

@npepinpe That would be a regression in 1.3.0, is that blocking the release in your opinion?

EDIT: Deeper investigation shows that older versions are also sending install requests at a similar rate, but they just don't use the same transition logic and so don't log this Transition to line. @oleschoenburg told me that there is a configuration setting (default at 100) which determines whether to replicate log entries or the snapshot. Due to the throughput on the benchmarks, followers are generally lagging behind by a few thousand records, i.e. a leader produces about 2500 records per second when it's doing 200 simple PI/s. It seems this is simply misconfigured for our benchmarks. Likely this should be set to <number of seconds allowed to lag> * <number of records produced per partition per second>. I won't consider this a regression and I'll continue with the release. Sorry for the confusion and for conflating this issue with an unrelated problem.
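For a rough sense of scale (a back-of-the-envelope calculation, assuming the threshold is counted in records): at ~2500 records/s per partition, allowing for example 60 seconds of follower lag would correspond to a threshold around 2500 * 60 = 150000 records, orders of magnitude above the default of 100.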

deepthidevaki (Contributor) commented:

@romansmirnov @deepthidevaki - could this be related to #7992 ?

Not sure. Is a heap dump available?

korthout (Member, Author) commented Jan 4, 2022

No heap dumps were created

deepthidevaki (Contributor) commented:

No heap dumps were created

Was it then out of direct memory? Then it might be a different problem.

npepinpe (Member) commented Jan 4, 2022

IIRC it was a Kubernetes OOM, so it's not the JVM that crashed but the scheduler that killed the container.

korthout (Member, Author) commented Jan 7, 2022

These OOMs also occur on the benchmarks medic-cw-51-975b33b9e9-benchmark and medic-cw-01-7841f75abe-benchmark. So, IMO we no longer need the release-1-3-0 benchmark for this investigation, and I'd like to delete it. We would normally have deleted this benchmark as part of the post-release process, but I skipped that because we still needed to do this investigation.

Note that the release-1-3-0 benchmark also exists in the long-running-cluster, but that one does not suffer from OOMs.

If anyone is against deleting the release-1-3-0 benchmark from the zeebe-cluster, please respond before or during the medic handover. Otherwise, I'll delete it after the handover.

korthout (Member, Author) commented Jan 7, 2022

Benchmark release-1-3-0 is deleted.

@romansmirnov romansmirnov self-assigned this Jan 18, 2022
@romansmirnov romansmirnov moved this from Planned to In progress in Zeebe Jan 18, 2022
romansmirnov (Member) commented Jan 21, 2022

There is a memory leak! When the broker transitions between roles, it stops running services and starts new ones (depending on the old and new role). This includes stopping and starting the LogStream, whose dispatcher allocates a direct ByteBuffer.

However, when the ByteBuffer is dereferenced, the direct memory is not freed immediately. Meaning, unreachable DirectByteBuffer instances may not have been collected yet, so the direct memory is not released, and it is not deterministic when the allocated direct memory is actually freed.
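As a minimal illustration of the difference between dropping the reference and releasing eagerly (a sketch, not the actual dispatcher code; it assumes Agrona's BufferUtil.free, which Zeebe's buffer stack already depends on):

import java.nio.ByteBuffer;
import org.agrona.BufferUtil;

public final class DirectBufferRelease {

  public static void main(final String[] args) {
    // Variant 1: drop the reference and rely on the GC. The 32MB of native
    // memory stays reserved until a later GC cycle happens to collect the
    // DirectByteBuffer object, which is non-deterministic and may take a long
    // time under low heap pressure.
    ByteBuffer buffer = ByteBuffer.allocateDirect(32 * 1024 * 1024);
    buffer = null;

    // Variant 2: release eagerly. BufferUtil.free invokes the buffer's
    // cleaner and returns the native memory immediately. After this call the
    // buffer must never be touched again, otherwise the JVM can crash.
    final ByteBuffer other = ByteBuffer.allocateDirect(32 * 1024 * 1024);
    BufferUtil.free(other);
  }
}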

Basically, the role changes happen whenever there is a role change in the Raft layer, or when a new snapshot must be installed. This results in increased allocated direct memory:

[screenshot]

This is also visible in the Process Memory Usage

[screenshot]

The growing allocated direct memory correlates with the number of installed snapshots on a broker:

[screenshot]

Unfortunately, I was not able to reproduce the OOM, but I am quite confident that releasing allocated direct memory is one part of this issue. Meaning, if there are a lot of role changes, the number of unreachable direct buffers grows, but they are not garbage collected and the direct memory is not freed. In this example, there are 10 direct buffers where only 3 would be expected:

[screenshot]

When explicitly releasing the allocated memory, the allocated direct memory stays at a constant size (~400MB):

[screenshot]

Also, the Process Memory Usage is quite constant and does not grow with every installed snapshot:

[screenshot]

Also, the number of Direct Buffers in the heap dump is constant

[screenshot]

romansmirnov (Member) commented:

Just for reference: in the heap dump it is possible to execute an OQL (Object Query Language) query. The following query helps to troubleshoot memory leaks in direct memory:

SELECT x, x.capacity, x.position, x.limit FROM java.nio.DirectByteBuffer x WHERE ((x.capacity > (1024 * 1024)) and (x.cleaner != null))

It shows all DirectByteBuffer instances larger than 1MB that own their memory (cleaner != null), together with their capacity, current position, and limit.
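Besides heap dumps, the same numbers can also be sampled at runtime via the JDK's BufferPoolMXBean (presumably this is where the "direct buffer pool" panels in the screenshots come from; a minimal sketch):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public final class BufferPoolStats {

  public static void main(final String[] args) {
    // The JVM exposes one pool for direct buffers and one for mapped buffers.
    for (final BufferPoolMXBean pool :
        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
      System.out.printf(
          "%s: count=%d, used=%dMB, capacity=%dMB%n",
          pool.getName(),
          pool.getCount(),
          pool.getMemoryUsed() / (1024 * 1024),
          pool.getTotalCapacity() / (1024 * 1024));
    }
  }
}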

npepinpe (Member) commented:

Related, interesting bit from Netty's documentation about their own ByteBuf allocator:

When a new java.nio.ByteBuffer is allocated, its content is filled with zeroes. This "zeroing" consumes CPU cycles and memory bandwidth. Normally, the buffer is then immediately filled from some data source, so the zeroing did no good.

To be reclaimed, java.nio.ByteBuffer relies on the JVM garbage collector. It works OK for heap buffers, but not direct buffers. By design, direct buffers are expected to live a long time. Thus, allocating many short-lived direct NIO buffers often causes an OutOfMemoryError. Also, deallocating a direct buffer explicitly using the (hidden, proprietary) API isn't very fast.

Zelldon (Member) commented Jan 21, 2022

Do you think it might make more sense to reuse the dispatcher and just reset/clear it on transitions?

npepinpe (Member) commented Jan 21, 2022

Possibly - I don't know how resilient our dispatcher currently is, but if we assume we just reset the in-memory properties and zero the buffer, then that's probably still faster and less memory-intensive than freeing the buffer, allocating a new one, and zeroing it (which the JVM does).

At the same time, we know the pitfalls that come with reusing resources, so we'd have to make sure the reset/clear works correctly 😄
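To make the idea concrete, a rough sketch of what "reuse and reset" could look like (assuming Agrona's UnsafeBuffer; illustrative only, not the dispatcher's actual API):

import java.nio.ByteBuffer;
import org.agrona.concurrent.UnsafeBuffer;

final class ReusableDispatcherBuffer {

  private final ByteBuffer byteBuffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
  private final UnsafeBuffer buffer = new UnsafeBuffer(byteBuffer);

  // Called on a role transition instead of freeing and reallocating the buffer.
  void reset() {
    // Zero the whole buffer so no stale records can be read later.
    buffer.setMemory(0, buffer.capacity(), (byte) 0);
    // ...plus resetting publisher/subscriber positions, which is exactly the
    // part that would have to be proven correct before reuse is safe.
  }
}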

romansmirnov (Member) commented:

@npepinpe, thanks for your input.

I also read Netty's comment about zeroing and dug a bit into it. But I don't see this as an issue in our case, because when doing the transitions between roles, Zeebe is not on the critical path (or data path). Of course, the transitions should happen quickly and not take ages, especially when Zeebe transitions to the LEADER role, so that it can start processing quickly in a failover scenario. But as long as the cluster runs stably, direct memory allocation shouldn't be a problem, because Zeebe does not allocate direct memory while processing as the LEADER of a partition. The creation/closing of the dispatcher happens only during transitions.

That's why I would like to keep the scope on solving the "memory leak" in the direct memory by releasing the direct memory when closing the dispatcher (and keeping the performance topic out of scope for now). That way, a Zeebe broker node can "survive" many role transitions in a short timeframe caused by

  1. Leadership changes caused by disruptive brokers in the cluster
  2. Receiving many snapshots as a FOLLOWER

When releasing the direct memory, one issue arises: other components (like the StreamProcessor or Command Request Handler) must not try to write to (or read from) the direct memory afterwards; otherwise, they would access an illegal address and the JVM would crash. My current approach would be to ensure that all relevant components are closed/notified about the closing of the dispatcher before the actual close happens. Alternatively, the dispatcher is only opened when transitioning to the LEADER state.
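A sketch of that ordering (the writer interface and method names here are hypothetical placeholders, not Zeebe's actual types; it only illustrates "close all writers first, free the buffer last"):

import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.agrona.BufferUtil;

final class DispatcherCloseSequence {

  interface DispatcherWriter {
    CompletableFuture<Void> closeAsync();
  }

  static CompletableFuture<Void> closeAndFree(
      final List<DispatcherWriter> writers, final ByteBuffer dispatcherBuffer) {
    // 1. Ask every writing component (StreamProcessor, command handlers, ...)
    //    to stop using the dispatcher.
    final CompletableFuture<?>[] closed =
        writers.stream().map(DispatcherWriter::closeAsync).toArray(CompletableFuture[]::new);

    // 2. Only after all of them have completed, release the direct memory.
    //    Freeing earlier risks a native crash (SIGSEGV/SIGBUS) on a late write.
    return CompletableFuture.allOf(closed)
        .thenRun(() -> BufferUtil.free(dispatcherBuffer));
  }
}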

@npepinpe, please let me know if you want to discuss this issue.

npepinpe (Member) commented Jan 24, 2022

That's fine, in my opinion we're still at a stage where correctness trumps performance most of the time. And as you mentioned, I doubt the performance gain/loss is noticeable anyway.

Regarding the second point, since there's no way to recover from or handle a SIGSEGV or SIGBUS, I'd like to have the strongest possible guarantees that we don't try to read from/write to freed memory. Are we confident that we can guarantee all components are closed before freeing the buffer? Can we offer stronger guarantees than that? Possibly not, but it doesn't hurt to spend a bit of time exploring our options, because ensuring all relevant components are closed/notified is hard to do, and especially hard to future-proof, in general (although maybe I misunderstood your proposal). OTOH, could we potentially delegate the task of writing to the buffer to the dispatcher? E.g. claim a segment of memory, then pass a BufferWriter to the dispatcher, such that writing also happens on the dispatcher's actor? This is probably less performant, but it would give a single point of control over the buffer, which may result in fewer errors.
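Roughly what the "write via the dispatcher's actor" idea could look like (all names and interfaces here are hypothetical, only to illustrate the single point of control; a plain Executor stands in for the dispatcher's actor):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import org.agrona.MutableDirectBuffer;
import org.agrona.concurrent.UnsafeBuffer;

final class ActorOwnedDispatcher {

  interface BufferWriter {
    int getLength();

    void write(MutableDirectBuffer buffer, int offset);
  }

  private final Executor dispatcherActor;
  private final UnsafeBuffer buffer;
  private int writeOffset;

  ActorOwnedDispatcher(final Executor dispatcherActor, final UnsafeBuffer buffer) {
    this.dispatcherActor = dispatcherActor;
    this.buffer = buffer;
  }

  // The only code path that touches the underlying memory runs on the
  // dispatcher's actor, so closing/freeing can be serialized with all writes.
  CompletableFuture<Integer> write(final BufferWriter writer) {
    return CompletableFuture.supplyAsync(
        () -> {
          final int offset = writeOffset;
          writer.write(buffer, offset);
          writeOffset += writer.getLength();
          return offset;
        },
        dispatcherActor);
  }
}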

💭 At the same time, a SIGSEGV or SIGBUS on the dispatcher will not cause any permanent issues (e.g. data corruption/loss), so I suppose it's not the worst thing that can happen, compared to reusing the same dispatcher and potentially writing the wrong things and causing corruption.

If you want, we can discuss this or brainstorm a solution - I'm free tomorrow afternoon. If that's blocking you, you can always grab someone else from the team, like Ole or Chris.

romansmirnov (Member) commented:

Just a quick summary: there are multiple "components" that write to the dispatcher, and only one "component" that reads from it.

Who reads from the dispatcher?

  • LogStorageAppender: Reads the blocks written to the dispatcher and appends them to the log storage.

Who writes to the dispatcher?

  • StreamProcessor: When the stream processor processes events from the logstream, it may write follow-up events, etc. Those events are written to the dispatcher (and are then read by the log storage appender).
  • SubscriptionApiCommandMessageHandler: Subscribes to messages and writes received messages to the leader's dispatcher (if there is a leader for that partition).
  • LeaderManagementRequestHandler: To apply deployments across the partitions, the handler writes to the leader's dispatcher to replicate the deployment eventually.
  • CommandApiServiceImpl: Handles all incoming commands and writes them to the leader's dispatcher (so that these commands are appended to the log storage, replicated, and processed).

When a transition to another role is initiated on the broker layer (triggered by a Raft role change, installing a snapshot, etc.), a PartitionTransition is started that includes different steps to close the current services and restart them. This includes the StreamProcessor and the LogStorageAppender (which exists in the context of the LogStream):

  • The StreamProcessor is closed prior to the log stream (and hence before the dispatcher). Meaning, the stream processor won't write anything to the dispatcher anymore.
...
Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing StreamProcessor
...
Prepare transition from INACTIVE on term 16 to FOLLOWER - preparing LogStream"
...

Only when the transition has succeeded are all the other "write components" notified via the PartitionListener, for example by calling #onBecomingFollower(). In this case, if a transition from LEADER to FOLLOWER happens, all three components (SubscriptionApiCommandMessageHandler, LeaderManagementRequestHandler, CommandApiServiceImpl) remove their writer from their internal partition-to-stream-writer mapping (i.e., leadingStreams).
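In code, that notification path boils down to something like the following (simplified, with illustrative names rather than the exact Zeebe classes):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class LeaderWriterRegistry<W> {

  private final ConcurrentMap<Integer, W> leadingStreams = new ConcurrentHashMap<>();

  // Called via the PartitionListener when the broker becomes leader of a partition.
  void onBecomingLeader(final int partitionId, final W writer) {
    leadingStreams.put(partitionId, writer);
  }

  // Called via the PartitionListener when the broker becomes follower; after
  // this, requests for the partition are rejected instead of being written to
  // a dispatcher that may be closing.
  void onBecomingFollower(final int partitionId) {
    leadingStreams.remove(partitionId);
  }

  W writerForPartition(final int partitionId) {
    return leadingStreams.get(partitionId); // null => not leader, reject the request
  }
}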

@romansmirnov romansmirnov moved this from In progress to Ready in Zeebe Feb 1, 2022
@npepinpe npepinpe moved this from Ready to Planned in Zeebe Feb 3, 2022
@npepinpe npepinpe removed the blocker/info label Feb 3, 2022
npepinpe (Member) commented:

I think freeing the dispatcher's buffer eagerly was just part of the issue, i.e. it would help handle multiple consecutive transitions, which might cause a burst of allocated dispatchers that aren't freed immediately. I'm not sure it was the main cause. That said, I don't think there's any harm in doing it if we can guarantee it's safe to do. We would have to look into why we didn't merge this PR - #8632.

npepinpe (Member) commented:

Re-reading the issue, we can scope it to just ensuring we're freeing the dispatcher's byte buffer eagerly to avoid bursts of transitions causing too much memory to be allocated.

We will tackle ensuring resources are closed with the upcoming KR separately.

@npepinpe npepinpe assigned npepinpe and deepthidevaki and unassigned npepinpe May 9, 2022
deepthidevaki (Contributor) commented:

In our benchmarks, the frequent role transitions are in followers. When a follower receives a snapshot, it closes its current follower services and installs new ones, so the transitions are follower -> follower. Other transitions are usually triggered by restarts. Leader -> follower rarely happens (usually when the leader is restarted), and follower -> leader mostly happens only once during the lifetime of a pod. Frequent leader -> follower -> leader transitions happen when there are network partitions, which is not very common in our benchmarks or in a production setup.

One cause for having so many dispatcher buffers is that the follower StreamProcessor also creates a logstream writer, which opens the dispatcher. In the follower role, the StreamProcessor never writes to the logstream, so there is no need to create a writer and, as a result, no need to open a dispatcher.

If we fix the StreamProcessor in the follower role to not open the dispatcher, this would prevent the case where a lot of dispatcher buffers are open. This is only a partial fix, as it does not address the root cause (the buffer not being freed eagerly), but it would be easy to implement and would prevent the most common case that we observe in our benchmarks as well as in a production setup.

Zelldon (Member) commented May 12, 2022

Did you verify that?

One cause for having so many dispatcher buffers is that the follower StreamProcessor also creates a logstream writer, which opens the dispatcher. In the follower role, the StreamProcessor never writes to the logstream, so there is no need to create a writer and, as a result, no need to open a dispatcher.

Because when we implemented replay on followers we implemented a noop writer, so I would expect that we have no real writer.

deepthidevaki (Contributor) commented:

@Zelldon We still create the writer in StreamProcessor, even if we are not using it.

protected void onActorStarting() {
  actor.runOnCompletionBlockingCurrentPhase(
      logStream.newLogStreamBatchWriter(), this::onRetrievingWriter);
}
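For comparison, a rough sketch of the direction the fix took (see PR #9367 referenced further down, which creates the writer only after replay has completed; the method names around the replay callback are illustrative):

protected void onActorStarting() {
  // no writer requested here anymore: a follower only replays and never
  // writes to the log stream, so no dispatcher (and no direct buffer) is needed
}

// illustrative callback, invoked once replay has completed
private void onReplayCompleted() {
  if (shouldProcess()) { // only when the processor switches to processing (leader) mode
    actor.runOnCompletionBlockingCurrentPhase(
        logStream.newLogStreamBatchWriter(), this::onRetrievingWriter);
  }
}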

Zelldon (Member) commented May 12, 2022

Nice catch @deepthidevaki 🕵️‍♀️

deepthidevaki (Contributor) commented:

I tested a quick fix that does not create the writer in the follower StreamProcessor. Here are the observations from the benchmark.

With the fix:

zeebe-1 is a follower for all partitions and frequently receives InstallRequests, triggering role transitions.

[screenshot]

Base version (main branch)

zeebe-2 is a follower for all partitions and frequently receives InstallRequests, triggering role transitions.

[screenshot]

Direct memory usage with the fix is much lower compared to the main branch.

Zelldon (Member) commented May 12, 2022

👍

But still, there is something else that is increasing the memory 🤔 Did I understand that right?

deepthidevaki (Contributor) commented:

+1

But still, there is something else that is increasing the memory 🤔 Did I understand that right?

You mean the increase in Process memory?

I think it can be attributed to two things: 1. RocksDB memory, 2. mapped byte buffers.
For mapped buffers, I would assume it is a temporary thing and the OS would swap the memory as needed. Looking at the heap dump, there are many direct buffers linked to mapped buffers from the journal.
I'm not sure whether RocksDB is leaking memory.

We can check if OOM occurs again after this fix.

zeebe-bors-camunda bot added a commit that referenced this issue May 13, 2022
9367: Do not open dispatchers in follower role r=deepthidevaki a=deepthidevaki

## Description

In the follower role, the StreamProcessor runs only in replay mode. When the writer is created, the dispatcher is also opened, which allocates a direct buffer. This is unnecessary, as the writer is never used. The allocated buffer consumes memory and can create memory pressure on the system. To fix this, we create the writer only after the replay is completed.

## Related issues

related #8509 



Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
deepthidevaki (Contributor) commented:

This is not a complete fix. The main cause, that DirectBuffers are not freed immediately, is still there. There is no real memory leak from the dispatchers, as far as I know, so the DirectBuffers will eventually be freed anyway. The above fix just removes the need to aggressively free DirectBuffers. The question then is: should I look into how to free the dispatcher buffers?

zeebe-bors-camunda bot added a commit that referenced this issue May 13, 2022
9376: [Backport  stable/8.0] Do not open dispatcher on follower role r=deepthidevaki a=deepthidevaki

## Description

Backport of #9367 

related #8509 


Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
npepinpe (Member) commented:

I think we should release direct memory as soon as possible, considering there's no easy way to recover when we run out of it other than crashing (correct me if I'm wrong). I think the main worry before was: how do we guarantee we won't access a freed buffer? It seems like this doesn't crash as I expected, though I'm not sure whether that is always guaranteed 🤷

zeebe-bors-camunda bot added a commit that referenced this issue May 16, 2022
9377: [Backport stable/1.3] Do not open dispatcher on follower role r=deepthidevaki a=deepthidevaki

## Description

Backport of #9367

related #8509

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
deepthidevaki (Contributor) commented:

OOM still happens

[screenshots]

Zeebe-2 became leader before the OOM. It uses only < 300 MB of direct memory, which is what is required for 3 leader partitions. So direct memory allocation is not the reason for this OOM.

Zelldon (Member) commented May 30, 2022

@deepthidevaki but the RocksDB memory usage is not part of it, right? We had this issue before, right?

deepthidevaki (Contributor) commented:

@Zelldon I suspect it is either RocksDB and/or the mapped buffers for the journal files. We should look into it. The RocksDB metrics don't show high memory usage, but from my previous experience, RocksDB uses much more memory than it reports.

@deepthidevaki deepthidevaki removed their assignment May 31, 2022
npepinpe (Member) commented Jun 1, 2022

As the OOM occurred on a long-running cluster after 2 weeks of constant load, and the cluster recovered quickly afterwards, we decided to postpone working on this for now. I would personally propose closing this, as by the time we look into it again it will most likely have changed quite a bit. Happy to be challenged on this though, let me know 👍

@Zelldon Zelldon added the version:8.1.0 label Oct 4, 2022