
Snapshotting process is stuck and no new snapshots are taken #7207


Closed
deepthidevaki opened this issue Jun 7, 2021 · 3 comments
Labels
kind/bug: Categorizes an issue or PR as a bug
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
severity/high: Marks a bug as having a noticeable impact on the user with no known workaround

Comments

@deepthidevaki
Contributor

Describe the bug

A user reported that a broker's disk went full. Looking into the logs, I observed the following: a snapshot is taken, but we don't see a log saying "Created snapshot", which is the last step in committing a snapshot. It looks like the snapshot committing process is stuck just before that, specifically at invoking the snapshot listeners.

The following is the last snapshot-related log for partition 9. A few hours after this, the broker ran out of disk space.

zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Based on lowest exporter position '369272775' and last processed position '369272775', determined '369272775' as snapshot position.
zeebe_broker1 | 2021-06-06 15:13:36.042 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-4] DEBUG
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Taking temporary snapshot into /usr/local/zeebe/data/raft-partition/partitions/9/pending/131936949-3-369272775-369272775.
zeebe_broker1 | 2021-06-06 15:13:36.190 [Broker-1-SnapshotDirector-9] [Broker-1-zb-fs-workers-4] DEBUG
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Created pending snapshot for Broker-1-StreamProcessor-9
zeebe_broker1 | 2021-06-06 15:13:36.190 [Broker-1-SnapshotDirector-9] [Broker-1-zb-fs-workers-4] INFO
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Finished taking snapshot, need to wait until last written event position 369272848 is committed, current commit position is 369272775. After that snapshot can be marked as valid.
zeebe_broker1 | 2021-06-06 15:13:36.206 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Purging snapshots older than FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131936949-3-369272775-369272775, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131936949-3-369272775-369272775.checksum, checksum=280791072, metadata=FileBasedSnapshotMetadata{index=131936949, term=3, processedPosition=369272775, exporterPosition=369272775}}
zeebe_broker1 | 2021-06-06 15:13:36.206 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Deleting snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131787534-3-368772735-368772739, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131787534-3-368772735-368772739.checksum, checksum=167712353, metadata=FileBasedSnapshotMetadata{index=131787534, term=3, processedPosition=368772735, exporterPosition=368772739}}
zeebe_broker1 | 2021-06-06 15:13:36.207 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Search for orphaned snapshots below oldest valid snapshot with index FileBasedSnapshotMetadata{index=131936949, term=3, processedPosition=369272775, exporterPosition=369272775} in /usr/local/zeebe/data/raft-partition/partitions/9/pending
zeebe_broker1 | 2021-06-06 15:13:36.208 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - New snapshot 131936949-3-369272775-369272775 was persisted. Start replicating.
zeebe_broker1 | 2021-06-06 15:13:36.208 [Broker-1-DeletionService-9] [Broker-1-zb-actors-4] DEBUG
zeebe_broker1 |       io.camunda.zeebe.broker.logstreams.delete - Compacting Atomix log up to index 131936949

Similar logs were also observed on the leaders of other partitions.

It looks like the persist call never completes. Hence, AsyncSnapshotDirector does not take a new snapshot in the next snapshot interval because the previous snapshot has not completed.
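
To make the failure mode concrete, here is a minimal sketch (a simplified stand-in, not the actual AsyncSnapshotDirector code) of how a persist future that never completes blocks every later snapshot:

```java
import java.util.concurrent.CompletableFuture;

// Simplified stand-in for the director's scheduling logic, not the real Zeebe class.
final class SnapshotDirectorSketch {
  private CompletableFuture<Void> inFlightSnapshot; // completed by the persist step

  // Called on every snapshot interval.
  void onSnapshotInterval() {
    if (inFlightSnapshot != null && !inFlightSnapshot.isDone()) {
      // If the persist call never completes (e.g. a listener died with an Error),
      // this branch is taken on every interval and no new snapshot is ever started.
      // Logging this decision would have made the stall visible in the logs.
      return;
    }
    inFlightSnapshot = takeAndPersistSnapshot();
  }

  private CompletableFuture<Void> takeAndPersistSnapshot() {
    // Placeholder for: take transient snapshot -> wait for commit -> persist.
    return new CompletableFuture<>();
  }
}
```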

To Reproduce

Not sure how to reproduce. This was observed in a long running cluster. Brokers are not restarted ever, though there have been leader changes in between.

Expected behavior

  • Snapshotting and compaction happen as long as the exporter and processing are making progress.

Environment:

  • Zeebe Version: 1.0.0
deepthidevaki added the kind/bug, scope/broker, and severity/high labels on Jun 7, 2021
@deepthidevaki
Contributor Author

Some possible improvements to handle similar cases:

  1. Improve logging: it would be better to log when AsyncSnapshotDirector does not take a snapshot. In this case, something like "Not taking snapshot because the previous snapshot process is not completed" would have been useful.
  2. Split listener invocation out of FileBasedSnapshotStore#newSnapshot. Currently, invoking the listeners and deleting the previous snapshot are done in the same context as committing the snapshot. This is not necessary; all of these tasks can be done asynchronously after the snapshot is committed (see the sketch after this list).
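
A rough sketch of idea 2, using simplified types rather than the real FileBasedSnapshotStore API: complete the persist result as soon as the snapshot is committed, and run listeners and cleanup as separately scheduled tasks, so a failing listener cannot keep the persist call from completing.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.Executor;

// Illustration only; types and method names are simplified stand-ins.
final class SnapshotStoreSketch {
  private final Set<Runnable> listeners = new CopyOnWriteArraySet<>();
  private final Executor executor;

  SnapshotStoreSketch(final Executor executor) {
    this.executor = executor;
  }

  CompletableFuture<Void> newSnapshot() {
    commitSnapshotOnDisk();

    // Complete the caller's future as soon as the snapshot is durably committed...
    final CompletableFuture<Void> persisted = CompletableFuture.completedFuture(null);

    // ...and notify listeners / purge old snapshots as separate tasks, so that a
    // listener failing (even with an Error such as OutOfMemoryError) cannot keep
    // the persist call from completing.
    listeners.forEach(listener -> executor.execute(listener));
    executor.execute(this::deleteOlderSnapshots);

    return persisted;
  }

  private void commitSnapshotOnDisk() { /* move + fsync the pending snapshot */ }

  private void deleteOlderSnapshots() { /* purge snapshots older than the new one */ }
}
```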

@miguelpires
Contributor

What happened:

  • We take a snapshot and try to replicate it
  • During replication, we load a snapshot chunk and get an OOM error (around 06/06 - 15:13:36)
  • It's not handled anywhere because OutOfMemoryError is an Error, not an Exception (see the snippet after this list)
  • After that, no snapshots are taken for the partition, so the log isn't compacted and, eventually, the disk becomes full
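
For context on the second-to-last point: OutOfMemoryError extends Error, so a typical catch (Exception e) around the listener invocation lets it propagate and abort the actor job. A small self-contained illustration (not Zeebe code; the error is simulated):

```java
// Demonstrates why catching Exception does not cover OutOfMemoryError.
public final class ErrorVsException {
  public static void main(final String[] args) {
    try {
      throw new OutOfMemoryError("simulated"); // stand-in for the real allocation failure
    } catch (final Exception e) {
      // Never reached: OutOfMemoryError extends Error, which is not an Exception.
      System.out.println("handled exception: " + e);
    } catch (final Throwable t) {
      // Only a Throwable (or Error) handler would see it.
      System.out.println("caught error: " + t);
    }
  }
}
```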

Stacktrace:

zeebe_broker1 | Exception in thread "Broker-1-zb-fs-workers-6" java.lang.OutOfMemoryError: Direct buffer memory
zeebe_broker1 |  at java.base/java.nio.Bits.reserveMemory(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.DirectByteBuffer.<init>(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.ByteBuffer.allocateDirect(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.Util.getTemporaryDirectBuffer(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.FileChannelImpl.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.ChannelInputStream.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.ChannelInputStream.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.ChannelInputStream.read(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.file.Files.read(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.file.Files.readAllBytes(Unknown Source)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.SnapshotChunkUtil.createSnapshotChunkFromFile(SnapshotChunkUtil.java:37)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotChunkReader.next(FileBasedSnapshotChunkReader.java:104)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotChunkReader.next(FileBasedSnapshotChunkReader.java:32)
zeebe_broker1 |  at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.onNewSnapshot(StateControllerImpl.java:194)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.lambda$newSnapshot$11(FileBasedSnapshotStore.java:450)
zeebe_broker1 |  at java.base/java.util.concurrent.CopyOnWriteArrayList.forEach(Unknown Source)
zeebe_broker1 |  at java.base/java.util.concurrent.CopyOnWriteArraySet.forEach(Unknown Source)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newSnapshot(FileBasedSnapshotStore.java:450)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedTransientSnapshot.lambda$persist$3(FileBasedTransientSnapshot.java:120)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:136)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:191)
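
As an aside on the allocation itself: the trace shows Files.readAllBytes loading the whole chunk file in one call, with the NIO channel read behind it allocating a temporary direct buffer of matching size, which is where the allocation failed. Below is a hedged sketch of reading a file through a small fixed-size buffer instead (a hypothetical helper, not the Zeebe SnapshotChunkUtil API; it only bounds individual allocations and would not help if direct memory were already exhausted elsewhere).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a file through a fixed-size buffer so each underlying
// channel read (and the temporary direct buffer NIO allocates for it) stays small,
// instead of a single read sized to the whole file.
final class BoundedFileReader {
  private static final int BUFFER_SIZE = 64 * 1024;

  static byte[] readAllBytesBounded(final Path file) throws IOException {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    final byte[] buffer = new byte[BUFFER_SIZE];
    try (InputStream in = Files.newInputStream(file)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
    }
    return out.toByteArray();
  }
}
```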

Metrics:
(Grafana screenshot: Grafana-RES)
This wasn't the first OOM error; the first occurred at 14:52:27. Some other brokers also hit the same problem soon after.

Regarding the question of why there was an OOM error, it's hard to tell without further information. I can ask what their JVM configuration was and see if it might have been related. However, what we can do is enable -XX:+ExitOnOutOfMemoryError by default (and also inform the affected users of this option as something they can enable now). This would have made the broker crash on the first OOM instead of running in a broken state until the disk was full.
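
To make the effect of that flag concrete, here is a small stand-alone example (simulated with a heap OOM rather than the direct-buffer OOM from the stacktrace; not broker code):

```java
import java.util.ArrayList;
import java.util.List;

// Run with:  java -Xmx32m OomDemo
//   -> the worker thread dies, but the process keeps running in a degraded state.
// Run with:  java -Xmx32m -XX:+ExitOnOutOfMemoryError OomDemo
//   -> the JVM terminates on the first OutOfMemoryError instead.
public final class OomDemo {
  public static void main(final String[] args) throws InterruptedException {
    final Thread worker = new Thread(() -> {
      final List<byte[]> hoard = new ArrayList<>();
      while (true) {
        hoard.add(new byte[8 * 1024 * 1024]); // exhaust the 32m heap
      }
    }, "worker");
    worker.start();
    worker.join();

    // Without the flag we reach this line: the JVM survives the OOM, which is what
    // kept the stuck broker alive until its disk filled up.
    System.out.println("main thread still alive after worker hit OOM");
  }
}
```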

ghost pushed a commit that referenced this issue Jun 11, 2021

7251: fix(dist): enable exit on OOM java opt r=npepinpe a=MiguelPires

## Description

Enables '-XX:+ExitOnOutOfMemoryError' by default. If we get an OOM, we should let the broker crash, since it's very difficult to recover from and keeping the broker running might cause other problems (e.g., #7207)

## Related issues


related to #7207



Co-authored-by: Miguel Pires <miguel.pires@camunda.com>
@npepinpe
Member

Let's close this for now and reopen when we have reproduced it with a heap dump.
