
Snapshotting process is stuck and no new snapshots are taken #7207


Closed
deepthidevaki opened this issue Jun 7, 2021 · 3 comments
Labels
kind/bug: Categorizes an issue or PR as a bug
scope/broker: Marks an issue or PR to appear in the broker section of the changelog
severity/high: Marks a bug as having a noticeable impact on the user with no known workaround

Comments

@deepthidevaki
Contributor

Describe the bug

A user reported that a broker's disk went full. Looking into the logs, I observed the following: a snapshot is taken, but we don't see a log saying "Created snapshot", which is the last step in committing a snapshot. It looks like the snapshot committing process is stuck just before that, specifically at invoking the snapshot listeners.

The following is the last snapshot-related log for partition 9. A few hours after this, the broker ran out of disk space.

zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Based on lowest exporter position '369272775' and last processed position '369272775', determined '369272775' as snapshot position.
zeebe_broker1 | 2021-06-06 15:13:36.042 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-4] DEBUG
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Taking temporary snapshot into /usr/local/zeebe/data/raft-partition/partitions/9/pending/131936949-3-369272775-369272775.
zeebe_broker1 | 2021-06-06 15:13:36.190 [Broker-1-SnapshotDirector-9] [Broker-1-zb-fs-workers-4] DEBUG
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Created pending snapshot for Broker-1-StreamProcessor-9
zeebe_broker1 | 2021-06-06 15:13:36.190 [Broker-1-SnapshotDirector-9] [Broker-1-zb-fs-workers-4] INFO
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - Finished taking snapshot, need to wait until last written event position 369272848 is committed, current commit position is 369272775. After that snapshot can be marked as valid.
zeebe_broker1 | 2021-06-06 15:13:36.206 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Purging snapshots older than FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131936949-3-369272775-369272775, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131936949-3-369272775-369272775.checksum, checksum=280791072, metadata=FileBasedSnapshotMetadata{index=131936949, term=3, processedPosition=369272775, exporterPosition=369272775}}
zeebe_broker1 | 2021-06-06 15:13:36.206 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Deleting snapshot FileBasedSnapshot{directory=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131787534-3-368772735-368772739, checksumFile=/usr/local/zeebe/data/raft-partition/partitions/9/snapshots/131787534-3-368772735-368772739.checksum, checksum=167712353, metadata=FileBasedSnapshotMetadata{index=131787534, term=3, processedPosition=368772735, exporterPosition=368772739}}
zeebe_broker1 | 2021-06-06 15:13:36.207 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore - Search for orphaned snapshots below oldest valid snapshot with index FileBasedSnapshotMetadata{index=131936949, term=3, processedPosition=369272775, exporterPosition=369272775} in /usr/local/zeebe/data/raft-partition/partitions/9/pending
zeebe_broker1 | 2021-06-06 15:13:36.208 [Broker-1-SnapshotStore-9] [Broker-1-zb-fs-workers-6] DEBUG
zeebe_broker1 |       io.camunda.zeebe.logstreams.snapshot - New snapshot 131936949-3-369272775-369272775 was persisted. Start replicating.
zeebe_broker1 | 2021-06-06 15:13:36.208 [Broker-1-DeletionService-9] [Broker-1-zb-actors-4] DEBUG
zeebe_broker1 |       io.camunda.zeebe.broker.logstreams.delete - Compacting Atomix log up to index 131936949

Similar logs were also observed on the leaders of other partitions.

It looks like the persist call never completes. Hence, AsyncSnapshotDirector does not take a new snapshot in the next snapshot interval because the previous snapshot has not completed.
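
To make the failure mode concrete, here is a minimal sketch (a simplified stand-in, not the actual AsyncSnapshotDirector code) of how a persist future that never completes blocks every later snapshot:

```java
import java.util.concurrent.CompletableFuture;

// Simplified stand-in for the director's scheduling logic, not the real Zeebe class.
final class SnapshotDirectorSketch {
  private CompletableFuture<Void> inFlightSnapshot; // completed by the persist step

  // Called on every snapshot interval.
  void onSnapshotInterval() {
    if (inFlightSnapshot != null && !inFlightSnapshot.isDone()) {
      // If the persist call never completes (e.g. a listener died with an Error),
      // this branch is taken on every interval and no new snapshot is ever started.
      // Logging this decision would have made the stall visible in the logs.
      return;
    }
    inFlightSnapshot = takeAndPersistSnapshot();
  }

  private CompletableFuture<Void> takeAndPersistSnapshot() {
    // Placeholder for: take transient snapshot -> wait for commit -> persist.
    return new CompletableFuture<>();
  }
}
```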

To Reproduce

Not sure how to reproduce. This was observed in a long running cluster. Brokers are not restarted ever, though there have been leader changes in between.

Expected behavior

  • Snapshotting and compaction happen as long as the exporter and processing are making progress.

Environment:

  • Zeebe Version: 1.0.0
deepthidevaki added the kind/bug, scope/broker, and severity/high labels on Jun 7, 2021
@deepthidevaki
Contributor Author

Some possible improvements to handle similar cases:

  1. Improve logging: it would be better to log when AsyncSnapshotDirector does not take a snapshot. In this case, something like "Not taking snapshot because the previous snapshot process is not completed" would have been useful.
  2. Split listener invocation out of FileBasedSnapshotStore#newSnapshot. Currently, invoking the listeners and deleting the previous snapshot are done in the same context as committing the snapshot. This is not necessary; all of these tasks can be done asynchronously after the snapshot is committed (see the sketch after this list).
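
A rough sketch of idea 2, using simplified types rather than the real FileBasedSnapshotStore API: complete the persist result as soon as the snapshot is committed, and run listeners and cleanup as separately scheduled tasks, so a failing listener cannot keep the persist call from completing.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.Executor;

// Illustration only; types and method names are simplified stand-ins.
final class SnapshotStoreSketch {
  private final Set<Runnable> listeners = new CopyOnWriteArraySet<>();
  private final Executor executor;

  SnapshotStoreSketch(final Executor executor) {
    this.executor = executor;
  }

  CompletableFuture<Void> newSnapshot() {
    commitSnapshotOnDisk();

    // Complete the caller's future as soon as the snapshot is durably committed...
    final CompletableFuture<Void> persisted = CompletableFuture.completedFuture(null);

    // ...and notify listeners / purge old snapshots as separate tasks, so that a
    // listener failing (even with an Error such as OutOfMemoryError) cannot keep
    // the persist call from completing.
    listeners.forEach(listener -> executor.execute(listener));
    executor.execute(this::deleteOlderSnapshots);

    return persisted;
  }

  private void commitSnapshotOnDisk() { /* move + fsync the pending snapshot */ }

  private void deleteOlderSnapshots() { /* purge snapshots older than the new one */ }
}
```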

@miguelpires
Contributor

What happened:

  • We take a snapshot and try to replicate it
  • During replication, we load a snapshot chunk and get an OOM error (around 06/06 - 15:13:36)
  • It's not handled anywhere because OutOfMemoryError is an Error, not an Exception (see the snippet after this list)
  • After that, no snapshots are taken for the partition, so the log isn't compacted and, eventually, the disk becomes full
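
For context on the second-to-last point: OutOfMemoryError extends Error, so a typical catch (Exception e) around the listener invocation lets it propagate and abort the actor job. A small self-contained illustration (not Zeebe code; the error is simulated):

```java
// Demonstrates why catching Exception does not cover OutOfMemoryError.
public final class ErrorVsException {
  public static void main(final String[] args) {
    try {
      throw new OutOfMemoryError("simulated"); // stand-in for the real allocation failure
    } catch (final Exception e) {
      // Never reached: OutOfMemoryError extends Error, which is not an Exception.
      System.out.println("handled exception: " + e);
    } catch (final Throwable t) {
      // Only a Throwable (or Error) handler would see it.
      System.out.println("caught error: " + t);
    }
  }
}
```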

Stacktrace:

zeebe_broker1 | Exception in thread "Broker-1-zb-fs-workers-6" java.lang.OutOfMemoryError: Direct buffer memory
zeebe_broker1 |  at java.base/java.nio.Bits.reserveMemory(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.DirectByteBuffer.<init>(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.ByteBuffer.allocateDirect(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.Util.getTemporaryDirectBuffer(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.FileChannelImpl.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.ChannelInputStream.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.ChannelInputStream.read(Unknown Source)
zeebe_broker1 |  at java.base/sun.nio.ch.ChannelInputStream.read(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.file.Files.read(Unknown Source)
zeebe_broker1 |  at java.base/java.nio.file.Files.readAllBytes(Unknown Source)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.SnapshotChunkUtil.createSnapshotChunkFromFile(SnapshotChunkUtil.java:37)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotChunkReader.next(FileBasedSnapshotChunkReader.java:104)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotChunkReader.next(FileBasedSnapshotChunkReader.java:32)
zeebe_broker1 |  at io.camunda.zeebe.broker.system.partitions.impl.StateControllerImpl.onNewSnapshot(StateControllerImpl.java:194)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.lambda$newSnapshot$11(FileBasedSnapshotStore.java:450)
zeebe_broker1 |  at java.base/java.util.concurrent.CopyOnWriteArrayList.forEach(Unknown Source)
zeebe_broker1 |  at java.base/java.util.concurrent.CopyOnWriteArraySet.forEach(Unknown Source)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedSnapshotStore.newSnapshot(FileBasedSnapshotStore.java:450)
zeebe_broker1 |  at io.camunda.zeebe.snapshots.impl.FileBasedTransientSnapshot.lambda$persist$3(FileBasedTransientSnapshot.java:120)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorControl.lambda$call$0(ActorControl.java:136)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:62)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:39)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78)
zeebe_broker1 |  at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:191)
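
As an aside on the allocation itself: the trace shows Files.readAllBytes loading the whole chunk file in one call, with the NIO channel read behind it allocating a temporary direct buffer of matching size, which is where the allocation failed. Below is a hedged sketch of reading a file through a small fixed-size buffer instead (a hypothetical helper, not the Zeebe SnapshotChunkUtil API; it only bounds individual allocations and would not help if direct memory were already exhausted elsewhere).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a file through a fixed-size buffer so each underlying
// channel read (and the temporary direct buffer NIO allocates for it) stays small,
// instead of a single read sized to the whole file.
final class BoundedFileReader {
  private static final int BUFFER_SIZE = 64 * 1024;

  static byte[] readAllBytesBounded(final Path file) throws IOException {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    final byte[] buffer = new byte[BUFFER_SIZE];
    try (InputStream in = Files.newInputStream(file)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
    }
    return out.toByteArray();
  }
}
```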

Metrics:
(Grafana screenshot: Grafana-RES)
This wasn't the first OOM error; the first occurred at 14:52:27. Some other brokers also hit the same problem soon after.

Regarding the question of why there was an OOM error, it's hard to tell without further information. I can ask what their JVM configuration was and see if it might have been related. However, what we can do is enable -XX:+ExitOnOutOfMemoryError by default (and also inform the affected users of this option as something they can enable now). This would have made the broker crash on the first OOM instead of running in a broken state until the disk was full.
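
To make the effect of that flag concrete, here is a small stand-alone example (simulated with a heap OOM rather than the direct-buffer OOM from the stacktrace; not broker code):

```java
import java.util.ArrayList;
import java.util.List;

// Run with:  java -Xmx32m OomDemo
//   -> the worker thread dies, but the process keeps running in a degraded state.
// Run with:  java -Xmx32m -XX:+ExitOnOutOfMemoryError OomDemo
//   -> the JVM terminates on the first OutOfMemoryError instead.
public final class OomDemo {
  public static void main(final String[] args) throws InterruptedException {
    final Thread worker = new Thread(() -> {
      final List<byte[]> hoard = new ArrayList<>();
      while (true) {
        hoard.add(new byte[8 * 1024 * 1024]); // exhaust the 32m heap
      }
    }, "worker");
    worker.start();
    worker.join();

    // Without the flag we reach this line: the JVM survives the OOM, which is what
    // kept the stuck broker alive until its disk filled up.
    System.out.println("main thread still alive after worker hit OOM");
  }
}
```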

ghost pushed a commit that referenced this issue Jun 11, 2021

7251: fix(dist): enable exit on OOM java opt r=npepinpe a=MiguelPires

## Description

Enables '-XX:+ExitOnOutOfMemoryError' by default. If we get an OOM, we should let the broker crash, since it's very difficult to recover from and keeping the broker running might cause other problems (e.g., #7207)

## Related issues


related to #7207



Co-authored-by: Miguel Pires <miguel.pires@camunda.com>
@npepinpe
Member

Let's close this for now and reopen when we have reproduced it with a heap dump.
