Snapshotting process is stuck and no new snapshots are taken #7207
Labels
kind/bug
Categorizes an issue or PR as a bug
scope/broker
Marks an issue or PR to appear in the broker section of the changelog
severity/high
Marks a bug as having a noticeable impact on the user with no known workaround
Comments
Some possible improvements to handle similar cases:
ghost pushed a commit that referenced this issue on Jun 11, 2021
7251: fix(dist): enable exit on OOM java opt r=npepinpe a=MiguelPires

## Description
Enables the '-XX:+ExitOnOutOfMemoryError' by default. If we get an OOM we should let the broker crash since it's very difficult to recover from it and keeping the broker running might cause other problems (e.g., #7207)

## Related issues
related to #7207

Co-authored-by: Miguel Pires <miguel.pires@camunda.com>
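To confirm that a running JVM was actually started with the option from the commit above, something like the following sketch could be used. The class name `ExitOnOomCheck` and the helper `hasExitOnOom` are hypothetical and not part of the broker; the check only inspects the JVM's input arguments, so it may miss the flag if it was supplied some other way.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class ExitOnOomCheck {

    // Hypothetical helper: true if the flag appears verbatim in the given JVM arguments.
    static boolean hasExitOnOom(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:+ExitOnOutOfMemoryError");
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (excludes the main class and program args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("ExitOnOutOfMemoryError enabled: " + hasExitOnOom(jvmArgs));
    }
}
```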
Let's close this for now and reopen when we have reproduced it with a heap dump.
Describe the bug
A user reported that a broker's disk became full. Looking into the logs, I observed the following: a snapshot is taken, but we never see the "Created snapshot" log, which is the last step in committing a snapshot. It looks like the snapshot commit process is stuck just before that, specifically at invoking the snapshot listeners.
The following is the last snapshot-related log for partition 9. A few hours after this, the broker ran out of disk space.
I also observed similar logs on the leaders of other partitions.
It looks like the `persist` call never completes. Hence, `AsyncSnapshotDirector` does not take a new snapshot in the next snapshot interval because the previous snapshot has not completed.
To Reproduce
Not sure how to reproduce. This was observed in a long-running cluster. The brokers were never restarted, though there were leader changes in between.
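Although there is no real reproduction, the suspected condition can be modeled in isolation. The sketch below is not Zeebe code: `Director` and `onSnapshotInterval` are hypothetical stand-ins, and it assumes only what is described above, namely that a new snapshot is skipped while the previous persist call is still pending. A future that never completes then blocks every later interval.

```java
import java.util.concurrent.CompletableFuture;

public class StuckSnapshotSketch {

    // Hypothetical stand-in for a snapshot director; not the real AsyncSnapshotDirector.
    static final class Director {
        private CompletableFuture<Void> inFlight; // persist call of the snapshot being committed
        private int started;

        // Called once per snapshot interval. Returns true if a new snapshot was started.
        boolean onSnapshotInterval(CompletableFuture<Void> persistResult) {
            if (inFlight != null && !inFlight.isDone()) {
                return false; // previous snapshot has not completed: skip this interval
            }
            inFlight = persistResult;
            started++;
            return true;
        }

        int started() {
            return started;
        }
    }

    public static void main(String[] args) {
        Director director = new Director();
        // Never completed here: models a persist call that hangs.
        CompletableFuture<Void> persist = new CompletableFuture<>();

        System.out.println(director.onSnapshotInterval(persist));                   // true
        System.out.println(director.onSnapshotInterval(new CompletableFuture<>())); // false
        System.out.println(director.onSnapshotInterval(new CompletableFuture<>())); // false

        // Only once the pending persist call finishes can the next snapshot start.
        persist.complete(null);
        System.out.println(director.onSnapshotInterval(new CompletableFuture<>())); // true
    }
}
```

In this model, no snapshot is started for as long as the first future stays pending. That matches the reported symptom of no new snapshots being taken while the disk fills up.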
Expected behavior
Environment: