
Benchmark for CW32 with high backpressure, at times no throughput and frequent gateway restarts #10059

Closed
pihme opened this issue Aug 11, 2022 · 11 comments

Labels: area/performance, area/reliability, kind/bug

pihme (Contributor) commented Aug 11, 2022

The benchmark for CW32 shows severely degraded performance: http://34.77.165.228/d/NzsO1mUnk/zeebe-overview?orgId=1&var-DS_PROMETHEUS=Prometheus&var-namespace=medic-cw-32-be18e23b78-benchmark&var-pod=All&var-partition=All&from=1660132800000&to=1660212000000

Throughput is very low: [screenshot]

Frequent restarts of the gateway: [screenshot]

Backpressure is high: [screenshot]

Processing shows a cliff edge: [screenshot]

Snapshots are growing after the cliff edge: [screenshot]

pihme added the kind/bug label on Aug 11, 2022
pihme (Contributor, Author) commented Aug 11, 2022

@deepthidevaki mentioned that this looks similar to #9862.

pihme (Contributor, Author) commented Aug 11, 2022

The terminated gateway nodes were restarted due to imminent node shutdown, at least the ones I could look at.

pihme (Contributor, Author) commented Aug 11, 2022

We see a high frequency of this bug: #10014

This is expected, because the bug was not yet fixed in the commit used for the benchmark.

pihme (Contributor, Author) commented Aug 11, 2022

[screenshot]

pihme (Contributor, Author) commented Aug 11, 2022

Current working theory: concurrent access to the response writers (#10014) triggers a cascading failure that ends in an OOM in the gateway.

Things not explained:

  • Why did it suddenly start after running relatively smoothly for two days?
  • Why does it not recover, even with low throughput?
  • What exactly does the cascading chain from concurrent access to the response writers to an OOM in the gateway look like? (A sketch follows this list.)
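
To make the last question concrete, here is a minimal sketch of the suspected failure mode, assuming only that a non-thread-safe response writer ends up shared between threads. The ResponseWriterRace class and its pending buffer are hypothetical stand-ins for illustration, not Zeebe APIs:

```java
import java.util.ArrayList;
import java.util.List;

public class ResponseWriterRace {

  // Hypothetical stand-in for a gateway response writer; ArrayList is
  // not thread-safe, so concurrent write() calls are undefined behavior.
  static class ResponseWriter {
    final List<byte[]> pending = new ArrayList<>();

    void write(byte[] response) {
      pending.add(response); // racy when called from multiple threads
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ResponseWriter writer = new ResponseWriter();

    // Two threads writing through the same writer instance, as two
    // request handlers might if one writer is shared by mistake.
    Runnable task = () -> {
      for (int i = 0; i < 100_000; i++) {
        writer.write(new byte[1024]);
      }
    };
    Thread t1 = new Thread(task);
    Thread t2 = new Thread(task);
    t1.start();
    t2.start();
    t1.join();
    t2.join();

    // Under the race, add() can throw or corrupt the list; if buffered
    // responses are then never flushed, retained memory grows until the
    // gateway dies with an OutOfMemoryError.
    System.out.println("pending responses: " + writer.pending.size());
  }
}
```

How a single race escalates to an OOM in practice is exactly the open question above; the sketch only shows why shared, unsynchronized writers are a plausible starting point.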

pihme (Contributor, Author) commented Aug 11, 2022

Had a chat with Simon. We found no explanation for why gateways were restarted more frequently after 4:00 PM on August 10th, or why this behavior stopped around 9:30 AM on August 11th.

pihme added the area/performance and area/reliability labels on Aug 11, 2022
pihme (Contributor, Author) commented Aug 12, 2022

@Zelldon also mentioned #7095 as possibly related.

Zelldon (Member) commented Aug 15, 2022

Just want to mention that it now seems that two nodes have a dead partition one.

korthout (Member) commented

> Just want to mention that it now seems that two nodes have a dead partition one.

@Zelldon There are a few errors reported about this. It appears to be unrelated to the above.

menski (Contributor) commented Aug 19, 2022

We assume a bug fix has resolved this.

@oleschoenburg Could you please link the corresponding issue/PR and delete the benchmark?

lenaschoenburg (Member) commented

The issues were caused by #10014, which is fixed. I'll delete the benchmark.
