
Benchmark for CW32 with high backpressure, at times no throughput and frequent gateway restarts #10059

Closed
pihme opened this issue Aug 11, 2022 · 11 comments

Labels: area/performance, area/reliability, kind/bug

pihme (Contributor) commented Aug 11, 2022

The benchmark for CW32 shows severely degraded performance: http://34.77.165.228/d/NzsO1mUnk/zeebe-overview?orgId=1&var-DS_PROMETHEUS=Prometheus&var-namespace=medic-cw-32-be18e23b78-benchmark&var-pod=All&var-partition=All&from=1660132800000&to=1660212000000

Throughput is very low: [screenshot]

Frequent restarts of the gateway: [screenshot]

Backpressure is high: [screenshot]

Processing shows a cliff edge: [screenshot]

Snapshots are growing after the cliff edge: [screenshot]

pihme added the kind/bug label on Aug 11, 2022
pihme (Contributor, Author) commented Aug 11, 2022

@deepthidevaki mentioned that this looks similar to #9862.

pihme (Contributor, Author) commented Aug 11, 2022

The terminated gateway nodes were restarted due to imminent node shutdown, at least the ones I could look at.

pihme (Contributor, Author) commented Aug 11, 2022

We see a high frequency of this bug: #10014

This is expected, because the bug was not yet fixed in the commit used for the benchmark.

pihme (Contributor, Author) commented Aug 11, 2022

[screenshot]

pihme (Contributor, Author) commented Aug 11, 2022

Current working theory: concurrent access to the response writers (#10014) triggers a cascading failure that ends in an OOM in the gateway.

Things not explained:

  • Why did it suddenly start after running relatively smoothly for two days?
  • Why does it not recover, even with low throughput?
  • What exactly does the cascading chain from concurrent access to the response writers to an OOM in the gateway look like? (A sketch follows this list.)
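
To make the last question concrete, here is a minimal sketch of the suspected failure mode, assuming only that a non-thread-safe response writer ends up shared between threads. The ResponseWriterRace class and its pending buffer are hypothetical stand-ins for illustration, not Zeebe APIs:

```java
import java.util.ArrayList;
import java.util.List;

public class ResponseWriterRace {

  // Hypothetical stand-in for a gateway response writer; ArrayList is
  // not thread-safe, so concurrent write() calls are undefined behavior.
  static class ResponseWriter {
    final List<byte[]> pending = new ArrayList<>();

    void write(byte[] response) {
      pending.add(response); // racy when called from multiple threads
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ResponseWriter writer = new ResponseWriter();

    // Two threads writing through the same writer instance, as two
    // request handlers might if one writer is shared by mistake.
    Runnable task = () -> {
      for (int i = 0; i < 100_000; i++) {
        writer.write(new byte[1024]);
      }
    };
    Thread t1 = new Thread(task);
    Thread t2 = new Thread(task);
    t1.start();
    t2.start();
    t1.join();
    t2.join();

    // Under the race, add() can throw or corrupt the list; if buffered
    // responses are then never flushed, retained memory grows until the
    // gateway dies with an OutOfMemoryError.
    System.out.println("pending responses: " + writer.pending.size());
  }
}
```

How a single race escalates to an OOM in practice is exactly the open question above; the sketch only shows why shared, unsynchronized writers are a plausible starting point.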

pihme (Contributor, Author) commented Aug 11, 2022

Had a chat with Simon. We found no explanation for why gateways were restarted more frequently after 4:00 PM on August 10th, or why this behavior stopped around 9:30 AM on August 11th.

pihme added the area/performance and area/reliability labels on Aug 11, 2022
pihme (Contributor, Author) commented Aug 12, 2022

@Zelldon also mentioned #7095 as possibly related.

Zelldon (Member) commented Aug 15, 2022

Just want to mention that it now seems that two nodes have a dead partition one.

korthout (Member) commented

> Just want to mention that it now seems that two nodes have a dead partition one.

@Zelldon There are a few errors reported about this. It appears to be unrelated to the above.

menski (Contributor) commented Aug 19, 2022

We assume a bug fix has resolved this.

@oleschoenburg Could you please link the corresponding issue/PR and delete the benchmark?

lenaschoenburg (Member) commented

The issues were caused by #10014, which is fixed. I'll delete the benchmark.
