Multiple OOM encountered on benchmark cluster #8509
Comments
@romansmirnov @deepthidevaki - could this be related to #7992?
Happened again 2022-01-04 ~07:58:05 for zeebe-1, see https://cloudlogging.app.goo.gl/uYTT7vjcAUdKyzb9A for the moment of restart. EDIT: Discussed shortly with @Zelldon and he mentioned that this happens for install requests (i.e. receiving a snapshot). So these logs indicate that the followers receive new snapshots every 5 minutes. That's not optimal, but also not wrong.
Please note that this transitioning happens as well on our long-running benchmarks. This indicates that between 1.2.x and 1.3.0-alpha1 something was introduced that makes followers lag behind. Note that the reduction of benchmark resources was done afterwards (@npepinpe). EDIT: Deeper investigation shows that older versions also send install requests at a similar rate, but they don't use the same transition logic and so don't log this.
Not sure. Is a heap dump available?
No heap dumps were created.
Was it then out of direct memory? Then it might be a different problem.
IIRC it was a Kubernetes OOM, so it's not the JVM that crashed but the scheduler which killed the container.
These OOMs also occur on benchmarks.
If anyone is against deleting the
Benchmark
There is a memory leak! When the broker transitions between roles, it stops the running services and starts new ones (depending on the old and new role). This includes stopping and starting the dispatcher.
However, when the dispatcher is dereferenced, its allocated direct buffer is not released. Basically, role changes happen whenever there is a role change in the Raft layer, or when a new snapshot must be installed. This results in increasing allocated direct memory, which is also visible in the metrics. The growing allocated direct memory correlates with the number of installed snapshots on a broker. Unfortunately, I was not able to reproduce the OOM, but I am quite confident that releasing the allocated direct memory is one part of this issue. Meaning, if there are a lot of role changes, the set of unreachable direct buffers grows, but they are not garbage collected, and so the direct memory is not freed. In this example, there are 10 direct buffers but only 3 would be expected. When the allocated memory is released explicitly, the allocated direct memory stays at a constant size (~400MB), and the number of DirectBuffers in the heap dump is constant as well.
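For illustration, here is a minimal sketch of the kind of eager release described above. This is not Zeebe's actual Dispatcher code; the class name and the use of Agrona's BufferUtil.free are assumptions.

```java
import java.nio.ByteBuffer;
import org.agrona.BufferUtil;

// Hypothetical sketch: a component that owns a direct buffer and frees it
// eagerly on close, instead of waiting for GC to collect the unreachable
// DirectByteBuffer and run its cleaner.
final class EagerlyFreedBuffer implements AutoCloseable {

  private ByteBuffer buffer;

  EagerlyFreedBuffer(final int capacity) {
    // allocated off-heap; counts against -XX:MaxDirectMemorySize
    buffer = ByteBuffer.allocateDirect(capacity);
  }

  ByteBuffer buffer() {
    if (buffer == null) {
      throw new IllegalStateException("buffer was already released");
    }
    return buffer;
  }

  @Override
  public void close() {
    if (buffer != null) {
      // release the native memory now; merely dropping the reference keeps
      // the memory allocated until a GC actually collects the buffer object
      BufferUtil.free(buffer);
      buffer = null;
    }
  }
}
```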
Just for reference, in the heap dump it is possible to execute an OQL (Object Query Language) query. The following query helps to troubleshoot memory leaks in the direct memory:
It shows all
Related, interesting bit from Netty's documentation about their own ByteBuf allocator:
Do you think it might make more sense to reuse the dispatcher and just reset/clear it on transitions?
Possibly - I don't know how resilient our dispatcher currently is, but if we assume we just reset in-memory properties and zero the buffer, then that's probably still faster and less memory-intensive than freeing the buffer, allocating a new one, and zeroing it (which the JVM does). At the same time, we know the pitfalls that come with reusing resources, so we'd have to make sure the reset/clear works correctly 😄
@npepinpe, thanks for your input. I also read Netty's comment about zeroing and dug a bit into it. But I don't see this as an issue in our case, because when doing the transitions between roles, Zeebe is not on the critical path (or data path). Of course, the transitions should happen quickly and not take ages, especially when Zeebe transitions to the leader role. That's why I would like to keep the scope on solving the "memory leak" in the direct memory by releasing the direct memory when closing the dispatcher (and keeping the performance topic out of scope for now). That way, a Zeebe broker node can "survive" many role transitions in a short timeframe, for example caused by frequent snapshot installations.
When releasing the direct memory, there is one issue that arises: other components (like the Stream Processor or Command Request Handler) should not try to write to (or read from) the direct memory; otherwise, they will try to access an illegal address and the JVM crashes. My current approach would be to ensure that all relevant components are closed/notified about the closing of the dispatcher before the actual close happens. Alternatively, the dispatcher is only opened when transitioning to the leader role. @npepinpe, please let me know if you want to discuss this issue.
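A minimal sketch of that first approach, closing or notifying all dependent components before the dispatcher frees its buffer. The names and types here are hypothetical, not Zeebe's actual classes.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the "close dependents first" ordering: every component
// that reads from or writes to the dispatcher registers itself, and the
// dispatcher only frees its direct buffer after all of them have closed.
final class GuardedDispatcherCloser {

  interface DispatcherUser extends AutoCloseable {
    @Override
    void close(); // must stop all reads/writes before returning
  }

  private final List<DispatcherUser> users = new CopyOnWriteArrayList<>();
  private final Runnable freeDirectBuffer;

  GuardedDispatcherCloser(final Runnable freeDirectBuffer) {
    this.freeDirectBuffer = freeDirectBuffer;
  }

  void register(final DispatcherUser user) {
    users.add(user);
  }

  void close() {
    // 1. stop every reader/writer so nobody touches the buffer anymore
    users.forEach(DispatcherUser::close);
    users.clear();
    // 2. only now release the direct memory; accessing it afterwards would
    //    hit an unmapped address and crash the JVM (SIGSEGV/SIGBUS)
    freeDirectBuffer.run();
  }
}
```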
That's fine, in my opinion we're still at a stage where correctness trumps performance most of the time. And as you mentioned, I doubt the performance gain/loss is noticeable anyway.

Regarding the second point, since there's no way to recover from or handle a SIGSEGV or SIGBUS, I'd like to have the strongest possible guarantees that we don't try to read from/write to freed memory. Are we confident that we can guarantee all components are closed before freeing the buffer? Can we offer stronger guarantees than that? Possibly not, but it doesn't hurt to spend a bit of time exploring our options, because ensuring all relevant components are closed/notified is hard to do, and especially hard to future-proof, in general (although maybe I misunderstood your proposal). OTOH, could we potentially delegate to the dispatcher the task of writing to the buffer? E.g. claim a segment of memory, then pass a callback that does the writing.

💭 At the same time, a SIGSEGV or SIGBUS on the dispatcher will not cause any permanent issues (e.g. data corruption/loss), so I suppose it's not the worst thing that can happen, as compared to reusing the same dispatcher and potentially writing the wrong things and causing corruption.

If you want, we can discuss this or brainstorm a solution - I'm free tomorrow afternoon. If that's blocking you, you can always grab someone else from the team, like Ole or Chris.
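To make that suggestion concrete, here is one possible reading of the "claim a segment, then pass a callback" idea. This is a rough sketch with hypothetical names and signatures, not Zeebe's actual Dispatcher API.

```java
import java.nio.ByteBuffer;
import java.util.function.ObjIntConsumer;
import org.agrona.MutableDirectBuffer;
import org.agrona.concurrent.UnsafeBuffer;

// Hypothetical API sketch: callers never hold a reference to the underlying
// buffer, so the dispatcher can check its own closed flag before every write
// and can free the backing memory safely once it is closed.
final class DelegatingDispatcher {

  private final UnsafeBuffer buffer = new UnsafeBuffer(ByteBuffer.allocateDirect(1024));
  private boolean closed;
  private int writeOffset;

  // Claims `length` bytes and invokes the writer with the buffer and the
  // claimed offset; returns false if the dispatcher is closed or full.
  synchronized boolean claimAndWrite(final int length, final ObjIntConsumer<MutableDirectBuffer> writer) {
    if (closed || writeOffset + length > buffer.capacity()) {
      return false; // caller handles back pressure or a closed dispatcher
    }
    writer.accept(buffer, writeOffset); // the write happens "inside" the dispatcher
    writeOffset += length;
    return true;
  }

  synchronized void close() {
    closed = true;
    // once close() has returned, no caller can reach the buffer anymore,
    // so its direct memory could be freed here without risking a SIGSEGV
  }
}
```

A caller would then write via something like `dispatcher.claimAndWrite(payload.length, (buf, offset) -> buf.putBytes(offset, payload));`, never touching the buffer directly.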
Just a quick summary: there are two different types of "components" that write to the dispatcher, and there is only one "component" that reads from the dispatcher. Who reads from the dispatcher?
Who writes to the dispatcher?
When a transition to another role is initiated on the broker layer (triggered by a Raft role change, when installing a snapshot, etc.), then a
Only when the transition has succeeded are all the other "write components" notified by calling the
@npepinpe We are still getting OOMs in current benchmarks. Is this something we want to work on for the 1.4.0 release?
Could you elaborate on this? Are we getting K8s OOMs or Java OOMs?
I'm not quite sure, but for example here: http://34.77.165.228/d/I4lo7_EZk/zeebe?viewPanel=33&orgId=1&from=1647282747298&to=1647314979422&var-DS_PROMETHEUS=Prometheus&var-namespace=medic-cw-08-12c4ea63e6-benchmark&var-pod=All&var-partition=All it looks like a K8s OOM, while in this error the JVM itself is running out of memory: https://console.cloud.google.com/errors/detail/CKSrwcbo6qihLg;service=;version=;filter=%5B%22OutOfMemory%22%5D?project=zeebe-io
The second one could be the previous bug @romansmirnov was working on, i.e. dispatchers aren't freed (or not in a timely fashion), which results in us running out of direct memory.
I think freeing the dispatcher's buffer eagerly was just part of the issue, i.e. it would help handle multiple consecutive transitions, which might cause a burst of allocated dispatchers that aren't freed immediately. I'm not sure it was the main cause. That said, I don't think there's any harm in doing it if we can guarantee it's safe to do. We would have to look into why we didn't merge this PR - #8632.
Re-reading the issue, we can scope it to just ensuring we're freeing the dispatcher's byte buffer eagerly, to avoid bursts of transitions causing too much memory to be allocated. We will tackle ensuring resources are closed with the upcoming KR separately.
In our benchmarks, the frequent role transitions happen in followers. When a follower receives a snapshot, it closes its current follower services and installs new ones, so the transitions are follower -> follower. Other transitions are usually triggered by restarts. Leader -> follower rarely happens (usually when the leader is restarted), and follower -> leader happens mostly only once during the lifetime of a pod. Frequent leader -> follower -> leader transitions happen when there are network partitions, which is not very common in our benchmarks or in a production setup. One cause for having so many dispatcher buffers is that the follower StreamProcessor also creates a logstream writer, which opens the dispatcher. In the follower role, the StreamProcessor never writes to the logstream, so there is no need to create a writer and, as a result, no need to open a dispatcher. If we fix the StreamProcessor in the follower role to not open the dispatcher, this would prevent the case where a lot of dispatcher buffers are open. This is only a partial fix, as we are not fixing the root cause of freeing the buffer. But it would be easy to implement and would prevent the most common case that we observe in our benchmarks as well as in a production setup.
Did you verify that?
Because when we implemented replay on followers we implemented a noop writer, so I would expect that we have no real writer.
@Zelldon We still create the writer in StreamProcessor, even if we are not using it.
Nice catch @deepthidevaki 🕵️‍♀️
I tested a quick fix for not creating the writer in the follower StreamProcessor. Here are the observations from the benchmark.
With the fix: zeebe-1 is follower for all partitions and is frequently receiving InstallRequests, triggering role transitions.
Base version (main branch): zeebe-2 is follower for all partitions and is frequently receiving InstallRequests, triggering role transitions.
Direct memory usage with the fix is much lower compared to the main branch.
👍 But still, there is something else that is increasing the memory 🤔 Did I understand that right?
You mean the increase in process memory? I think it can be attributed to two things: 1. RocksDB memory, 2. mapped byte buffers. We can check if the OOM occurs again after this fix.
9367: Do not open dispatchers in follower role r=deepthidevaki a=deepthidevaki

## Description

In the follower role, the StreamProcessor runs only in replay mode. When the writer is created, the dispatcher is also opened, which allocates a direct buffer. This is unnecessary as the writer is never used. The allocated buffer consumes memory and can create memory pressure on the system. To fix this, we create the writer only after replay is completed.

## Related issues

related #8509

Co-authored-by: Deepthi Devaki Akkoorath <deepthidevaki@gmail.com>
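As a rough illustration of the approach described in the PR, the writer (and with it the dispatcher and its direct buffer) is only created once replay is done, so a pure follower never allocates it. Class and method names below are hypothetical, not the actual StreamProcessor code.

```java
// Hypothetical sketch of lazy writer creation after replay, assuming a
// simplified LogStream interface; these are not the real Zeebe classes.
final class ReplayThenProcessSketch {

  interface LogStreamWriter {}

  interface LogStream {
    LogStreamWriter newLogStreamWriter(); // opens the dispatcher / direct buffer
  }

  private final LogStream logStream;
  private LogStreamWriter writer; // stays null while in replay mode

  ReplayThenProcessSketch(final LogStream logStream) {
    this.logStream = logStream;
  }

  // Called only when the StreamProcessor leaves replay mode, i.e. on the
  // leader; followers keep replaying and never allocate a writer.
  void onReplayCompleted() {
    if (writer == null) {
      writer = logStream.newLogStreamWriter();
    }
  }
}
```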
This is not a complete fix. The main cause, that DirectBuffers are not freed immediately, is still there. There is no real memory leak from the Dispatchers, as far as I know, so the DirectBuffers will eventually be freed anyway. The above fix just reduces the need to aggressively free DirectBuffers. Then the question is: should I look into how to free the dispatcher buffers?
I think we should free direct memory as soon as possible, considering there's no easy way to recover when we run out of it other than crashing (correct me if I'm wrong). I think the main worry before was: how do we guarantee we won't access a freed buffer? It seems like this doesn't crash as I expected, though I'm not sure if that is always guaranteed or not 🤷
@deepthidevaki but the RocksDB memory usage is not part of it, right? We had this issue before, right?
@Zelldon I suspect it is either RocksDB and/or mapped buffers for the journal files. We should look into it. The RocksDB metrics don't show high memory usage, but from my previous experience, RocksDB uses much more memory than it reports.
As the OOM occurred on a long-running cluster after 2 weeks of constant load, after which the cluster recovered quickly, we decided to postpone working on this for now. I would personally propose to close this, as by the time we look into it again it most likely will have changed quite a bit. Happy to be challenged on this though, let me know 👍
Describe the bug
The benchmark cluster for branch release-1.3.0 experienced multiple Out Of Memory (OOM) errors. This is a potential regression, although it is likely that this issue has existed for longer. Note that the resources for the benchmark project were reduced recently, see #8268.
Occurrences
zeebe-2 @ 2021-12-27 ~11:21:45
Only a small dip in processing throughput

GC briefly spiked and then dropped

Simultaneously, JVM memory usage increased from a max of ~200MB to spikes above 500MB, and direct buffer pool memory usage doubled in this short window from ~400MB to ~860MB.

During this time, RocksDB memory usage was similar to before, at ~500MB per partition.

Install requests were frequently sent 🤔

It had just transitioned to INACTIVE and closed the database when it started to transition to FOLLOWER.
Soon after it opened the database, it stopped.
zeebe-2 @ 2021-12-28 ~09:19:45, followed by zeebe-1 @ 2021-12-28 ~09:25:15
Just before the OOM, the starter and worker restarted, which might explain the loss of processing throughput.

Zeebe-2 restarted at ~09:19:45, so the OOM should've happened just before that.

Zeebe 2

If we filter on that pod alone, we see that it was actually briefly processing as leader just before the OOM.
GC is much quieter here before the OOM. JVM memory usage is about 600MB, and direct buffer pool memory has just increased to ~860MB again (just like before). RocksDB is still stable at 500MB per partition; no screenshot added.

Zeebe 2 did not produce any interesting logs, as far as I could tell.
Zeebe 1

Zeebe-1 also does some processing as leader shortly before its OOM, ~5 min after zeebe-2 crashed.
Zeebe-1 looks a lot like zeebe-2 when we look at the memory decomposition. Note the increase in direct buffer pool memory just before the OOM, like in the other cases.

Partitions fully recovered, but about 1m30s after a snapshot was committed, an actor appears blocked. This means that the health tick is no longer updated. Directly after this, the pod dies.
zeebe-2 @ 2021-12-28 ~22:50:00
Again only a small dip in processing throughput (nice and quick failover 🚀 )
Zeebe-2 was leader and processing before OOM

Interestingly, the logs just before the restart of zeebe-2 at this time are practically identical to the logs of zeebe-2 on the first OOM (the day before, on the 27th).
Zeebe-2 had just transitioned to INACTIVE and closed the database. It was transitioning to FOLLOWER again, and just after it opened the database it was transitioning the StreamProcessor, which is the same transition it OOMed at the day before.
If you look at the logs from before that time, it keeps transitioning between follower and inactive (in both directions) for a long period (at least multiple hours). It's in a loop:
This also happened the day before: https://cloudlogging.app.goo.gl/7qpb4Rammh11eqYh6
Hypothesis
Looking at the above cases, it seems that a partition gets stuck in a transition loop between FOLLOWER and INACTIVE. Perhaps we have a memory leak in transitions.