Very high memory usage on v2.3.3 - is this configurable? #529
Interesting! Is your bazel-remote configured with storage_mode zstd or uncompressed? It seems you access the cache via gRPC and not HTTP. Can you confirm?
We're using zstd, and that's correct - we access the cache via gRPC.
Are you using bazel 5.0's new remote cache compression support?
We are not currently using that. That does seem valuable though, and we'll definitely explore it. We don't suspect that would decrease the memory footprint on the cache side, however.

Sidenote: The bazel flag tweaks we tried yesterday evening helped, but they were insufficient. We experienced another OOM today (and were close to the wire a few other times). We're currently trying to horizontally scale the cache (based on this comment) as further mitigation.
bazel-remote should use less memory if it's using zstd compressed storage and the clients are downloading zstd-compressed data (bazel-remote can just write compressed data from disk instead of compressing it on each request).

Another experiment you could try is to run bazel-remote with the uncompressed storage mode. That would exclude zstd compression/decompression from the setup. If you still see OOMs then we would know to focus elsewhere.

Saving some pprof data while memory usage is high might also be helpful.

IIRC bazel-remote 1.1.0 was built with go 1.14.2, and go 1.16 switched from using MADV_FREE to MADV_DONTNEED, which might be related.
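For reference, a Go service typically exposes pprof data via net/http/pprof; the snippet below is a generic illustration only (the address and setup are examples, not bazel-remote's actual profiling configuration):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiling endpoints on a local port (address is arbitrary here).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// While memory usage is high, a heap profile can then be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// or, to look at cumulative allocations instead of live objects:
	//   go tool pprof -sample_index=alloc_space http://localhost:6060/debug/pprof/heap

	select {} // stand-in for the real server's work
}
```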
There are some notes on the GODEBUG environment variable here; it's a comma separated list of settings. One of the settings is `madvdontneed=1`, which makes the runtime use MADV_DONTNEED instead of MADV_FREE when returning memory to the OS.

It might also be worth setting GOGC to a lower value so the garbage collector runs more often; you can play with that via the `GOGC` environment variable.

Re the discrepancy between the memory profile's view of memory usage and the system's, there are so many different ways to count memory usage that I think the first step is to try to understand what each tool is measuring. Is that a screenshot from top? Is it running inside docker, or outside?
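To illustrate why the tools can disagree, a small standalone Go program (unrelated to bazel-remote) can print the runtime's own counters, each of which answers a different question than top's RSS column:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// Live heap, roughly what a pprof heap profile (inuse_space) reports.
	fmt.Printf("HeapAlloc:    %d MiB\n", m.HeapAlloc>>20)
	// Heap spans in use vs. idle; idle spans can still count toward RSS.
	fmt.Printf("HeapInuse:    %d MiB\n", m.HeapInuse>>20)
	fmt.Printf("HeapIdle:     %d MiB\n", m.HeapIdle>>20)
	// Idle memory already returned to the OS. With MADV_FREE the kernel
	// reclaims it lazily, so top may still attribute it to the process.
	fmt.Printf("HeapReleased: %d MiB\n", m.HeapReleased>>20)
	// Total virtual memory the Go runtime has obtained from the OS.
	fmt.Printf("Sys:          %d MiB\n", m.Sys>>20)
}
```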
The screenshot was indeed from top.

Thanks for those links :)
Had the same problem: when the disk cache size passed 1 TB it would OOM on a 64 GB memory server. Setting GOGC=20 solved the problem.
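For context, this is a sketch of what GOGC=20 does (the programmatic equivalent, not anything bazel-remote itself necessarily calls):

```go
package main

import "runtime/debug"

func main() {
	// Programmatic equivalent of running the process with GOGC=20: the next GC
	// is triggered once the heap grows 20% beyond the live data left by the
	// previous cycle, instead of the default 100%. Peak heap stays much closer
	// to the live set, at the cost of more frequent collections (more CPU).
	debug.SetGCPercent(20)

	// ... rest of the program ...
}
```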
v2.3.9 has a new option which might help with this.
Hello, I can reproduce unusually high memory usage under a very specific configuration.
The test is performed with a large build (~40 GB of artefacts), from a single client on the same LAN.
Turning off compression avoids the problem.
Bazel remote version is 2.4.3 on all servers. |
@liam-baker-sm: Thanks for the report. Which storage mode is bazel-remote using in this scenario? In the ideal setup, with bazel-remote storing zstd compressed blobs, and bazel requesting zstd blobs, they should be able to be streamed directly from the filesystem without recompression. |
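To illustrate the point, here's a rough Go sketch of the two serving paths (this is not bazel-remote's actual code, and the Accept-Encoding check is a simplification of how the HTTP and gRPC cache protocols really negotiate compression):

```go
package main

import (
	"io"
	"net/http"
	"os"

	"github.com/klauspost/compress/zstd"
)

// serveBlob sketches the two paths: if the client accepts zstd, the blob can be
// copied from disk as-is; otherwise it has to be decompressed per request,
// which costs CPU and buffer memory on the cache server.
func serveBlob(w http.ResponseWriter, r *http.Request, path string) error {
	f, err := os.Open(path) // blob stored zstd-compressed on disk
	if err != nil {
		return err
	}
	defer f.Close()

	if r.Header.Get("Accept-Encoding") == "zstd" {
		// Fast path: stream the compressed bytes straight from the filesystem.
		w.Header().Set("Content-Encoding", "zstd")
		_, err = io.Copy(w, f)
		return err
	}

	// Slow path: decompress on the fly for clients that want plain bytes.
	dec, err := zstd.NewReader(f)
	if err != nil {
		return err
	}
	defer dec.Close()
	_, err = io.Copy(w, dec)
	return err
}

func main() {} // placeholder so the sketch is a complete file
```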
We are now using GOMEMLIMIT, which is available in newer versions of Go. This solves the problem of a "transient spike in the live heap size".
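For anyone else landing here, this is a sketch of the programmatic equivalent (the 48 GiB figure is just an example; size the limit below the host or container allowance):

```go
package main

import "runtime/debug"

func main() {
	// Programmatic equivalent of GOMEMLIMIT=48GiB (Go 1.19+): a soft limit on
	// the total memory managed by the runtime. As usage approaches the limit
	// the GC runs more aggressively, which absorbs transient spikes in the
	// live heap without having to tune GOGC for the worst case.
	debug.SetMemoryLimit(48 << 30) // bytes; 48 GiB is an arbitrary example value

	// ... rest of the program ...
}
```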
I added a similar suggestion to the systemd configuration example recently: 2bcc2f5 |
We're experiencing severe memory issues w/ the cache since upgrading to v2.3.3 (from v1.1.0). These were asymptomatic for most of January and February, but started causing frequent cache OOMs following our upgrade to Bazel 5 at the beginning of last week. The memory footprint was already significantly higher prior to the Bazel 5 upgrade, however.
Prior to the bazel-remote cache upgrade (which took place 01/01/2022), memory usage was minimal. Following the upgrade, the cache process regularly uses up all the memory on the host (~92g), resulting in the OOM killer killing the cache.
We noticed that the used file handles count markedly dropped following the cache upgrade as well, which leads us to believe some actions that previously relied on heavy disk usage now occur in-memory.
--
A very large chunk of the memory usage occurs during cache startup. For example, following a crash at 2:10pm, the cache was holding 70g of memory by 2:29pm, which is when the cache finally started serving requests. You can see the memory usage trend for that OOM/restart (and two prior ones) on this screenshot:
The cache logs show that most of the memory surge occurs during the "Removing incomplete file" steps, and that a second surge occurs as the LRU index is built.
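For illustration, here is a rough sketch (not bazel-remote's actual implementation) of the kind of startup scan described above; the point is that this phase scales with the number of blobs on disk:

```go
package main

import (
	"container/list"
	"io/fs"
	"path/filepath"
	"strings"
)

type entry struct {
	key  string
	size int64
}

// buildIndex walks the cache directory, skipping (in a real server: deleting)
// leftover temp files from interrupted uploads, and records one LRU entry per
// completed blob. Memory and time for this phase grow with the number of blobs
// on disk, which is why startup on a large cache is so expensive.
func buildIndex(dir string) (*list.List, int64, error) {
	lru := list.New() // a real index would also keep a map[string]*list.Element
	var total int64
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		if strings.HasSuffix(path, ".tmp") { // hypothetical temp-file suffix
			return nil // the "Removing incomplete file" step would delete it here
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		lru.PushBack(entry{key: path, size: info.Size()})
		total += info.Size()
		return nil
	})
	return lru, total, err
}

func main() {
	_, _, _ = buildIndex("/path/to/cache") // hypothetical cache directory
}
```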
Attempted Mitigations:
We attempted restricting the memory allowance for the Docker container via the `-m` docker flag in hopes of at least keeping the process from OOMing, but this did not suffice - the service became unresponsive.

Given that the memory issues became much worse following the Bazel 5 upgrade, we tweaked these Bazel flags:

- the `--experimental_remote_cache_async` flag
- `--remote_max_connections=10` (we previously had it set to 0, which means no limit, but this didn't affect gRPC connections prior to Bazel 5)

Even if these help (we'll find out as tomorrow's workday picks up), we'll still be very close to running out of memory (as we were through February, before the Bazel 5 upgrade).
Is there some way to configure how much memory the bazel-remote process utilizes?