Very high memory usage on v2.3.3 - is this configurable? #529

Open
SrodriguezO opened this issue Mar 8, 2022 · 17 comments


SrodriguezO commented Mar 8, 2022

We're experiencing severe memory issues with the cache since upgrading to v2.3.3 (from v1.1.0). These were asymptomatic for most of January and February, but started causing frequent cache OOMs following our upgrade to Bazel 5 at the beginning of last week. The memory footprint was already significantly higher prior to the Bazel 5 upgrade, however.

Prior to the bazel-remote cache upgrade (which took place 01/01/2022), memory usage was minimal. Following the upgrade, the cache process regularly uses up all the memory on the host (~92g), resulting in the OOM killer killing the cache.

[screenshot: usable_mem]

We noticed that the used file handles count markedly dropped following the cache upgrade as well, which leads us to believe some actions that previously relied on heavy disk usage now occur in-memory.

[screenshot: remote_file_handles]

--

A very large chunk of the memory usage occurs during cache startup. For example, following a crash at 2:10pm, the cache was holding 70g of memory by 2:29pm, which is when the cache finally started serving requests. You can see the memory usage trend for that OOM/restart (and two prior ones) on this screenshot:

[screenshot: remote_mem_used_2]

The cache logs show

<~21:10:00 process starts - logs are truncated, so the exact timestamp is missing, but our service wrapper simply launches this docker container>
…
… <tons of "Removing incomplete file" logs>
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Sorting cache files by atime.
2022/03/07 21:26:26 Building LRU index.
2022/03/07 21:29:41 Finished loading disk cache files.
2022/03/07 21:29:41 Loaded 54823473 existing disk cache items.
2022/03/07 21:29:41 Mangling non-empty instance names with AC keys: disabled
2022/03/07 21:29:41 gRPC AC dependency checks: enabled
2022/03/07 21:29:41 experimental gRPC remote asset API: disabled
2022/03/07 21:29:41 Starting gRPC server on address :8081
2022/03/07 21:29:41 Starting HTTP server on address :8080
2022/03/07 21:29:41 HTTP AC validation: enabled
2022/03/07 21:29:41 Starting HTTP server for profiling on address :8082
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
…

Most of the memory surge occurs during the "Removing incomplete file" steps, and a second surge occurs as the LRU index is built.

Attempted Mitigations:
We attempted restricting the memory allowance for the Docker container via the -m docker flag in hopes of at least keeping the process from OOMing, but this did not suffice - the service became unresponsive.
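For context, the limit was applied roughly like this (the image tag matches the container we run, but the limit value shown is illustrative rather than the exact figure we used):

  docker run -m 80g <other flags> buchgr/bazel-remote-cache:v2.3.3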

Given that the memory issues became much worse following the Bazel 5 upgrade, we tweaked these Bazel flags (a sketch of the resulting flags follows the list):

  • We unset the --experimental_remote_cache_async flag
  • We set --remote_max_connections=10 (we previously had it set to 0, which means no limit, but this didn't affect gRPC connections prior to Bazel 5).
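For clarity, the client-side change amounts to something like this .bazelrc fragment (the cache address is a placeholder):

  # .bazelrc (sketch)
  build --remote_cache=grpc://bazel-cache.example.com:8081
  build --remote_max_connections=10
  # --experimental_remote_cache_async is no longer set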

Even if these help (we'll find out as tomorrow's workday picks up), we'll still be very close to running out of memory (as we were through February, before the Bazel 5 upgrade).

Is there some way to configure how much memory the bazel-remote process utilizes?


ulrfa commented Mar 8, 2022

Interesting!

Is your bazel-remote configured with storage_mode zstd or uncompressed?

It seems you access the cache via gRPC and not HTTP. Can you confirm?

@SrodriguezO (Author)

We're using zstd, and that's correct - we access the cache via gRPC.


mostynb commented Mar 8, 2022

Are you using bazel 5.0's new --experimental_remote_cache_compression flag? If so, I would recommend upgrading bazel-remote to v2.3.4 and trying a bazel version with this fix: bazelbuild/bazel@8ebd70b
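For reference, enabling that on the client side should just be a matter of adding the flag to your existing remote cache options, roughly (the endpoint is a placeholder):

  bazel build //... --remote_cache=grpc://your-cache-host:8081 --experimental_remote_cache_compression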

@SrodriguezO (Author)

Are you using bazel 5.0's new --experimental_remote_cache_compression flag? If so, I would recommend upgrading bazel-remote to v2.3.4 and trying a bazel version with this fix: bazelbuild/bazel@8ebd70b

We are not currently using that. That does seem valuable though, and we'll definitely explore it.

I don't suspect that would decrease the memory footprint on bazel-remote though, right? Is there any way to cap memory usage at the moment?

--

Sidenote: The bazel flag tweaks we tried yesterday evening helped, but they were insufficient. We experienced another OOM today (and were close to the wire a few other times).

We're currently trying to horizontally scale the cache (based on this comment) as further mitigation.


mostynb commented Mar 8, 2022

bazel-remote should use less memory if it's using zstd compressed storage and the clients are downloading zstd-compressed data (bazel-remote can just write compressed data from disk instead of compressing it on each request).

Another experiment you could try is to run bazel-remote with the uncompressed storage mode. That would exclude zstd compression/decompression from the setup. If you still see OOMs then we would know to focus elsewhere. Saving some pprof data while memory usage is high might also be helpful.
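Since your logs show a profiling HTTP server on :8082, grabbing a heap profile should be something along these lines, assuming the standard net/http/pprof endpoints and running from the cache host:

  go tool pprof -top http://localhost:8082/debug/pprof/heap
  # or save the raw profile for later analysis:
  curl -o heap.pprof http://localhost:8082/debug/pprof/heap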

IIRC bazel-remote 1.1.0 was built with go 1.14.2, and go 1.16 switched from using MADV_FREE to MADV_DONTNEED, which might be related.

@SrodriguezO (Author)

Good idea on the pprofs. I'm a bit confused though. The in-use memory profiles don't seem to account for even half of the memory the service is using:
[screenshot: bazel-remote_mem-top]
[screenshot: bazel-remote_pprof-heap_inuse-space]
[screenshot: bazel-remote_pprof-heap_inuse_cumulative-sort]

The cumulative memory in use according to the pprof is ~20GB, but the service was using ~55GB at that time.

It seems the vast majority of the memory usage reported by the pprof is around file loading, at least during this snapshot.

The alloc memory profiles might shed some light on memory usage spikes that might not have been happening when I took the profile. If I'm interpreting this correctly, a large chunk of memory usage during writes was during zstd encoding, so we might indeed get some benefit from --experimental_remote_cache_compression once Bazel 5.1 goes out.
[screenshot: bazel-remote_pprof-heap_alloc_cumulative-sort]

Other large chunks seem to be during disk cache writes and gRPC responses. Tightening --remote_max_connections hopefully helps there.

Some questions:

  • Why is the in-use memory on the profile not accounting for ⅗ of the reported memory usage?
  • Is there a way to maybe make GC more aggressive for this service? Normalized load on our host remained fairly low during the incidents, so we could probably swing that to keep memory usage down.
  • Regarding your comment about the Go MADV_FREE -> MADV_DONTNEED change, do you know if there's a way to toggle that for this service?
  • Is there any other info I can share to help troubleshoot this issue?

Also, thank you for your prompt replies, I really appreciate that you're helping me work through this :)

-Sergio


mostynb commented Mar 9, 2022

There are some notes on the GODEBUG environment variable here; it's a comma-separated list of settings:
https://pkg.go.dev/runtime?utm_source=godoc#hdr-Environment_Variables

One of the settings is madvdontneed=0 to use MADV_FREE (the old setting) instead of MADV_DONTNEED. You can read a little about what they mean here:
https://man7.org/linux/man-pages/man2/madvise.2.html

It might also be worth setting gctrace=1 to get some GC stats in your logs.

You can also try playing with the GOGC environment variable, to trigger GC more often (also described in the pkg.go link above).
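For example, with the Docker setup you described, these could be passed as environment variables when starting the container (the GOGC value below is just an illustrative starting point):

  docker run -e GODEBUG=madvdontneed=0,gctrace=1 -e GOGC=50 <other flags> buchgr/bazel-remote-cache:v2.3.3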

Re the discrepancy between the memory profile's view of memory usage and the system's, there are so many different ways to count memory usage that I think the first step is to try to understand what each tool is measuring. Is that a screenshot from top? Is it running inside docker, or outside?

@SrodriguezO (Author)

The screenshot was indeed from top, running outside the container. The container is just docker run <flags> buchgr/bazel-remote-cache:v2.3.3

Thanks for those links :)


tobbe76 commented May 18, 2022

Had the same problem: when the disk cache size passed 1 TB, it would OOM on a server with 64 GB of memory. Setting GOGC=20 solved the problem.


mostynb commented Sep 1, 2022

v2.3.9 has a new --zstd_implementation cgo mode, which might reduce memory usage. Please let me know if it helps.
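Trying it should just be a matter of adding the flag to the server invocation, for example (other flags elided, zstd storage mode assumed):

  docker run <other flags> buchgr/bazel-remote-cache:v2.3.9 --storage_mode zstd --zstd_implementation cgo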

@liam-baker-sm

Hello, I can reproduce unusually high memory usage under a very specific configuration.

  • gRPC connection between the Bazel build and the bazel-remote server.
  • Compressed transfer (--experimental_remote_cache_compression)
  • Top-level download (--remote_download_toplevel)

With this combination, memory use on the cache server reaches 10GB.
Removing --remote_download_toplevel, memory use on the server does not exceed 3GB.

The test is performed with a large build (~40GB of artefacts), from a single client on the same LAN.
The server is for local office use and has an HTTP proxy backend defined, pointing to the main CI cache.


liam-baker-sm commented Sep 28, 2023

Turning off compression (removing --experimental_remote_cache_compression) while still running with --remote_download_toplevel, the server memory use peaks at 5.1GB. I suspect, based on the Bazel output, this is the result of "queuing up" fetches and multiplexing them in parallel over the gRPC channel (currently there are 300 concurrent fetches in progress over 5 gRPC connections).

@liam-baker-sm

The bazel-remote version is 2.4.3 on all servers.


mostynb commented Sep 28, 2023

@liam-baker-sm: Thanks for the report.

Which storage mode is bazel-remote using in this scenario? In the ideal setup, with bazel-remote storing zstd-compressed blobs and Bazel requesting zstd blobs, the blobs can be streamed directly from the filesystem without recompression.


tobbe76 commented Sep 29, 2023

We are now using GOMEMLIMIT, which is available in newer versions of Go. This solves the problem of a "transient spike in the live heap size":
https://tip.golang.org/doc/gc-guide
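As a rough illustration (the limit value is a placeholder; it should be set somewhat below the memory actually available to the process):

  docker run -e GOMEMLIMIT=56GiB <other flags> buchgr/bazel-remote-cache:v2.4.3

or GOMEMLIMIT=56GiB in the service environment when running the bare binary.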


mostynb commented Sep 29, 2023

We are now using GOMEMLIMIT, which is available in newer versions of Go. This solves the problem of a "transient spike in the live heap size" https://tip.golang.org/doc/gc-guide

I added a similar suggestion to the systemd configuration example recently: 2bcc2f5

@liam-baker-sm

@mostynb The bazel-remote instance I ran the test against is using uncompressed storage, due to #524. The instance points to another bazel-remote using --http_proxy.url.
