Very high memory usage on v2.3.3 - is this configurable? #529

Open
SrodriguezO opened this issue Mar 8, 2022 · 17 comments


SrodriguezO commented Mar 8, 2022

We're experiencing severe memory issues with the cache since upgrading to v2.3.3 (from v1.1.0). These were asymptomatic for most of January and February, but started causing frequent cache OOMs following our upgrade to Bazel 5 at the beginning of last week. The memory footprint was already significantly higher prior to the Bazel 5 upgrade, however.

Prior to the bazel-remote cache upgrade (which took place 01/01/2022), memory usage was minimal. Following the upgrade, the cache process regularly uses up all the memory on the host (~92g), resulting in the OOM killer killing the cache.

[screenshot: usable_mem]

We noticed that the used file handles count markedly dropped following the cache upgrade as well, which leads us to believe some actions that previously relied on heavy disk usage now occur in-memory.

[screenshot: remote_file_handles]

--

A very large chunk of the memory usage occurs during cache startup. For example, following a crash at 2:10pm, the cache was holding 70g of memory by 2:29pm, which is when the cache finally started serving requests. You can see the memory usage trend for that OOM/restart (and two prior ones) on this screenshot:

[screenshot: remote_mem_used_2]

The cache logs show

<~21:10:00 process starts - logs are truncated, so the exact timestamp is missing, but our service wrapper simply launches this docker container>
…
… <tons of "Removing incomplete file" logs>
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:28 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Removing incomplete file: /bazel-remote/cas.v2/ff/f…
2022/03/07 21:24:29 Sorting cache files by atime.
2022/03/07 21:26:26 Building LRU index.
2022/03/07 21:29:41 Finished loading disk cache files.
2022/03/07 21:29:41 Loaded 54823473 existing disk cache items.
2022/03/07 21:29:41 Mangling non-empty instance names with AC keys: disabled
2022/03/07 21:29:41 gRPC AC dependency checks: enabled
2022/03/07 21:29:41 experimental gRPC remote asset API: disabled
2022/03/07 21:29:41 Starting gRPC server on address :8081
2022/03/07 21:29:41 Starting HTTP server on address :8080
2022/03/07 21:29:41 HTTP AC validation: enabled
2022/03/07 21:29:41 Starting HTTP server for profiling on address :8082
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
2022/03/07 21:29:42 GRPC CAS HEAD … OK
…

Most of the memory surge occurs during the "Removing incomplete file" steps, and a second surge occurs as the LRU index is built.

Attempted Mitigations:
We attempted restricting the memory allowance for the Docker container via the -m docker flag in hopes of at least keeping the process from OOMing, but this did not suffice - the service became unresponsive.
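For context, the limit was applied roughly like this (the image tag matches the container we run, but the limit value shown is illustrative rather than the exact figure we used):

  docker run -m 80g <other flags> buchgr/bazel-remote-cache:v2.3.3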

Given that the memory issues became much worse following the Bazel 5 upgrade, we tweaked these Bazel flags (a sketch of the resulting flags follows the list):

  • We unset the --experimental_remote_cache_async flag
  • We set --remote_max_connections=10 (we previously had it set to 0, which means no limit, but this didn't affect gRPC connections prior to Bazel 5).
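For clarity, the client-side change amounts to something like this .bazelrc fragment (the cache address is a placeholder):

  # .bazelrc (sketch)
  build --remote_cache=grpc://bazel-cache.example.com:8081
  build --remote_max_connections=10
  # --experimental_remote_cache_async is no longer set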

Even if these help (we'll find out as tomorrow's workday picks up), we'll still be very close to running out of memory (as we were through February, before the Bazel 5 upgrade).

Is there some way to configure how much memory the bazel-remote process utilizes?


ulrfa commented Mar 8, 2022

Interesting!

Is your bazel-remote configured with storage_mode zstd or uncompressed?

It seems you access the cache via gRPC and not HTTP. Can you confirm?

@SrodriguezO (Author)

We're using zstd, and that's correct - we access the cache via gRPC.


mostynb commented Mar 8, 2022

Are you using bazel 5.0's new --experimental_remote_cache_compression flag? If so, I would recommend upgrading bazel-remote to v2.3.4 and trying a bazel version with this fix: bazelbuild/bazel@8ebd70b
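For reference, enabling that on the client side should just be a matter of adding the flag to your existing remote cache options, roughly (the endpoint is a placeholder):

  bazel build //... --remote_cache=grpc://your-cache-host:8081 --experimental_remote_cache_compression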

@SrodriguezO (Author)

Are you using bazel 5.0's new --experimental_remote_cache_compression flag? If so, I would recommend upgrading bazel-remote to v2.3.4 and trying a bazel version with this fix: bazelbuild/bazel@8ebd70b

We are not currently using that. That does seem valuable though, and we'll definitely explore it.

I don't suspect that would decrease the memory footprint on bazel-remote though, right? Is there any way to cap memory usage at the moment?

--

Sidenote: The bazel flag tweaks we tried yesterday evening helped, but they were insufficient. We experienced another OOM today (and were close to the wire a few other times).

We're currently trying to horizontally scale the cache (based on this comment) as further mitigation.


mostynb commented Mar 8, 2022

bazel-remote should use less memory if it's using zstd compressed storage and the clients are downloading zstd-compressed data (bazel-remote can just write compressed data from disk instead of compressing it on each request).

Another experiment you could try is to run bazel-remote with the uncompressed storage mode. That would exclude zstd compression/decompression from the setup. If you still see OOMs then we would know to focus elsewhere. Saving some pprof data while memory usage is high might also be helpful.
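Since your logs show a profiling HTTP server on :8082, grabbing a heap profile should be something along these lines, assuming the standard net/http/pprof endpoints and running from the cache host:

  go tool pprof -top http://localhost:8082/debug/pprof/heap
  # or save the raw profile for later analysis:
  curl -o heap.pprof http://localhost:8082/debug/pprof/heap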

IIRC bazel-remote 1.1.0 was built with go 1.14.2, and go 1.16 switched from using MADV_FREE to MADV_DONTNEED, which might be related.

@SrodriguezO (Author)

Good idea on the pprofs. I'm a bit confused though. The in-use memory profiles don't seem to account for even half of the memory the service is using:
[screenshot: bazel-remote_mem-top]
[screenshot: bazel-remote_pprof-heap_inuse-space]
[screenshot: bazel-remote_pprof-heap_inuse_cumulative-sort]

The cumulative memory in use according to the pprof is ~20GB, but the service was using ~55GB at that time.

It seems the vast majority of the memory usage reported by the pprof is around file loading, at least during this snapshot.

The alloc memory profiles might shed some light on memory usage spikes that might not have been happening when I took the profile. If I'm interpreting this correctly, a large chunk of memory usage during writes was during zstd encoding, so we might indeed get some benefit from --experimental_remote_cache_compression once Bazel 5.1 goes out.
[screenshot: bazel-remote_pprof-heap_alloc_cumulative-sort]

Other large chunks seem to be during disk cache writes and gRPC responses. Tightening --remote_max_connections hopefully helps there.

Some questions:

  • Why is the in-use memory on the profile not accounting for ⅗ of the reported memory usage?
  • Is there a way to maybe make GC more aggressive for this service? Normalized load on our host remained fairly low during the incidents, so we could probably swing that to keep memory usage down.
  • Regarding your comment about the Go MADV_FREE -> MADV_DONTNEED change, do you know if there's a way to toggle that for this service?
  • Is there any other info I can share to help troubleshoot this issue?

Also, thank you for your prompt replies, I really appreciate that you're helping me work through this :)

-Sergio


mostynb commented Mar 9, 2022

There are some notes on the GODEBUG environment variable here; it's a comma-separated list of settings:
https://pkg.go.dev/runtime?utm_source=godoc#hdr-Environment_Variables

One of the settings is madvdontneed=0 to use MADV_FREE (the old setting) instead of MADV_DONTNEED. You can read a little about what they mean here:
https://man7.org/linux/man-pages/man2/madvise.2.html

It might also be worth setting gctrace=1 to get some GC stats in your logs.

You can also try playing with the GOGC environment variable, to trigger GC more often (also described in the pkg.go link above).
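For example, with the Docker setup you described, these could be passed as environment variables when starting the container (the GOGC value below is just an illustrative starting point):

  docker run -e GODEBUG=madvdontneed=0,gctrace=1 -e GOGC=50 <other flags> buchgr/bazel-remote-cache:v2.3.3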

Re the discrepancy between the memory profile's view of memory usage and the system's, there are so many different ways to count memory usage that I think the first step is to try to understand what each tool is measuring. Is that a screenshot from top? Is it running inside docker, or outside?

@SrodriguezO (Author)

The screenshot was indeed from top, running outside the container. The container is just docker run <flags> buchgr/bazel-remote-cache:v2.3.3

Thanks for those links :)


tobbe76 commented May 18, 2022

Had the same problem: when the disk cache size passed 1 TB, it would OOM on a server with 64 GB of memory. Setting GOGC=20 solved the problem.


mostynb commented Sep 1, 2022

v2.3.9 has a new --zstd_implementation cgo mode, which might reduce memory usage. Please let me know if it helps.
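Trying it should just be a matter of adding the flag to the server invocation, for example (other flags elided, zstd storage mode assumed):

  docker run <other flags> buchgr/bazel-remote-cache:v2.3.9 --storage_mode zstd --zstd_implementation cgo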

@liam-baker-sm

Hello, I can reproduce unusually high memory usage under a very specific configuration.

  • gRPC connection between the Bazel build and the bazel-remote server.
  • Compressed transfer (--experimental_remote_cache_compression)
  • Top-level download (--remote_download_toplevel)

With this combination, memory use on the cache server reaches 10GB.
Removing --remote_download_toplevel, memory use on the server does not exceed 3GB.

The test is performed with a large build (~40GB of artefacts), from a single client on the same LAN.
The server is for local office use and has an HTTP proxy backend defined, pointing to the main CI cache.


liam-baker-sm commented Sep 28, 2023

Turning off compression (removing --experimental_remote_cache_compression) while still running with --remote_download_toplevel, the server memory use peaks at 5.1GB. I suspect, based on the Bazel output, this is the result of "queuing up" fetches and multiplexing them in parallel over the gRPC channel (currently there are 300 concurrent fetches in progress over 5 gRPC connections).

@liam-baker-sm

The bazel-remote version is 2.4.3 on all servers.


mostynb commented Sep 28, 2023

@liam-baker-sm: Thanks for the report.

Which storage mode is bazel-remote using in this scenario? In the ideal setup, with bazel-remote storing zstd-compressed blobs and Bazel requesting zstd blobs, the blobs can be streamed directly from the filesystem without recompression.


tobbe76 commented Sep 29, 2023

We are now using GOMEMLIMIT, which is available in newer versions of Go. This solves the problem of a "transient spike in the live heap size":
https://tip.golang.org/doc/gc-guide
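As a rough illustration (the limit value is a placeholder; it should be set somewhat below the memory actually available to the process):

  docker run -e GOMEMLIMIT=56GiB <other flags> buchgr/bazel-remote-cache:v2.4.3

or GOMEMLIMIT=56GiB in the service environment when running the bare binary.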


mostynb commented Sep 29, 2023

We are now using GOMEMLIMIT, which is available in newer versions of Go. This solves the problem of a "transient spike in the live heap size" https://tip.golang.org/doc/gc-guide

I added a similar suggestion to the systemd configuration example recently: 2bcc2f5

@liam-baker-sm

@mostynb The bazel-remote instance I ran the test against is using uncompressed storage, due to #524. The instance points to another bazel-remote using --http_proxy.url.
