
Reject requests instead of crashing when overloaded #638

Open
ulrfa opened this issue Feb 10, 2023 · 15 comments

@ulrfa
Contributor

ulrfa commented Feb 10, 2023

We experience bazel-remote crashes when the local SSD is overloaded by write operations. It would be preferable if bazel-remote rejected requests instead of crashing.

The crash message contains runtime: program exceeds 10000-thread limit, and among the listed goroutines there are almost 10000 stack traces containing either diskCache.Put or diskCache.removeFile, performing syscalls. They originate from both HTTP and gRPC requests.

What do you think about rejecting incoming requests when reaching a configurable maximum number of concurrent diskCache.Put invocations? Or are there better ways?

We have not experienced overload from read requests; the network interface is probably saturated before that would happen. Therefore, it seems beneficial to still allow read requests and get cache hits even when writes are rejected.

I’m uncertain about the proper relation between the semaphore constant (currently 5000) for diskCache.removeFile, the number of allowed diskCache.Put invocations, and Go's default limit of 10 000 operating system threads. Would it make sense to set one limit for the sum of diskCache.removeFile and diskCache.Put? Should bazel-remote support tuning https://pkg.go.dev/runtime/debug#SetMaxThreads?

It is important to not reject or hinder Prometheus HTTP requests, because the metrics are even more important in overload conditions.

@mostynb
Collaborator

mostynb commented Feb 10, 2023

This is an interesting problem.

It would be useful for me (and potentially for other large users) if you could share some details of your storage, the normal load, and the spikes you're seeing (either here, or via email if you prefer not to share this info publicly). Does the client handle the service outage gracefully? And do you have bazel-remote configured to restart and hopefully recover from the overload?

The first things to suggest if you want to maintain performance would be:

  • Switch to faster storage that might be able to handle the load without falling behind.
  • Partition the cache (eg separate cache with separate storage for each platform, or for each release branch, or similar).

If the storage is falling behind due to the number of requests, then I doubt that increasing runtime/debug.SetMaxThreads would help for long (though it probably depends on the duration of the spikes). You're welcome to experiment with this, of course (it would be a one-line patch to try a higher limit).

I think it would be reasonable to try to make bazel-remote attempt to recognise when the storage is falling behind and reject new requests (eg with the unofficial HTTP 529 status code or gRPC RESOURCE_EXHAUSTED or UNAVAILABLE - though I'm not sure if bazel handles those gracefully). Write actions could for example use semaphore#Weighted.TryAcquire on the same semaphore used for file removal.
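
Roughly, something like this minimal sketch (not actual bazel-remote code; the diskWaitSem name and handler wiring are just placeholders here):

package cache

import (
	"context"

	"golang.org/x/sync/semaphore"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Shared semaphore, also used by file removal (weight 5000 as today).
var diskWaitSem = semaphore.NewWeighted(5000)

// putWithBackpressure rejects a write immediately when the semaphore is
// saturated, instead of piling up more goroutines and OS threads.
func putWithBackpressure(ctx context.Context, put func(context.Context) error) error {
	if !diskWaitSem.TryAcquire(1) {
		return status.Error(codes.ResourceExhausted, "storage overloaded, rejecting write")
	}
	defer diskWaitSem.Release(1)
	return put(ctx)
}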

Prometheus metric requests don't touch the filesystem, and should be infrequent enough anyway to be left as-is.

mostynb added a commit to mostynb/bazel-remote that referenced this issue Feb 12, 2023
…erloaded

We have been using a file removal semaphore with weight 5,000 (half of Go's
default 10,000 maximum OS threads, beyond which Go will crash), in an attempt
to avoid crashing when the filesystem/storage layer can't keep up with our
requirements.

This change renames that semaphore to `diskWaitSem` and also uses it for
disk-write operations. When the semaphore cannot be acquired for disk-writes,
we return HTTP 503 (service unavailable) or gRPC RESOURCE_EXHAUSTED error codes
to the client.

Relates to buchgr#638
@ulrfa
Contributor Author

ulrfa commented Feb 12, 2023

Thanks @mostynb!

We have patched the bazel client with a custom load balancer. It distributes load over several bazel-remote instances. The patch contains logic to ignore a cache instance for a minute if remote cache issues are experienced, and during that time handle requests as if they were cache misses. That works well for local builds with remote caches, and should be OK also with rejects.

That patch is not yet compatible with remote execution. For remote execution we use bazel's --remote_local_fallback, so the builds do not fail when rejected, but rejects are still problematic when the local hosts are not dimensioned for building locally.

In both cases, the bazel client behaviour allows the remote cache to recover from the overload.

Our normal load is about 500 – 1500 write requests/second per bazel-remote instance.

We had three crashes. Our monitoring shows request rates calculated over 20 second intervals, so I don't know exactly how high or short the spikes were:

  • Crash with 3000 write requests/second for 20 seconds. I’ve seen similar spikes that did not cause a crash.
  • Crash with 6000 write requests/second for 20 seconds. At the crash there were 14000 fileRemoval goroutines (5000 active, the rest waiting for the semaphore).
  • Crash with 6000 write requests/second for 60 seconds. At the crash there were 1000 fileRemoval goroutines.

I guess the numbers above can vary for different kinds of builds and different file sizes.

Thanks for #639 and your quick response! I will try to arrange for us to try it!

I’m uncertain about what constants to choose, but I’m afraid 5000 could result in unnecessary rejects for very short high spikes, especially if there are writes of large files that cause a huge number of small-file evictions. I’m thinking about trying something like:

  • semaphoreWeight := int64(50000)
  • SetMaxThreads(60000)
  • ulimit -u 65536
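
For reference, the SetMaxThreads part would be a one-line change at startup; a rough sketch (the 60000 value is just what I'm considering, not a recommendation):

package main

import "runtime/debug"

func main() {
	// Raise Go's OS-thread ceiling from the default 10000.
	// ulimit -u must also be raised on the host for this to matter.
	debug.SetMaxThreads(60000)
	// ... start the bazel-remote server as usual ...
}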

@ulrfa
Contributor Author

ulrfa commented Mar 10, 2023

Sorry for the late answer; it took time to get hardware and to experiment.

When trying #639, bazel-remote rejects requests instead of crashing, which is good. But it rejected requests even under low load, because writing one single large file can result in eviction of many thousands of small files, and those eviction spikes can be very unpredictable. We tried increasing the semaphoreWeight from 5000 to 50000 and MaxThreads correspondingly, but it did not help, just as you predicted. We observed that the Go runtime never releases started threads, so I suspect that even 5000 concurrent file removal syscalls is more than desired.

However, the good news: we observed that if the cache is not full and no evictions occur, then there are no crashes, request latency is fine, and the number of threads forked by the Go runtime stays below 2000.

In other words, the overload seems to be caused by the evictions. What if we could minimize evictions at peak load to increase max throughput, and improve write latency by avoiding evictions as part of handling incoming write requests? We experimented with a single goroutine that evicted files slowly, one at a time, in the background, proactively when lru.currentSize is above a threshold. That single goroutine could not evict files as fast as they were written at peaks (~2 minute long peaks), but easily caught up between the peaks. lru.currentSize fluctuated by only ~400 MB, and the performance was good, similar to having no evictions at all.

Other scenarios might be different, but I guess there would be plenty of margin if evicting proactively when lru.currentSize is, for example, above 95% of lru.maxSize. Using disk space as a buffer and wasting 5% of it to manage peaks would be a good trade-off for us.

As extra safety, perhaps incoming disk.Put could still perform evictions in the rare case that lru.currentSize reaches lru.maxSize.
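
For illustration, the experiment was roughly along the lines of this sketch (simplified; the type, field and method names here are made up, not bazel-remote's actual ones):

package main

import (
	"sync"
	"time"
)

// Illustrative skeleton only; currentSize/maxSize stand in for
// lru.currentSize and lru.maxSize.
type lruCache struct {
	mu          sync.Mutex
	currentSize int64 // bytes currently stored
	maxSize     int64 // configured --max_size in bytes
}

// evictOne stands in for removing the least recently used file from disk.
func (c *lruCache) evictOne() {
	c.mu.Lock()
	c.currentSize -= 4096 // pretend a single 4 KB blob was deleted
	c.mu.Unlock()
}

// backgroundEvictor deletes files slowly, one at a time, whenever usage
// crosses threshold*maxSize (e.g. 0.95), instead of evicting inside Put.
func (c *lruCache) backgroundEvictor(threshold float64) {
	for {
		c.mu.Lock()
		over := float64(c.currentSize) > threshold*float64(c.maxSize)
		c.mu.Unlock()
		if over {
			c.evictOne()
		} else {
			time.Sleep(100 * time.Millisecond)
		}
	}
}

func main() {
	c := &lruCache{currentSize: 10 << 30, maxSize: 1 << 30}
	go c.backgroundEvictor(0.95)
	time.Sleep(time.Second)
}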

What do you think @mostynb?

@mostynb
Collaborator

mostynb commented Mar 13, 2023

Thanks for the update, lots of things to consider here.

I have been thinking about changing the file removal, so that when under load we process the removal of small blobs asynchronously at a slower rate. We might temporarily go over the cache size limit, but if we reduce the effective cache size limit that would kind of be equivalent to proactive evictions when reaching the limit, and wouldn't require polling the cache size.

@ulrfa
Contributor Author

ulrfa commented Mar 13, 2023

Interesting! What do you have in mind for implementing “asynchronously at a slower rate”? Can it avoid creating a goroutine for each blob? Example: if using 5% of a 2 TB disk as a buffer for blobs scheduled for eviction, there can be many such blobs. Maybe 100 000 blobs, maybe more. I'm not sure how the Go runtime would handle that many goroutines, but if a crash printed 100 000 stack traces for me to read, it would be cumbersome. 😃

I imagine having one threshold size that triggers evictions and another upper max size where bazel-remote would start rejecting write requests.

@mostynb
Collaborator

mostynb commented Mar 14, 2023

golang.org/x/sync/semaphore is very small; we could fork it and add functions which expose the reserved/current size of the semaphore, and that could be used as a heuristic for when we're under high disk load.

eg if we added a TryAcquireIfUnder(n int64, limit int64) function, then we could allow immediate removal for small blobs if that succeeds with a smaller limit (eg 1000). And if we fail to acquire the semaphore, remove the blob at a slower rate using a separate mechanism that is lightweight enough. Note that file removal already happens in a separate goroutine while not holding the index lock.
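
To sketch the idea (this is not the real golang.org/x/sync/semaphore API; a fork could track the acquired weight with an extra counter, roughly like this):

package cache

import (
	"sync/atomic"

	"golang.org/x/sync/semaphore"
)

// trackedSem wraps a Weighted semaphore and keeps its own count of the
// acquired weight, approximating what a forked TryAcquireIfUnder could expose.
type trackedSem struct {
	sem *semaphore.Weighted
	cur atomic.Int64
}

func newTrackedSem(size int64) *trackedSem {
	return &trackedSem{sem: semaphore.NewWeighted(size)}
}

// TryAcquireIfUnder acquires n only if the currently acquired weight is below
// limit. (There is a benign race between the check and the acquire; it is only
// meant as a heuristic for "are we under high disk load".)
func (t *trackedSem) TryAcquireIfUnder(n, limit int64) bool {
	if t.cur.Load()+n > limit {
		return false
	}
	if !t.sem.TryAcquire(n) {
		return false
	}
	t.cur.Add(n)
	return true
}

func (t *trackedSem) Release(n int64) {
	t.cur.Add(-n)
	t.sem.Release(n)
}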

@ulrfa
Contributor Author

ulrfa commented Mar 14, 2023

then we could allow immediate removal for small blobs

I'm striving to perform evictions with low priority (compared to the handling of incoming requests) rather than removing blobs immediately, in order not only to manage the peak load, but also to reduce latency for incoming requests regardless of load.

@AlessandroPatti
Contributor

@mostynb @ulrfa I was experimenting a bit with a solution to this issue and would love your feedback on this PoC: #695. The idea is to have a fixed number of goroutines that pick eviction tasks from a queue. Both the number of goroutines and the max length of the queue are configurable, so that there is more freedom in deciding how many resources to allocate to eviction and when to consider the service overloaded.
The risk is that the disk can fall behind and, as the queue grows larger than the number of goroutines, the cache size will go above the limit, but this already seems to be the case when there are more than 5000 evictions happening in parallel with the current semaphore-based solution. This solution should also work in the scenario @ulrfa described, where eviction falls behind during peak traffic but can recover later by consuming the queued tasks when traffic reduces, without having tens of thousands of goroutines pending, competing over the semaphore.
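
The core of the PoC is roughly like this (a simplified sketch, not the exact #695 code):

package cache

import (
	"log"
	"os"
)

// evictor owns a bounded queue of file paths and a fixed pool of workers.
type evictor struct {
	queue chan string
}

// newEvictor starts `workers` goroutines that consume eviction tasks.
func newEvictor(workers, queueLen int) *evictor {
	e := &evictor{queue: make(chan string, queueLen)}
	for i := 0; i < workers; i++ {
		go func() {
			for path := range e.queue {
				if err := os.Remove(path); err != nil {
					log.Printf("eviction of %s failed: %v", path, err)
				}
			}
		}()
	}
	return e
}

// tryEnqueue schedules a file for removal; false means the queue is full,
// which the caller can treat as "the service is overloaded".
func (e *evictor) tryEnqueue(path string) bool {
	select {
	case e.queue <- path:
		return true
	default:
		return false
	}
}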

ulrfa pushed a commit to ulrfa/bazel-remote that referenced this issue Sep 1, 2023
@ulrfa
Contributor Author

ulrfa commented Sep 1, 2023

Thanks @AlessandroPatti!

Status update from me:

I did several experiments and surprisingly found that deleting lots of files sequentially with a single goroutine is much faster than starting separate goroutines and removing them in parallel, despite SSDs with high IOPS performance. I implemented a new set of patches based on those findings, and we have used them in production for a few months. We are very happy with them, since they not only avoid the crashes on high load but also improve the performance in scenarios with many cache evictions.

I intended to push them upstream earlier, but I did not have time to rebase and write test cases, so it did not happen. Now I have pushed them as-is: #696

My set of patches is based on @mostynb's previous commit and adds 3 parts. The "Optimize file removals (part 1)" patch is based on a queue, just like @AlessandroPatti's PoC. Two people independently arriving at the same conclusion is a good sign! 😃

@AlessandroPatti, have you benchmarked any scenario where having more than one goroutine deleting files from the queue improves the performance?

@mostynb, have you been thinking more about this?

@AlessandroPatti
Contributor

@ulrfa Thanks for sharing your findings!

I haven't run any "scientific" benchmark on this, but from experimenting, performing eviction in parallel seems generally better. This is what I tried:

# Run bazel remote, with <concurrency> value
$ bazel run :bazel-remote -- --storage_mode uncompressed -max_size 1 -dir /tmp/cache --enable_endpoint_metrics --max_concurrent_evictions <concurrency> --max_queued_evictions 100000 --access_log_level none

# In another terminal, monitor the state of the eviction queue with
$ watch -n 0.1 'curl -L http://localhost:8080/metrics | grep eviction'

# In yet another terminal, generate a bunch of random file of 1KB and 1MB and send them to bazel remote
$ mkdir /tmp/data
$ for u in K M; do for i in `seq 1 2000`; do dd if=/dev/urandom of="/tmp/data/$i.1$u.data" bs=1$u count=1; done; done
$ for u in M K M K M; do for i in `seq 1 2000`; do curl -X PUT --upload-file /tmp/data/$i.1$u.data "http://localhost:8080/cas/$(sha256sum /tmp/data/$i.1$u.data | cut -d ' ' -f1)" & done; done

By running the above I can see the eviction queue growing when the concurrency is too low, while it's somewhat stable for concurrency ~100.
I'll be happy to run a more thorough benchmark if you could share the scenario you've been testing so we can compare the solutions.

On another note, the solution you've shared seems to throttle PUT requests with a semaphore, but not GET requests, which could also require eviction if the blob is fetched from the backend. Curious if you've tried using a proxy backend and whether this has been an issue in your experience?

@mostynb
Collaborator

mostynb commented Sep 4, 2023

(I'm busier than normal at the moment, but wanted to mention that my goal is to merge some form of #680 this week, then turn my attention back to this issue.)

@ulrfa
Contributor Author

ulrfa commented Sep 4, 2023

Most benchmarks I have been running re-send traffic patterns from recorded real sessions. I need to think about whether I can share something from those.

But for cache eviction performance I propose the following benchmark:

  1. Pre-populate a cas directory structure with 2621440 x 4 KB dummy files = 10 GB.

     fill_cache.py cache_pre_filled 4096 2621440
    
  2. Start bazel-remote with --max_size 1 and measure how long it takes to evict files from 10 GB down to 1 GB at bazel-remote startup

     cp -r cache_pre_filled cache
     bazel-remote --dir cache --max_size 1
     rm -rf cache
    

In order to measure in step 2 above, I prepared commits with additional logging of "Duration for loading and evicting files" for the #695 and #696 pull requests, and also for master.

fill_cache.py is doing:

import os
import os.path
import sys
import hashlib
import binascii
import random
import pathlib

baseDir = sys.argv[1]
entrySizeBytes = int(sys.argv[2])
numberOfEntries = int(sys.argv[3])
totalSizeBytes = entrySizeBytes * numberOfEntries

def getRandomNumberWith9Digits():
    return str(random.randint(111111111,999999999))

def bytesToGigabyte(bytes):
    return bytes / (1024 * 1024 * 1024)

def addFile(baseDir, size):
    content = os.urandom(size)
    m = hashlib.sha256()
    m.update(content)
    hash = binascii.hexlify(m.digest()).decode('ascii')
    filename = '%s/cas.v2/%s/%s-%s.v1' % (baseDir, hash[0:2], hash, getRandomNumberWith9Digits())
    directoriesToCreate = pathlib.Path(os.path.dirname(filename))
    directoriesToCreate.mkdir(parents=True, exist_ok=True)
    dataFile = open(filename, "wb")
    dataFile.write(content)
    dataFile.close()
    os.chmod(filename, 0o666) # bazel-remote uses special permissions to detect partially written files.

print('Going to fill %s with dummy entries: %d x %d bytes = %d GB' %
      (baseDir, entrySizeBytes, numberOfEntries, bytesToGigabyte(totalSizeBytes)))

for x in range(numberOfEntries):
    addFile(baseDir, entrySizeBytes)

On one of our cache servers (72 CPUs, 2 x SSDs in RAID-0 config, Linux, XFS), I get the following result:

PR #696 (queue with single consuming goroutine):

  bazel-remote_7db7b9c --dir cache --max_size 1
    Duration for loading and evicting files: 44 s
    Duration for loading and evicting files: 45 s
    Duration for loading and evicting files: 45 s

PR #695 (queue with various numbers of consuming goroutines):

  bazel-remote_6368e28 --dir cache --max_size 1 --max_queued_evictions 2621440 --max_concurrent_evictions 1
    Duration for loading and evicting files: 49 s
    Duration for loading and evicting files: 48 s
    Duration for loading and evicting files: 46 s

  bazel-remote_6368e28 --dir cache --max_size 1 --max_queued_evictions 2621440 --max_concurrent_evictions 10
    Duration for loading and evicting files: 61 s

  bazel-remote_6368e28 --dir cache --max_size 1 --max_queued_evictions 2621440 --max_concurrent_evictions 100 
    Duration for loading and evicting files: 142 s
    Duration for loading and evicting files: 76 s
    Duration for loading and evicting files: 72 s
    Duration for loading and evicting files: 96 s

  bazel-remote_6368e28 --dir cache --max_size 1 --max_queued_evictions 2621440 --max_concurrent_evictions 1000
    Duration for loading and evicting files: 173 s
    Duration for loading and evicting files: 130 s

Latest master (goroutine per file and semaphore weight 5000):

  bazel-remote_871a9f5 --dir cache --max_size 1
    Duration for loading and evicting files: 121 s
    Duration for loading and evicting files: 126 s

It would be interesting to know whether you get a similar pattern on your systems. Would you like to try that, @AlessandroPatti?

By running the above I can see the eviction queue growing when the concurrency is too low, while it's somewhat stable for concurrency ~100.

Do you think it is possible that the eviction queue is more stable for concurrency ~100 not because old files are removed faster, but because those 100 goroutines slow down the other goroutines that are writing new files, since they compete for resources?

On another note, the solution you've shared seems to throttle PUT requests with a semaphore, but not GET requests which could however also require eviction if the blob is fetched from the backend. Curious if you've tried using a proxy backend and this is not an issue from your experience?

I'm not using the proxy backend and don't have much experience with it (I'm using sharding instead). The throttling via semaphore in #696 is not because a PUT can result in eviction, but because each PUT file write seems to consume one operating system thread. However, I think you are right that some additional handling is needed for such GET requests with a proxy, since they probably can also result in blocking file write syscalls. Would it make sense to throttle with the same semaphore also at the end of the disk.get method, e.g. after getting a proxy cache hit, before writing it to disk?

@ulrfa
Contributor Author

ulrfa commented Sep 4, 2023

As reference with same pre-filled cache of 10 GB tiny files:

$ time rm -rf cache

real	0m41.752s
user	0m0.693s
sys	0m40.106s

A naive attempt to use parallelism via xargs only makes it worse:

$ time bash -c "find cache -type f |xargs -P 8 rm"

real	0m43.159s
user	0m5.016s
sys	2m55.062s

If deleting complete subdirectories, then parallelism seems to help slightly. (Maybe avoiding deletion of files in the same directory from different concurrent processes helps? But I don't think it is worth having separate eviction queues per subdirectory, because there will be other writes of new files going on concurrently anyway.)

$ time bash -c "find cache/cas.v2/* -type d |xargs -n 32 -P 8 rm -r"

real	0m33.703s
user	0m1.403s
sys	3m39.368s

@AlessandroPatti
Contributor

Thanks @ulrfa! I added a small benchmark in 420f4ad that seems to somewhat confirm what you've experienced. Increasing the number of goroutines does not make it worse, but doesn't make it better either.

BenchmarkEvictions/Benchmark_1_4096_409600000_0-12         	       1	14267783375 ns/op
BenchmarkEvictions/Benchmark_2_4096_409600000_0-12         	       1	14621566334 ns/op
BenchmarkEvictions/Benchmark_4_4096_409600000_0-12         	       1	14346213500 ns/op
BenchmarkEvictions/Benchmark_8_4096_409600000_0-12         	       1	14201499582 ns/op
BenchmarkEvictions/Benchmark_10_4096_409600000_0-12        	       1	14226219458 ns/op
BenchmarkEvictions/Benchmark_100_4096_409600000_0-12       	       1	14782357375 ns/op
BenchmarkEvictions/Benchmark_1000_4096_409600000_0-12      	       1	14658689708 ns/op

@ulrfa
Contributor Author

ulrfa commented Sep 17, 2023

Thanks for running benchmark @AlessandroPatti!

Both #695 and #696 introduce queues for evicted files, and I think both our benchmarks motivate that approach. I think the next step is to hear what Mostyn thinks about a queue for files to be removed, but I guess Mostyn is still busy. I'm also busy at the moment, which is why I'm responding with a delay; sorry for that.
