Local disk cache #529

Open
ola-rozenfeld wants to merge 5 commits into master from ola-disk-cache

Conversation

ola-rozenfeld (Contributor) commented Jan 24, 2024

Implements an optional local disk cache (LRU, CAS + Action Cache) used for remote action outputs.
Goma has this feature, and many of our Reclient customers have since requested it (usually because of slow internet connections).

In my benchmarks on Linux with a relatively good internet connection, this did not reduce the runtime of a fully cached build by a statistically significant amount, but it didn't hurt either.

The current implementation runs GC asynchronously to save time. The priority queue implementation using the container/heap library is taken from the Go documentation, with one improvement: the backing array is doubled when needed instead of being extended one element at a time.
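
For context, a minimal sketch of such a queue, following the container/heap example in the Go documentation; the qitem and priorityQueue names mirror the PR, but the fields and the exact growth logic shown here are illustrative rather than taken from the change:

package diskcache

import "time"

// qitem is a cache entry, ordered for eviction by last access time (lat).
// key is the digest-based cache key type defined elsewhere in the PR.
type qitem struct {
	key   key
	lat   time.Time
	index int // position in the heap, maintained by the heap.Interface methods
}

// priorityQueue implements heap.Interface over a fixed-length slice with an
// explicit element count, so the backing array can be grown geometrically.
type priorityQueue struct {
	items []*qitem
	n     int
}

func (q *priorityQueue) Len() int { return q.n }

// Less makes this a min-heap on access time: the least recently used entry pops first.
func (q *priorityQueue) Less(i, j int) bool { return q.items[i].lat.Before(q.items[j].lat) }

func (q *priorityQueue) Swap(i, j int) {
	q.items[i], q.items[j] = q.items[j], q.items[i]
	q.items[i].index = i
	q.items[j].index = j
}

// Push doubles the backing array when full instead of growing it one element at a time.
func (q *priorityQueue) Push(x any) {
	if q.n == len(q.items) {
		newCap := 2 * len(q.items)
		if newCap == 0 {
			newCap = 16
		}
		items := make([]*qitem, newCap)
		copy(items, q.items)
		q.items = items
	}
	it := x.(*qitem)
	it.index = q.n
	q.items[q.n] = it
	q.n++
}

func (q *priorityQueue) Pop() any {
	q.n--
	it := q.items[q.n]
	q.items[q.n] = nil // drop the reference so the item can be collected
	return it
}

With a heap.Interface like this, the cache can use heap.Push and heap.Pop from container/heap to insert entries and to pick the least recently accessed entry for eviction.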

ola-rozenfeld (Contributor, Author) commented Jan 24, 2024

This is ready for review now. I'm not sure about the lint errors -- the linter complains about unwrapped errors from other packages, but from what I can see in the code, we do this literally everywhere, so I'm choosing to ignore them unless asked otherwise.

ola-rozenfeld force-pushed the ola-disk-cache branch 15 times, most recently from abb4792 to 8979a8d on January 24, 2024 at 22:22
newSize := uint64(atomic.AddInt64(&d.sizeBytes, dg.Size))
if newSize > d.maxCapacityBytes {
	select {
	case d.gcReq <- atomic.AddUint64(&d.gcTick, 1):
ola-rozenfeld (Contributor, Author)

The only reason we count the GC cycles here is for testing: I couldn't find another way to wait for a GC pass to finish in tests.
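
For illustration, one hypothetical way a test could use such a counter to block until a GC pass has finished; the completedTicks field and waitForGC helper below are invented for this sketch and are not names from the PR:

// waitForGC blocks until the GC goroutine reports that it has completed at
// least the given tick. completedTicks is assumed to be a uint64 field on
// DiskCache, incremented with atomic.AddUint64 at the end of each GC pass.
func (d *DiskCache) waitForGC(tick uint64) {
	for atomic.LoadUint64(&d.completedTicks) < tick {
		time.Sleep(time.Millisecond)
	}
}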

ola-rozenfeld (Contributor, Author)

Update: Brave has tried this change and sees no difference, because we still need to do an Action Cache RPC. I will now add an optional Action Cache to this change to see how much improvement they get with it.

ola-rozenfeld marked this pull request as draft on January 26, 2024 at 14:11
ola-rozenfeld force-pushed the ola-disk-cache branch 2 times, most recently from b42ecc2 to 50846d6 on January 27, 2024 at 18:41
ola-rozenfeld changed the title from Local disk CAS to Local disk cache on Jan 27, 2024
gkousik (Collaborator) commented Jan 29, 2024

Thanks for the PR, Ola! I generally like and agree with having a local disk cache feature, but I want to ask some due-diligence questions.

  1. In slow internet connection scenarios, has the racing exec strategy proven to be helpful? Is a local disk cache faster than racing in those scenarios?

  2. In my benchmarks on Linux with a relatively good internet connection, this did not reduce the runtime of a fully cached build by a statistically significant amount, but it didn't hurt either.

    This is very interesting - what was the size of your disk cache and what was the total output downloaded by the build? I am assuming this is a Chrome build? It's surprising that network transfer + write ~= local_disk_copy for a full build.

  3. I think StoreCas could also suffer from a race condition where the local build modifies the file after writing it out. Perhaps we need to validate the digest after the copy is finished? (See the sketch after this list.)

  4. Finally, a request, assuming we go ahead (I just want to know the answers to (1) and (2)): could this be split up into smaller CLs? I have some readability comments that may be easier to address with smaller changes :)
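
For (3), a minimal sketch of the kind of post-copy check being suggested, recomputing the digest of the file that was just copied into the cache; the validateStored helper is hypothetical, while NewFromFile comes from the SDK's digest package:

import (
	"fmt"

	"github.com/bazelbuild/remote-apis-sdks/go/pkg/digest"
)

// validateStored recomputes the digest of the cached copy and rejects the
// entry if the source file was modified while (or after) it was being copied.
func validateStored(want digest.Digest, cachedPath string) error {
	got, err := digest.NewFromFile(cachedPath)
	if err != nil {
		return err
	}
	if got != want {
		return fmt.Errorf("digest mismatch for %s: got %s, want %s", cachedPath, got, want)
	}
	return nil
}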

gkousik requested a review from mrahs on January 29, 2024 at 14:21
ola-rozenfeld marked this pull request as ready for review on January 29, 2024 at 16:42
ola-rozenfeld (Contributor, Author)

Hey Kousik and Anas! Good questions! 1 - @goodov from Brave has tried this out on a slow connection -- maybe Aleksey can comment on how it compares to racing, because I know he has tried that as well (although it's not yet the default setting we recommend to customers). I'm wondering if there's a way for me to artificially create a VM with a slow connection to try it out -- maybe get a VM in the farthest possible region from our backend?...

2 - I tried a Chromium build on Linux with a 50 GB cache size, and the actual cache size ended up at 26 GB. Which, by the way, implies that the RBE download metrics displayed at the end of the build are way off (it usually says RBE Stats: down 5.78 GB).

3 - Good point! I didn't think the output files would be locally modified (are they? Do you know which actions do that?). I'll add handling for this case and a unit test. But I'm struggling to imagine this being a race condition unless it's literally two concurrent actions that output the same file -- and if that's the case, isn't it already a race condition in the build itself?...

4 - Yes, agreed; I did it in one go because I wanted @goodov to be able to give it a try easily. I think this can naturally break into 3 PRs: first, introduce the diskcache with CAS only, plus unit tests; second, add the Action Cache functionality plus unit tests; third, introduce the client flags plus instrumentation to actually use the diskcache in the client. Does that sound good?

Thank you!

ola-rozenfeld (Contributor, Author) commented Jan 31, 2024

Updates:

  • Changed the directory structure following Marc-Antoine's (@maruel) suggestion (based on the Git and Isolate implementations) -- I am now splitting the cache directory into 256 subdirectories based on the first two hex characters of the digest, and using the remaining characters for the file name. Also moved the "_ac" marker to a suffix. Both of these should improve performance, especially on Windows (see the path sketch below).
  • Split this PR out into Local disk cache: part 1 #530 (will send out part 2 after part 1 is in).

I'm leaving this PR here for reference to the whole thing.
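
For reference, a sketch of the resulting on-disk layout as described above; the helper and the exact separators below are illustrative only, not code from the PR:

import (
	"fmt"
	"path/filepath"

	"github.com/bazelbuild/remote-apis-sdks/go/pkg/digest"
)

// Layout, roughly:
//   <cache_root>/<first 2 hex chars>/<remaining hex chars>.<size>      CAS entry
//   <cache_root>/<first 2 hex chars>/<remaining hex chars>.<size>_ac   Action Cache entry
func cachePath(root string, dg digest.Digest, isAc bool) string {
	name := fmt.Sprintf("%s.%d", dg.Hash[2:], dg.Size)
	if isAc {
		name += "_ac"
	}
	return filepath.Join(root, dg.Hash[:2], name)
}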

ola-rozenfeld force-pushed the ola-disk-cache branch 2 times, most recently from b01e309 to c800a0e on February 11, 2024 at 17:50
ola-rozenfeld (Contributor, Author)

@mrahs Friendly ping! :-) Let me know if you prefer to review the whole thing or in chunks (e.g. #530).

Thank you!

mrahs (Collaborator) commented Feb 13, 2024

> @mrahs Friendly ping! :-) Let me know if you prefer to review the whole thing or in chunks (e.g. #530).
>
> Thank you!

Sorry I'm late. I'm trying to understand the context of such a cache within past discussions and future requirements.

> I tried a Chromium build on Linux with a 50 GB cache size, and the actual cache size ended up at 26 GB. Which, by the way, implies that the RBE download metrics displayed at the end of the build are way off (it usually says RBE Stats: down 5.78 GB).

Ah, interesting. Any chance you've been able to spot the cause of this mismatch?

ola-rozenfeld force-pushed the ola-disk-cache branch 8 times, most recently from 3cc60db to 7a31b23 on May 19, 2024 at 20:36
ola-rozenfeld force-pushed the ola-disk-cache branch 2 times, most recently from 2b812c8 to 9eb7483 on May 20, 2024 at 23:10
}
it := &qitem{
	key: k,
	lat: fileInfoToAccessTime(info),

Why use the last access time instead of the last modified time?
I don't believe that using the file system to track this information is a good idea. My implementations saved this as a dumb JSON file, and that was very simple; I don't care about the actual timestamp, just the ordering, so the file is a flat list of files in LRU eviction order. This also removes the need for the priorityQueue type.
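
A rough sketch of that flat-file alternative, just to make the idea concrete (all names below are hypothetical):

import (
	"encoding/json"
	"os"
)

// lruState is persisted as one JSON file in the cache root. Entries are kept
// least-recently-used first, so eviction walks the slice from the front and
// no priority queue or filesystem timestamp is needed.
type lruState struct {
	Entries []string `json:"entries"` // cache file names, relative to the cache root
}

func loadLRU(path string) (*lruState, error) {
	b, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return &lruState{}, nil
	}
	if err != nil {
		return nil, err
	}
	s := &lruState{}
	if err := json.Unmarshal(b, s); err != nil {
		return nil, err
	}
	return s, nil
}

func (s *lruState) save(path string) error {
	b, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0644)
}

// touch moves name to the most-recently-used end of the list, appending it if new.
func (s *lruState) touch(name string) {
	for i, e := range s.Entries {
		if e == name {
			s.Entries = append(s.Entries[:i], s.Entries[i+1:]...)
			break
		}
	}
	s.Entries = append(s.Entries, name)
}

In practice one would keep an in-memory index alongside the slice so touch isn't a linear scan; the point is only that the on-disk state is a single ordered list rather than per-file timestamps.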

d.mu.Lock()
it := heap.Pop(d.queue).(*qitem)
d.mu.Unlock()
size, err := d.getItemSize(it.key)

I don't understand why this does a stat on the file when the size is already in the digest.

diskOpsStart := time.Now()
// We only delete the files, and not the prefix directories, because the prefixes are not worth worrying about.
if err := os.Remove(d.getPath(it.key)); err != nil {
	log.Errorf("Error removing file: %v", err)

It'd be good if the errors were accumulated, so that Shutdown() could be renamed to Close() and return the accumulated errors.
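
A small sketch of what that accumulation could look like using errors.Join (Go 1.20+); the errMu and errs fields, recordErr, and shutdownGC below are hypothetical names, not from the PR:

// recordErr accumulates non-fatal errors from the GC goroutine instead of only logging them.
// It assumes DiskCache has errMu sync.Mutex and errs []error fields.
func (d *DiskCache) recordErr(err error) {
	d.errMu.Lock()
	d.errs = append(d.errs, err)
	d.errMu.Unlock()
}

// Close stops the cache (what Shutdown does today) and returns everything that went wrong.
func (d *DiskCache) Close() error {
	d.shutdownGC() // hypothetical: stop the GC goroutine and wait for it to exit
	d.errMu.Lock()
	defer d.errMu.Unlock()
	return errors.Join(d.errs...)
}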

return key{digest: dg, isCas: !isAc}, nil
}

func (d *DiskCache) getPath(k key) string {

Optional nit: I would have preferred files to have a uniform file name, "%s-%d" % (hash, size), instead of having the size as an extension. Then you could use ".cas" as the marker for CAS files. IMHO it would be a bit cleaner.

if q.n == cap(q.items) {
	// Resize the queue.
	old := q.items
	q.items = make([]*qitem, 2*cap(old)) // Initial capacity needs to be > 0.

newCap := 2 * cap(old)
if newCap == 0 {
	newCap = 256
}
q.items = make([]*qitem, newCap)

then you can remove the comment.


type key struct {
	digest digest.Digest
	isCas  bool

Is isCas meant for a performance optimization later? I don't see any behavior change based on this member.

ola-rozenfeld (Contributor, Author)

Thank you, Marc-Antoine -- I'm still tinkering with the whole thing; I just added stats and benchmark tests (not pushed yet), and I'm trying out different variations to see what's fastest. I hope I can get it all done by Monday.

ola-rozenfeld force-pushed the ola-disk-cache branch 2 times, most recently from 9888072 to 8094227 on May 27, 2024 at 16:33
@@ -129,6 +140,11 @@ func (c *Client) DownloadOutputs(ctx context.Context, outs map[string]*TreeOutpu
	if err := cache.Update(absPath, md); err != nil {
		return fullStats, err
	}
	if c.diskCache != nil {
		if err := c.diskCache.StoreCas(output.Digest, absPath); err != nil {

Do you think it could be done as part of DownloadFiles()? Otherwise this adds serial latency.
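
For illustration, one way to keep the disk-cache writes off the critical path wherever the call ends up living: fan them out and wait once, e.g. with errgroup (a sketch against hypothetical plumbing, not the SDK's actual download code):

import (
	"golang.org/x/sync/errgroup"

	"github.com/bazelbuild/remote-apis-sdks/go/pkg/digest"
)

// casStore is the part of the disk cache needed here; StoreCas matches the
// signature used in the diff above.
type casStore interface {
	StoreCas(dg digest.Digest, path string) error
}

// storeOutputs writes downloaded outputs into the local disk cache in parallel,
// instead of one blocking StoreCas call per file on the download path.
func storeOutputs(dc casStore, outputs map[string]digest.Digest) error {
	g := &errgroup.Group{}
	for absPath, dg := range outputs {
		absPath, dg := absPath, dg // capture loop variables (pre-Go 1.22)
		g.Go(func() error { return dc.StoreCas(dg, absPath) })
	}
	return g.Wait()
}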
