
Bazel-remote sometimes reports files as missing, although they are present in s3 cache #649

Open
theospears opened this issue Mar 10, 2023 · 1 comment

Comments

@theospears
Summary

We have run into cases where Bazel, used in conjunction with bazel-remote, reports that files are missing from the remote cache, even though we can see they were present in the S3 bucket bazel-remote is configured to use at the time of the build. We are not certain of the cause, but suspect it happens when the local bazel-remote disk cache is full and cannot free sufficient space through garbage collection because of reservations.

It would be great if:

  1. These failures were reported as a server error rather than a file-not-found, to reduce confusion.
  2. bazel-remote printed more debugging information to its logs in these situations, so we could be confident in the underlying cause.

Details of what we observed

We had a recent bazel build fail with the following error:

ERROR: [...]/BUILD:54:13: scala [...] failed: Exec failed due to IOException: 59 errors during bulk transfer:
java.io.IOException: Failed to fetch file with hash '1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609' because it does not exist remotely. --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.
java.io.IOException: Failed to fetch file with hash '6da9ed4d305424f7a35c4d0492307e287098a31ac44bfb4625d3691f706afde9' because it does not exist remotely. --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.
... many more missing files ...

The S3 console confirmed that 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 existed in the CAS, and had for several days.

The bazel-remote logs we keep showed:

> grep 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609  bazel-remote.log
2023/03/08 21:14:42 S3 CONTAINS asana-sandbox-testville-bazel-cache-us-west-2 cas.v2/1a/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 21:14:42 GRPC CAS HEAD 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 21:50:23 GRPC BYTESTREAM READ BLOB NOT FOUND: 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609
2023/03/08 22:07:14 S3 CONTAINS asana-sandbox-testville-bazel-cache-us-west-2 cas.v2/1a/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 22:07:14 GRPC CAS HEAD 1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 22:07:14 S3 DOWNLOAD asana-sandbox-testville-bazel-cache-us-west-2 cas.v2/1a/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609 OK
2023/03/08 22:07:14 GRPC BYTESTREAM READ COMPLETED blobs/1af370e58497b6fe4330f331fc144826b89c3d18a836a88de17d990752f36609/1166808

We observed:

  1. A BYTESTREAM READ BLOB NOT FOUND error.
  2. bazel-remote was later able to download the same file.

Further investigation showed many other instances of BYTESTREAM READ BLOB NOT FOUND for different blobs at the same time.

Hypothesis

The relevant blob not found error comes from here:

if rc == nil {
	msg := fmt.Sprintf("GRPC BYTESTREAM READ BLOB NOT FOUND: %s", hash)

It looks like this happens when the file is not found but no other error was encountered. Based on the logs, the request does not appear to get as far as checking S3 for the file. We suspect, but cannot verify, that it is hitting the disk space check here:

if sumLargerThan(size, c.reservedSize, c.maxSize) {
	// If size + c.reservedSize is larger than c.maxSize
	// then we cannot evict enough items to make enough
	// space.
	return false, nil
}

@mostynb
Collaborator

mostynb commented Mar 11, 2023

Thanks for the bug report; I think your diagnosis is correct.

This PR makes bazel-remote log something in this situation: #650
However, you might need to increase your cache size to avoid this problem with the amount of load you're placing on the server.

mostynb added a commit to mostynb/bazel-remote that referenced this issue Mar 11, 2023