Build fails on Bazel 7.0 when remote_download_toplevel flag is enabled #730

Open
sanju-naik opened this issue Jan 31, 2024 · 8 comments

@sanju-naik

After upgrading to Bazel 7.0.0 and enabling the remote_download_toplevel flag, our builds fail intermittently while downloading cached artifacts from the remote cache.

The two errors we get are:

Exec failed due to IOException: Connection reset
Exec failed due to IOException: null

There are no other details in the log. Other things we noticed:

  • This happens when the artifacts are 100% cached, i.e. everything is downloaded from the cache.
  • When the job fails, the module shown as downloading at the end of the logs is always the same. Could the failure be related to that module?

@mostynb
Collaborator

mostynb commented Feb 1, 2024

Are there any relevant errors or warnings in the bazel-remote log when this occurs?

@sanju-naik
Author

When one of our jobs failed today, it produced the error log below. Does this help debug the issue?

---8<---8<--- Exception details ---8<---8<---
java.io.IOException: Failed to read @-argument 'bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params' from file '/private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params'.
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:315)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.createWorkRequest(WorkerSpawnRunner.java:246)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.execInWorker(WorkerSpawnRunner.java:416)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.exec(WorkerSpawnRunner.java:206)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:119)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)
	at com.google.devtools.build.lib.analysis.actions.SpawnAction.execute(SpawnAction.java:261)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.executeAction(SkyframeActionExecutor.java:1148)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1065)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:165)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:94)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:562)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:859)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.computeInternal(ActionExecutionFunction.java:333)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:171)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:461)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:414)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.io.FileNotFoundException: /private/var/tmp/_bazel_runner/55c1db80066b6bd30a81b2a1c9b5244e/execroot/__main__/bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-ee6c0995fb68/bin/<Module>/<Target>.swiftmodule-0.params (No such file or directory)
	at java.base/java.io.FileInputStream.open0(Native Method)
	at java.base/java.io.FileInputStream.open(Unknown Source)
	at java.base/java.io.FileInputStream.<init>(Unknown Source)
	at com.google.devtools.build.lib.unix.UnixFileSystem.createFileInputStream(UnixFileSystem.java:497)
	at com.google.devtools.build.lib.vfs.AbstractFileSystem.createMaybeProfiledInputStream(AbstractFileSystem.java:90)
	at com.google.devtools.build.lib.vfs.AbstractFileSystem.getInputStream(AbstractFileSystem.java:59)
	at com.google.devtools.build.lib.vfs.Path.getInputStream(Path.java:765)
	at com.google.devtools.build.lib.vfs.FileSystemUtils$1.openStream(FileSystemUtils.java:354)
	at com.google.common.io.ByteSource$AsCharSource.openStream(ByteSource.java:474)
	at com.google.common.io.CharSource.openBufferedStream(CharSource.java:126)
	at com.google.common.io.CharSource.readLines(CharSource.java:336)
	at com.google.devtools.build.lib.vfs.FileSystemUtils.readLines(FileSystemUtils.java:834)
	at com.google.devtools.build.lib.worker.WorkerSpawnRunner.expandArgument(WorkerSpawnRunner.java:310)
	... 23 more
---8<---8<--- End of exception details ---8<---8<---

@mostynb
Collaborator

mostynb commented Feb 8, 2024

I don't know Bazel internals, but this stack trace suggests the failure occurs when Bazel tries to execute the action on the client side. Have you tried reporting this error to the Bazel project?

@mostynb
Collaborator

mostynb commented Feb 8, 2024

Also, I think the bazel-remote logs would be important to check here: are there any warnings or errors there?

@sanju-naik
Author

> Also, I think the bazel-remote logs would be important to check here: are there any warnings or errors there?

We see these failures on our scheduled pipelines, and most of the time the jobs fail during the night. By the next day I have a hard time collecting logs from bazel-remote: it logs every event to the log file, so by the time I check there are too many entries and I can't tell which ones belong to the failed jobs.

Is there a quick way to get logs associated with a particular job?

@sanju-naik
Author

Also, we are still on bazel-remote version 2.3.9. Have any fixes related to Bazel 7 been added in the latest releases?

@mostynb
Collaborator

mostynb commented Feb 11, 2024

> Also, I think the bazel-remote logs would be important to check here: are there any warnings or errors there?
>
> We see these failures on our scheduled pipelines, and most of the time the jobs fail during the night. By the next day I have a hard time collecting logs from bazel-remote: it logs every event to the log file, so by the time I check there are too many entries and I can't tell which ones belong to the failed jobs.
>
> Is there a quick way to get logs associated with a particular job?

I think it depends a bit on the logging options that you are using. If you have timestamps enabled, you can jump to a time just before the error and scan from there. Alternatively, if you have access logs enabled, you might be able to search for a blob or ActionResult hash from the error (if you have something like that in the Bazel logs). Or maybe you could just grep the bazel-remote logs for "error" or "warning" (ignoring case) and see if there's anything interesting.
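
The timestamp and keyword suggestions above can be combined into one pass over the log. The following is only a rough sketch: it assumes each log line starts with a timestamp of the form `YYYY/MM/DD HH:MM:SS`, which may not match your logging configuration, and the names `parse_ts` and `suspicious_lines` are made up for illustration.

```python
import re
from datetime import datetime

# ASSUMPTION: every log line starts with "YYYY/MM/DD HH:MM:SS".
# Adjust TS_FORMAT and TS_LEN if your log lines look different.
TS_FORMAT = "%Y/%m/%d %H:%M:%S"
TS_LEN = 19

# Case-insensitive match for the two keywords suggested above.
KEYWORDS = re.compile(r"error|warning", re.IGNORECASE)


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LEN], TS_FORMAT)
    except ValueError:
        return None


def suspicious_lines(lines, start, end):
    """Keep lines inside [start, end] that mention "error" or "warning"."""
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None or not (start <= ts <= end):
            continue  # no timestamp, or outside the window of interest
        if KEYWORDS.search(line):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open the bazel-remote log file, pass the file object as `lines`, and set `start` and `end` to a few minutes either side of the failure time reported by the CI job.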

> Also we are still on version 2.3.9. Have we added any fixes related to Bazel 7 in the latest releases?

The releases page has a high-level changelog: https://github.com/buchgr/bazel-remote/releases - but I don't think there are any changes specifically related to bazel 7.

@liam-baker-sm

We currently run many Bazel 7.0.0 remote_download_toplevel builds each day against a bazel-remote cache without problems.
"IOException: Connection reset" suggests the connection was dropped.
Do you use HTTP(S) or gRPC(S) for the cache URL in Bazel?
Is there a proxy between your Bazel clients and the bazel-remote server (even on the same machine)?
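
One quick way to answer the protocol question is to look at the scheme of the `--remote_cache` value in your `.bazelrc`. Below is a minimal sketch, assuming the option is written in the `--remote_cache=URL` form inside the file; it does not follow imported rc files or options passed on the command line. The helper names and the host `cache.example.com` in the comments are hypothetical.

```python
import re

# Matches a --remote_cache=VALUE option anywhere on a line.
REMOTE_CACHE_RE = re.compile(r"--remote_cache=(\S+)")


def cache_protocol(url):
    """Classify a cache URL by its scheme: HTTP, gRPC, other, or unspecified."""
    scheme, sep, _ = url.partition("://")
    if not sep:
        return "unspecified"  # no explicit scheme in the value
    scheme = scheme.lower()
    if scheme in ("http", "https"):
        return "HTTP"
    if scheme in ("grpc", "grpcs"):
        return "gRPC"
    return "other"


def remote_cache_settings(bazelrc_text):
    """Return (url, protocol) for each --remote_cache option in the text."""
    results = []
    for line in bazelrc_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        match = REMOTE_CACHE_RE.search(stripped)
        if match:
            url = match.group(1)  # e.g. grpc://cache.example.com:9092
            results.append((url, cache_protocol(url)))
    return results
```

Calling `remote_cache_settings` on the contents of the `.bazelrc` used by the failing job shows which protocol that job talks to the cache with, which is the first thing to compare against a setup that works.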
