TLS handshake timeout when listing bazel versions from GCS #1627

Open
rickeylev opened this issue May 4, 2023 · 3 comments
@rickeylev
Contributor

I've been regularly seeing an error that looks like some issue with the CI scripts talking to GCS.

The error is a bit confusing: it looks like a problem uploading the test logs ("cannot find the file specified"), but it also looks like a problem "listing Bazel versions in GCS" (whatever that means).

Pressing retry on Buildkite almost always fixes this, so it's some sort of flake.

Agent: bk-windows-bt5g

run: https://buildkite.com/bazel/rules-python-python/builds/4779#_

bazel --output_user_root=C:/b test --flaky_test_attempts=3 --build_tests_only --local_test_jobs=8 --show_progress_rate_limit=5 --curses=yes --color=yes --terminal_columns=143 --show_timestamps --verbose_failures --jobs=30 --announce_rc --experimental_repository_cache_hardlinks --disk_cache= --experimental_build_event_json_file_path_conversion=false --build_event_json_file=C:\temp\tmpikfc45yc\test_bep.json --google_default_credentials --remote_cache=remotebuildexecution.googleapis.com --remote_instance_name=projects/bazel-untrusted/instances/default_instance --remote_timeout=60 --remote_max_connections=200 --remote_default_platform_properties=properties:{name:"cache-silo-key" value:"6a21cacbec775043b8cb5b49849575502cf8f7a8f5d7f28ce34e6c5d2982f753"} --remote_download_toplevel --test_env=LocalAppData --test_env=BAZELISK_USER_AGENT -- ...
--
  | C:\temp\tmpikfc45yc\bazelci-agent.exe artifact upload --delay=5 --mode=buildkite --build_event_json_file=C:\temp\tmpikfc45yc\test_bep.json
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | Error: The system cannot find the file specified. (os error 2)
  | Exception in thread Thread-1:
  | Traceback (most recent call last):
  | File "C:\python3\lib\threading.py", line 973, in _bootstrap_inner
  | self.run()
  | File "C:\python3\lib\threading.py", line 910, in run
  | self._target(*self._args, **self._kwargs)
  | File "c:\b\bk-windows-bt5g\bazel\rules-python-python\bazelci.py", line 2424, in upload_test_logs_from_bep
  | execute_command(
  | File "c:\b\bk-windows-bt5g\bazel\rules-python-python\bazelci.py", line 2474, in execute_command
  | return subprocess.run(
  | File "C:\python3\lib\subprocess.py", line 528, in run
  | raise CalledProcessError(retcode, process.args,
  | subprocess.CalledProcessError: Command '['C:\\temp\\tmpikfc45yc\\bazelci-agent.exe', 'artifact', 'upload', '--delay=5', '--mode=buildkite', '--build_event_json_file=C:\\temp\\tmpikfc45yc\\test_bep.json']' returned non-zero exit status 1.
  | 2023/05/04 17:16:04 could not resolve the version 'latest' to an actual version number: unable to determine latest version: could not list Bazel versions in GCS bucket: could not list GCS objects at https://www.googleapis.com/storage/v1/b/bazel/o?delimiter=/: could not fetch https://www.googleapis.com/storage/v1/b/bazel/o?delimiter=/: Get "https://www.googleapis.com/storage/v1/b/bazel/o?delimiter=/": net/http: TLS handshake timeout
  | bazel test failed with exit code 1
@fweikert
Member

fweikert commented May 4, 2023

The underlying "could not resolve the version" error comes from Bazelisk. I'm surprised that bazelci-agent.exe fails, too.

@meteorcloudy
Member

> "could not resolve the version" issue is from Bazelisk

Can we retry in Bazelisk for such errors?
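
For illustration, here is a rough sketch of what retrying such transient errors could look like. It is written in Python for brevity (Bazelisk itself is written in Go), and the name `fetch_with_retry`, the attempt count, and the delays are all assumptions made for this example, not Bazelisk's actual behavior.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=4, base_delay=1.0, timeout=30,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying transient network errors with exponential backoff.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code < 500:
                raise  # client errors such as 404 will not improve on retry
            last_error = err
        except OSError as err:
            # URLError, socket timeouts, and TLS errors all derive from OSError.
            last_error = err
        if attempt < attempts - 1:
            # Wait 1s, 2s, 4s, ... between attempts (with the default base_delay).
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

With a wrapper like this, a single TLS handshake timeout while listing versions would cost a short delay instead of failing the whole job.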

@rickeylev
Contributor Author

It looks like a variation of this same problem occurs when Bazelisk downloads Bazel, too: https://buildkite.com/bazel/rules-python-python/builds/5493#018a1f23-af2d-4943-b2db-3013e7c3391f

Using Bazel version | 26m 56s
-- | --
  |  
  |  
  | bazel info output_base
  | 2023/08/22 21:26:02 Downloading https://releases.bazel.build/6.3.2/release/bazel-6.3.2-linux-x86_64...
  | 2023/08/22 21:52:58 could not download Bazel: could not copy from https://releases.bazel.build/6.3.2/release/bazel-6.3.2-linux-x86_64 to /var/lib/buildkite-agent/.cache/bazelisk/downloads/bazelbuild/bazel-6.3.2-linux-x86_64/bin/download450623830: stream error: stream ID 1; INTERNAL_ERROR

The log indicates that step took 26 minutes to run. Quite the grace period! That might be good (because it's doing retries and download resumption), or it might be bad (because it's trying just once and simply timing out after 26m).

I wonder if it's possible to pre-populate the Bazelisk cache? I don't know how these VMs (or whatever they are) are set up, but if they had the Bazelisk cache pre-populated with the commonly used Bazel versions, then no download would be necessary, largely avoiding the issue (at the cost of potentially slower VM setup, I guess?).
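
As a sketch of that idea, something like the following could run while the VM image is built. It assumes `bazelisk` is on `PATH`, that the cache directory is kept in the image, and that running Bazelisk once with `USE_BAZEL_VERSION` set is enough to download that version. The helper name `prewarm_bazelisk_cache` and the version list are made up for the example.

```python
import os
import subprocess


def prewarm_bazelisk_cache(versions, bazelisk="bazelisk", run=subprocess.run):
    """Run Bazelisk once per version so each Bazel binary is downloaded.

    Hypothetical helper. Returns the versions whose invocation failed,
    so the image build can decide whether to retry or abort.
    """
    failed = []
    for version in versions:
        # USE_BAZEL_VERSION tells Bazelisk which Bazel release to fetch and run.
        env = dict(os.environ, USE_BAZEL_VERSION=version)
        result = run([bazelisk, "version"], env=env, check=False)
        if result.returncode != 0:
            failed.append(version)
    return failed
```

Calling `prewarm_bazelisk_cache(["6.3.2"])` during image setup would then move the download (and any network flakiness) out of the CI job itself.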

FWIW, these sorts of network issues aren't too uncommon. Internally, we'd regularly see Chocolatey installs fail because of all sorts of network issues.
