Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional slow startup times during Booting builder step #283

Open
3 tasks done
niodice opened this issue Oct 25, 2023 · 8 comments
Open
3 tasks done

Occasional slow startup times during Booting builder step #283

niodice opened this issue Oct 25, 2023 · 8 comments

Comments

@niodice
Copy link

niodice commented Oct 25, 2023

Contributing guidelines

I've found a bug, and:

  • The documentation does not mention anything about my problem
  • There are no open or closed issues that are related to my problem

Description

I see that, about 25% of the time, that the Booting builder step can take quite some time to execute. I am running on Amazon EKS and we are using self hosted runners deployed using https://github.com/actions/actions-runner-controller. Before my runner becomes available to pick up a workflow, I see that the docker daemon has started successfully:

time="2023-10-25T21:08:30.983564648Z" level=info msg="Daemon has completed initialization"
time="2023-10-25T21:08:31.036285627Z" level=info msg="API listen on /var/run/docker.sock"

Expected behaviour

Consistent startup time

Actual behaviour

Startup time is very slow at time.

Repository URL

No response

Workflow run URL

No response

YAML workflow

The relevant parts of the workflow are:


jobs:
  publish-docker:
    name: publish docker image
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: setup docker buildx
        uses: docker/setup-buildx-action@v3

Workflow logs

Wed, 25 Oct 2023 21:08:49 GMT
  /usr/local/bin/docker buildx inspect --bootstrap --builder builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb9
Wed, 25 Oct 2023 21:08:49 GMT
  #1 [internal] booting buildkit
Wed, 25 Oct 2023 21:08:49 GMT
  #1 pulling image moby/buildkit:buildx-stable-1
Wed, 25 Oct 2023 21:08:57 GMT
  #1 pulling image moby/buildkit:buildx-stable-1 7.5s done
Wed, 25 Oct 2023 21:08:57 GMT
  #1 creating container buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90
Wed, 25 Oct 2023 21:11:12 GMT
  #1 143.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
Wed, 25 Oct 2023 21:11:12 GMT
  #1 creating container buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90 135.5s done
Wed, 25 Oct 2023 21:11:12 GMT
  #1 ERROR: exit code 1
Wed, 25 Oct 2023 21:11:12 GMT
  ------
Wed, 25 Oct 2023 21:11:12 GMT
   > [internal] booting buildkit:
Wed, 25 Oct 2023 21:11:12 GMT
  143.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
Wed, 25 Oct 2023 21:11:12 GMT
  ------
Wed, 25 Oct 2023 21:11:12 GMT
  Name:          builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb9
Wed, 25 Oct 2023 21:11:12 GMT
  Driver:        docker-container
Wed, 25 Oct 2023 21:11:12 GMT
  Last Activity: 2023-10-25 21:08:49 +0000 UTC
Wed, 25 Oct 2023 21:11:12 GMT
  
Wed, 25 Oct 2023 21:11:12 GMT
  Nodes:
Wed, 25 Oct 2023 21:11:12 GMT
  Name:     builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90
Wed, 25 Oct 2023 21:11:12 GMT
  Endpoint: unix:///var/run/docker.sock
Wed, 25 Oct 2023 21:11:12 GMT
  Error:    Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90/json": context deadline exceeded

The delay seems to be in the step that produces this log line:

  Error:    Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/buildx_buildkit_builder-9ff3bc3c-4aa4-43d7-b3b0-9566551cacb90/json": context deadline exceeded

BuildKit logs

No response

Additional info

No response

@deanb-everc
Copy link

bump, also experiencing this, and this fails my workflows..

@zpiazza-combocurve
Copy link

Same here

@ChrisPage-AT
Copy link

I'm also seeing this but it just fails the workflows instead of being slow. It doesn't happen with every build but it does force us to rebuild when it fails. My logs are exactly the same as above, and show up under the "Booting builder" dropdown of the docker/setup-buildx-action@v3 action.

@niodice
Copy link
Author

niodice commented Jan 26, 2024

@crazy-max I wonder if you might have some expertise in how docker and/or the buildx plugin interact, and what we might look for in terms of debugging the slowness.

@ivadimko
Copy link

bump, we also fase issues with docker/setup-buildx-action@v3

affair on v2 it worked great so we had to rollback :(

@LucasAgin
Copy link

LucasAgin commented Feb 16, 2024

Also seeing this failing on our self-hosted runners, this is after running for ~6-7 minutes. Also using action runner controller on AWS EKS.

It is periodic, some builds eventually work, some do not.

Run docker/setup-buildx-action@v3
Docker info
Buildx version
Creating a new builder instance
Booting builder
  /usr/bin/docker buildx inspect --bootstrap --builder builder-e9dc3275-XXXX-XXXX-95e9-2c1fcb2a5c6a
  #1 [internal] booting buildkit
  #1 pulling image moby/buildkit:buildx-stable-1
  #1 pulling image moby/buildkit:buildx-stable-1 2.3s done
  #1 creating container buildx_buildkit_builder-e9dc3275-XXXX-XXXX-95e9-2c1fcb2a5c6a0
  #1 467.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
  #1 creating container buildx_buildkit_builder-e9dc3275-XXXX-XXXX-95e9-2c1fcb2a5c6a0 464.7s done
  #1 ERROR: exit code 1
  ------
   > [internal] booting buildkit:
  467.0 error: failed to list workers: Unavailable: connection error: desc = "transport: error while dialing: dial unix /run/buildkit/buildkitd.sock: connect: no such file or directory"
  ------
  ERROR: exit code 1
Error: The process '/usr/bin/docker' failed with exit code 1

@devonjones
Copy link

We had this problem, and we found the solution: you must set limits and requests for resources for the arc runners. There's no default requests/limits, which means that kube is free to schedule pods anywhere. This is why the error is happening at random. If a build comes in that uses lots of containers at once, kube is probably killing containers to maintain health. You have to have requests and limits configured to get kube to do intelligent scheduling.

The bug here is that arc should set a sensible default, or it should error. In their helm chart they comment that they don't set a default because the user should be setting it. That's a reasonable impulse, but instead of having no requests/limits, the helm chart should fail if the user isn't setting them.

https://github.com/actions/actions-runner-controller/blob/master/charts/actions-runner-controller/values.yaml#L127

@gete76
Copy link

gete76 commented Apr 8, 2024

@devonjones, those errors do pop up when a node's drive is slowed down by runner overload, but I don't think that's the root cause in this issue since this does seem to be similar to the issue that just mentioned this ticket ^. I've also seen this error on our github hosted runners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants