runc run failed: stat: no such file or directory with warm BuildKit cache after 23.0 upgrade #44943

Closed
zeyugao opened this issue Feb 7, 2023 · 15 comments
Labels
area/builder/buildkit: Issues affecting buildkit
area/builder
kind/bug: Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed.
priority/P1: Important: P1 issues are a top priority and a must-have for the next release.
status/confirmed
version/23.0

@zeyugao

zeyugao commented Feb 7, 2023

Description

I have some images that were built with BUILDKIT_INLINE_CACHE=1. After modifying the Dockerfile and rebuilding, the build fails:

docker build \
--tag gcr.io/fuzzbench/base-image \
--build-arg BUILDKIT_INLINE_CACHE=1 \
--file docker/base-image/Dockerfile \
.
[+] Building 0.7s (7/13)

 => [internal] load .dockerignore                                                                                    0.2s
 => => transferring context: 118B                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                                 0.3s
 => => transferring dockerfile: 2.54kB                                                                               0.0s
 => [internal] load metadata for docker.io/library/ubuntu:focal                                                      0.0s
 => [1/9] FROM docker.io/library/ubuntu:focal                                                                        0.0s
 => [internal] load build context                                                                                    0.2s
 => => transferring context: 38B                                                                                     0.0s
 => CACHED [2/9] RUN apt-get update     && apt-get install -y ca-certificates     && sed -i "s@http://.*archive.ubu  0.0s
 => ERROR [3/9] RUN ls                                                                                               0.4s
------

 > [3/9] RUN ls:
#0 0.341 runc run failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory

Images without BUILDKIT_INLINE_CACHE=1 seem fine.

This does not seem to be related to #44918.

Reproduce

Build some images with BUILDKIT_INLINE_CACHE=1 on 20.10, then upgrade to 23.0.0. Modify the Dockerfile and rebuild.

Expected behavior

The build should succeed, as it did before 23.0.0.

docker version

Client: Docker Engine - Community
 Version:           23.0.0
 API version:       1.42
 Go version:        go1.19.5
 Git commit:        e92dd87
 Built:             Wed Feb  1 17:49:08 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          23.0.0
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.5
  Git commit:       d7573ab
  Built:            Wed Feb  1 17:49:08 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.16
  GitCommit:        31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.15.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 56
  Running: 1
  Paused: 0
  Stopped: 55
 Images: 723
 Server Version: 23.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.13.0-37-generic
 Operating System: Ubuntu 20.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 96
 Total Memory: 502.6GiB
 Name: cocoa
 ID: 6AXB:OEBV:VSR6:AYW3:N23Z:4LFF:4BRU:JUKX:4KOH:ROLW:OEQG:CP54
 Docker Root Dir: /mnt/data/docker
 Debug Mode: false
 HTTP Proxy: http://xxx
 HTTPS Proxy: http://xxx
 No Proxy: localhost,127.0.0.1
 Username: xxx
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

No response

@zeyugao added the kind/bug and status/0-triage labels Feb 7, 2023
@neersighted
Member

neersighted commented Feb 7, 2023

I am able to reproduce this with the following method:

  • Set DOCKER_BUILDKIT=1
  • Build with docker build -t test . on 20.10.23
  • Upgrade to 23.0.0
  • Modify Dockerfile
  • Build with docker build -t test . on 23.0.0

The failure looks the same:

#0 0.314 runc run failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory
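
The steps above can be sketched as a small script. This is only an illustration: the Dockerfile contents and the function names `warm_cache` and `rebuild_after_upgrade` are hypothetical (they are not taken from the linked gist), the engine upgrade between the two phases has to be done by hand, and a particular Dockerfile may or may not trigger the failure.

```shell
#!/bin/sh
# Hypothetical reproduction sketch; run it in an empty scratch directory.
# BuildKit is not the default builder on 20.10, so enable it explicitly.
export DOCKER_BUILDKIT=1

# Phase 1: run on 20.10.23 to populate (warm) the build cache.
warm_cache() {
    # Assumed minimal Dockerfile; any image with a few RUN steps should do.
    printf 'FROM ubuntu:focal\nRUN echo first\n' > Dockerfile
    docker build -t test .
}

# Phase 2: run after upgrading the engine to 23.0.0.
rebuild_after_upgrade() {
    # Append a step so the build cannot be served entirely from the cache.
    printf 'RUN ls\n' >> Dockerfile
    docker build -t test .
}
```

Call `warm_cache` before the upgrade and `rebuild_after_upgrade` afterwards; the second build is the one expected to fail with the `/bin/sh` error.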

@neersighted added the priority/P1 label and removed the status/0-triage label Feb 7, 2023
@neersighted added this to the 23.0.1 milestone Feb 7, 2023
@thaJeztah
Member

/cc @tonistiigi @crazy-max

@neersighted
Member

I've provided a complete reproduction at https://gist.github.com/neersighted/8bccac67a66ad2163c3e6ff02d412f29 to remove ambiguity for those doing the RCA. It seems plausible to me that only some base images result in this behavior, as I have had trouble getting it to reproduce on alpine:latest or alpine:edge.

@neersighted changed the title from "runc failed with when using inline cache from 20.10 in 23.0.0" to "runc run failed: stat: no such file or directory with warm BuildKit cache after 23.0 upgrade" Feb 7, 2023
@neersighted
Member

It's worth noting that it appears to me that BUILDKIT_INLINE_CACHE=1 is not necessary here -- instead, it looks like only some base images trigger this behavior.

@neersighted
Member

Some more updates: I don't think this depends on the base image, as postulated in #44947; instead it appears to depend on whether the build cache is hot or cold.

This can be mitigated by dropping the cache: docker builder prune -a. A regular docker builder prune does not suffice.
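
As a rough sketch of that mitigation, assuming the docker CLI is on PATH (the helper name `purge_build_cache` is made up for illustration and is not part of Docker):

```shell
#!/bin/sh
# Hypothetical helper wrapping the mitigation; not an official tool.
purge_build_cache() {
    # -a (--all) removes all unused build cache, not only dangling entries.
    # -f (--force) skips the confirmation prompt.
    docker builder prune -a -f
}
```

Running `purge_build_cache` once after the upgrade and then rebuilding should start the next build from a cold cache.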

@TyIsI

TyIsI commented Feb 8, 2023

After updating to 5:23.0.0-1~ubuntu.22.04~jammy earlier today, it looks like I'm running into this bug with another BuildKit/buildx build (maven:3-openjdk-11 image).

ii  docker-buildx-plugin                  0.10.2-1~ubuntu.22.04~jammy                  amd64        Docker Buildx cli plugin.
ii  docker-ce                             5:23.0.0-1~ubuntu.22.04~jammy                amd64        Docker: the open-source application container engine
ii  docker-ce-cli                         5:23.0.0-1~ubuntu.22.04~jammy                amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras             5:23.0.0-1~ubuntu.22.04~jammy                amd64        Rootless support for Docker.
ii  docker-compose-plugin                 2.15.1-1~ubuntu.22.04~jammy                  amd64        Docker Compose (V2) plugin for the Docker CLI.
ii  docker-scan-plugin                    0.23.0~ubuntu-jammy                          amd64        Docker scan cli plugin.

Disabling BuildKit fixed it.

docker builder prune -a seems to hang. (strace is showing a bunch of futex timeouts.)

root@buildhost:~# date ; docker builder prune -a -f
Wed Feb  8 01:16:11 AM UTC 2023
^CERROR: rpc error: code = Canceled desc = context canceled
root@buildhost:~# date
Wed Feb  8 01:29:02 AM UTC 2023

Do you want me to create a new issue for this or add to docker/buildx#1595?

@neersighted
Member

It's hard to tell what you are seeing -- can you please post the complete output so we can determine if this is a novel issue or the same one?

@TyIsI

TyIsI commented Feb 8, 2023

Dockerfile

FROM maven:3-openjdk-11 AS build

ARG SENTRY_RELEASE

WORKDIR /usr/src/app

COPY pom.xml /usr/src/app

RUN --mount=type=cache,target=/root/.m2 mvn dependency:go-offline -B

COPY src /usr/src/app/src

COPY tools/generate-sentry-build-properties.sh /usr/src/app/

RUN /usr/src/app/generate-sentry-build-properties.sh

RUN --mount=type=cache,target=/root/.m2 mvn package

FROM tomcat:9-jdk11-openjdk

EXPOSE 8080

COPY --from=build /usr/src/app/target/webapp.war /usr/local/tomcat/webapps/

.env:

COMPOSE_DOCKER_CLI_BUILD=1
DOCKER_BUILDKIT=1

Relevant section/changes from daemon.json:

{
        ...,
        "data-root": "/data/docker",
        "features": {
                "buildkit": true
        },
        ...
}

After having run both docker builder prune -a and docker system prune -a, I currently seem to be unable to build anything.

But I can confirm that I had the exact same issue as reported:

"/bin/sh": stat /bin/sh: no such file or directory

Just to be sure, I'm going to reboot my build machine.

But let me know if you want the strace files. (I'm available on Keybase.)

@TyIsI

TyIsI commented Feb 8, 2023

Output from docker compose:

$ docker compose build
[+] Building 3.0s (13/15)                                                                                                                                                     
 => [internal] load .dockerignore                                                                                                                                        0.4s
 => => transferring context: 167B                                                                                                                                        0.0s
 => [internal] load build definition from Dockerfile.dev                                                                                                                 0.6s
 => => transferring dockerfile: 535B                                                                                                                                     0.0s
 => [internal] load metadata for docker.io/library/tomcat:9-jdk11-openjdk                                                                                                0.6s
 => [internal] load metadata for docker.io/library/maven:3-openjdk-11                                                                                                    0.7s
 => [build 1/8] FROM docker.io/library/maven:3-openjdk-11@sha256:805f366910aea2a91ed263654d23df58bd239f218b2f9562ff51305be81fa215                                        0.0s
 => CACHED [stage-1 1/2] FROM docker.io/library/tomcat:9-jdk11-openjdk@sha256:171affbd3c2ab043eb98700f06ef63c4531c65063846891f49d924b08c523972                           0.1s
 => [internal] load build context                                                                                                                                        0.2s
 => => transferring context: 13.72kB                                                                                                                                     0.0s
 => CACHED [build 2/8] WORKDIR /usr/src/app                                                                                                                              0.0s
 => CACHED [build 3/8] COPY pom.xml /usr/src/app                                                                                                                         0.0s
 => CACHED [build 4/8] RUN --mount=type=cache,target=/root/.m2 mvn dependency:go-offline -B                                                                              0.0s
 => CACHED [build 5/8] COPY src /usr/src/app/src                                                                                                                         0.0s
 => CACHED [build 6/8] COPY tools/generate-sentry-build-properties.sh /usr/src/app/                                                                                      0.0s
 => ERROR [build 7/8] RUN /usr/src/app/generate-sentry-build-properties.sh                                                                                               1.2s
------                                                                                                                                                                        
 > [build 7/8] RUN /usr/src/app/generate-sentry-build-properties.sh:
#0 0.690 runc run failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory

Content of tools/generate-sentry-build-properties.sh:

#!/bin/bash

echo "release=${SENTRY_RELEASE}" >src/main/resources/sentry.properties

@neersighted
Member

Hmm, okay, this looks like the same issue. The odd thing is the failure to docker builder prune -a. Can you restart your daemon and see if it still hangs?

@TyIsI

TyIsI commented Feb 8, 2023

I'm not entirely sure what happened there, but I just tested again: after last night's reboot, I was able to run docker builder prune -a, and once I changed my build settings to enable BuildKit again, I was able to build my image.

Just to summarize:

  • Rebooted machine
  • Got build error (again) with buildkit enabled
  • Disabled buildkit
  • Successfully built with buildkit disabled
  • Enabled buildkit
  • Successfully ran docker builder prune -a
  • Successfully built with buildkit enabled
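
The "disabled buildkit" step in this summary can be approximated for a plain `docker build` as shown below. This is a sketch under assumptions: the helper name `build_with_legacy_builder` is hypothetical, and the thread itself toggled BuildKit through build settings (a `.env` file and `daemon.json`), not through this function.

```shell
#!/bin/sh
# Hypothetical stopgap helper; not part of Docker.
build_with_legacy_builder() {
    # DOCKER_BUILDKIT=0 asks the CLI to use the legacy builder for this
    # single invocation only; all arguments are passed through unchanged.
    DOCKER_BUILDKIT=0 docker build "$@"
}
```

For example, `build_with_legacy_builder -t test .` builds the current directory without BuildKit, which avoids the warm BuildKit cache until it has been pruned.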

@neersighted
Member

Fixed in #44959

@yster-gaurav

[ypbackend:develop 16/17] RUN chmod +x wait-for-it.sh:
#0 3.548 chmod: cannot access 'wait-for-it.sh': No such file or directory

When trying to run
docker compose -f docker-compose.yml -f docker-compose.enterprise.yml up --build -d

@neersighted
Member

This is a closed/resolved issue, and you have not provided any context for the error message. If you are reproducing something specific to compose on 23.0.1, please file an issue on the compose repo with reproduction instructions.


5 participants