
pass store is uninitialized errors causing failed builds and broken layers with DOCKER_BUILDKIT=1 builds #39440

Closed
chadlwilson opened this issue Jun 30, 2019 · 3 comments


chadlwilson commented Jun 30, 2019

Description

We use multi-stage docker builds with multiple targets and DOCKER_BUILDKIT=1 to produce faster builds and build only the layers necessary for the target.

We observe some kind of race condition between parallel image pulls at the start of the BuildKit build, with three bad outcomes observed:

  1. docker build fails entirely due to a corrupt earlier layer, when a later RUN step depends on something from an earlier layer/step
  2. docker build fails entirely with pass store is uninitialized
  3. the build completes, but the built image is corrupt in some way

In all cases, if you look at the BuildKit log you will see that one of the load metadata steps has an error:

error: "rpc error: code = Unknown desc = error getting credentials - err: exit status 1, out: `pass store is uninitialized`"

Usually the build has continued despite this error on an earlier step.

The failures are sporadic and appear to be a race condition between parallel load metadata steps. The failed step has usually taken ~500-1000ms before hitting the error above.

Steps to reproduce the issue:
It is unclear how to reproduce the issue deterministically; however, it appears to involve:

  1. (guess) Having pass installed on the Linux machine you are using, so that docker-credential-pass is in use
  2. Creating a multi-stage Dockerfile that can build/pull multiple source images in parallel at the start of the build. The more parallel load metadata steps BuildKit is able to run, the more likely a failure seems to be.
  3. Enabling BuildKit with export DOCKER_BUILDKIT=1
  4. Running docker build
  5. Repeating until it fails

Describe the results you received:
Sometimes images are just corrupt. Other times docker build will fail with an error implying something is missing from an earlier layer that is required by a later layer.

Here is an example of such a failed build (case 1 in my failure modes listed above). Note the error in step #4, which then causes a failure in #7 because a file from the JDK layer cannot be appended to.

#2 [internal] load .dockerignore
#2       digest: sha256:48d8cfb6234c77ae48eb3c6017270fcc2838a2e21f6d3d95e72f76ba0dfd6284
#2         name: "[internal] load .dockerignore"
#2      started: 2019-06-28 07:28:12.800807647 +0000 UTC
#2    completed: 2019-06-28 07:28:12.813592769 +0000 UTC
#2     duration: 12.785122ms
#2 transferring context: 124B done


#1 [internal] load build definition from Dockerfile
#1       digest: sha256:31ae22f211d51ce81ea56f37331bb1518e28b6c73cbdf740ac0cbbcf4a00e16e
#1         name: "[internal] load build definition from Dockerfile"
#1      started: 2019-06-28 07:28:12.80064242 +0000 UTC
#1    completed: 2019-06-28 07:28:12.821557951 +0000 UTC
#1     duration: 20.915531ms
#1 transferring dockerfile: 1.20kB done


#4 [internal] load metadata for docker.io/adoptopenjdk/openjdk11:jre-11.0.3...
#4       digest: sha256:37c22fdc153ffce7b0f29a7e6c22dad15face27d7f448613913ff016caeb3324
#4         name: "[internal] load metadata for docker.io/adoptopenjdk/openjdk11:jre-11.0.3_7-alpine"
#4      started: 2019-06-28 07:28:12.841592088 +0000 UTC
#4    completed: 2019-06-28 07:28:13.815817721 +0000 UTC
#4     duration: 974.225633ms
#4        error: "rpc error: code = Unknown desc = error getting credentials - err: exit status 1, out: `pass store is uninitialized`"


#3 [internal] load metadata for docker.io/library/gradle:5.4.1-jre11-slim
#3       digest: sha256:1f000afa4c9bdd451ee85c15544177561493ab2bb8cd89aac6dba378970c1540
#3         name: "[internal] load metadata for docker.io/library/gradle:5.4.1-jre11-slim"
#3      started: 2019-06-28 07:28:12.841590969 +0000 UTC
#3    completed: 2019-06-28 07:28:15.32183242 +0000 UTC
#3     duration: 2.480241451s


#11 [build 2/5] COPY --chown=gradle:gradle build.gradle settings.gradle /hom...
#11       digest: sha256:573235b83dd58c7b801fe278ccbbdf6a06eeb22298fc9247bd1edddbaecf7996
#11         name: "[build 2/5] COPY --chown=gradle:gradle build.gradle settings.gradle /home/gradle/app/"
#11      started: 2019-06-28 07:28:15.356601186 +0000 UTC
#11    completed: 2019-06-28 07:28:15.356601186 +0000 UTC
#11     duration: 0s
#11       cached: true


#12 [build 3/5] RUN gradle downloadDependencies -PisCI
#12       digest: sha256:c7ba0d7783cf2637d48c9c3344665876ebd31beb8b54a77d5075a675e38669e5
#12         name: "[build 3/5] RUN gradle downloadDependencies -PisCI"
#12      started: 2019-06-28 07:28:15.356601186 +0000 UTC
#12    completed: 2019-06-28 07:28:15.356643023 +0000 UTC
#12     duration: 41.837µs
#12       cached: true


#5 [internal] helper image for file operations
#5       digest: sha256:e4c43493b1bb2320466c52248e886014710025b273f224bb96a23b8fb7a1d193
#5         name: "[internal] helper image for file operations"
#5      started: 2019-06-28 07:28:15.323444622 +0000 UTC
#5    completed: 2019-06-28 07:28:15.323627107 +0000 UTC
#5     duration: 182.485µs
#5      started: 2019-06-28 07:28:15.356726294 +0000 UTC
#5    completed: 2019-06-28 07:28:15.356753524 +0000 UTC
#5     duration: 27.23µs
#5       cached: true


#9 [build 1/5] FROM docker.io/library/gradle:5.4.1-jre11-slim@sha256:a46bc1...
#9       digest: sha256:4ee9280ee94ecb44d531d56a448831979b6719f39d09d2971d62301f753fc207
#9         name: "[build 1/5] FROM docker.io/library/gradle:5.4.1-jre11-slim@sha256:a46bc1557bdfb12f9a521a75322f70f96704651e32726f1afd6ae702687f6d12"
#9      started: 2019-06-28 07:28:15.323495875 +0000 UTC
#9    completed: 2019-06-28 07:28:15.323636014 +0000 UTC
#9     duration: 140.139µs


#10 [internal] load build context
#10       digest: sha256:e98e6a743b63dd75fd3b9c413a8bb0bbf589633632911f53504ade68aa720fd1
#10         name: "[internal] load build context"
#10      started: 2019-06-28 07:28:15.323940598 +0000 UTC
#10    completed: 2019-06-28 07:28:15.323997436 +0000 UTC
#10     duration: 56.838µs
#10      started: 2019-06-28 07:28:15.324187031 +0000 UTC
#10    completed: 2019-06-28 07:28:15.354017582 +0000 UTC
#10     duration: 29.830551ms
#10 transferring context: 205.60kB 0.0s done


#13 [build 4/5] COPY --chown=gradle:gradle src /home/gradle/app/src
#13       digest: sha256:e6c6d48b9b2ec2a116d143b49a51665d63a527599002abd9759052c5d089e85f
#13         name: "[build 4/5] COPY --chown=gradle:gradle src /home/gradle/app/src"
#13      started: 2019-06-28 07:28:15.356835543 +0000 UTC
#13    completed: 2019-06-28 07:28:15.611531387 +0000 UTC
#13     duration: 254.695844ms


#14 [build 5/5] RUN gradle -x check build -PisCI     && mkdir -p build/depen...
#14       digest: sha256:70bc724d53d2d816dc12aad86c20aa42bf52dbe0c7552518f2fa75f159c4c2a2
#14         name: "[build 5/5] RUN gradle -x check build -PisCI     && mkdir -p build/dependency     && unzip build/libs/*.jar -d build/dependency"
#14      started: 2019-06-28 07:28:15.613872901 +0000 UTC
#6 ...

#6 [stage-1 1/6] FROM docker.io/adoptopenjdk/openjdk11:jre-11.0.3_7-alpine
#6       digest: sha256:156a749a21500b1ec38e0aebb94251db5756d1825662f85c41095af059fc0a99
#6         name: "[stage-1 1/6] FROM docker.io/adoptopenjdk/openjdk11:jre-11.0.3_7-alpine"
#6      started: 2019-06-28 07:28:15.324147408 +0000 UTC
#6      started: 2019-06-28 07:28:16.442492369 +0000 UTC
#6    completed: 2019-06-28 07:28:16.442526457 +0000 UTC
#6     duration: 34.088µs
#6       cached: true
#6 resolve docker.io/adoptopenjdk/openjdk11:jre-11.0.3_7-alpine 1.1s done


#7 [stage-1 2/6] RUN echo -e "\nnetworkaddress.cache.ttl=60\n" >> ${JAVA_HO...
#7       digest: sha256:ee5399ebfb2b5bcc0cf8128121d3f344d3a164b7b3ae04f0bba808e496fec4dc
#7         name: "[stage-1 2/6] RUN echo -e \"\\nnetworkaddress.cache.ttl=60\\n\" >> ${JAVA_HOME}/conf/security/java.security"
#7      started: 2019-06-28 07:28:16.442586959 +0000 UTC
#7 0.339 /bin/sh: can't create /conf/security/java.security: nonexistent directory
#7    completed: 2019-06-28 07:28:16.849273416 +0000 UTC
#7     duration: 406.686457ms
#7        error: "executor failed running [/bin/sh -c echo -e \"\\nnetworkaddress.cache.ttl=$JAVA_DNS_TTL_SECONDS\\n\" >> ${JAVA_HOME}/conf/security/java.security]: exit code: 1"


#14 [build 5/5] RUN gradle -x check build -PisCI     && mkdir -p build/depen...
#14    completed: 2019-06-28 07:28:16.937727929 +0000 UTC
#14     duration: 1.323855028s
#14        error: "executor failed running [/bin/sh -c gradle -x check build ${GRADLE_CI_FLAGS}     && mkdir -p build/dependency     && unzip build/libs/*.jar -d build/dependency]: exit code: 137: context canceled: context canceled"

executor failed running [/bin/sh -c echo -e "\nnetworkaddress.cache.ttl=$JAVA_DNS_TTL_SECONDS\n" >> ${JAVA_HOME}/conf/security/java.security]: exit code: 1

The Dockerfile that produced the above error is:

FROM gradle:5.4.1-jre11-slim as build

ARG GRADLE_CI_FLAGS=""
ENV GRADLE_OPTS="-Xms64m -Xmx192m -Dorg.gradle.daemon=false"
ENV GRADLE_USER_HOME=/home/gradle/.gradle-keep
COPY --chown=gradle:gradle build.gradle settings.gradle /home/gradle/app/
WORKDIR /home/gradle/app
RUN gradle downloadDependencies ${GRADLE_CI_FLAGS}
COPY --chown=gradle:gradle src /home/gradle/app/src
RUN gradle -x check build ${GRADLE_CI_FLAGS} \
    && mkdir -p build/dependency \
    && unzip build/libs/*.jar -d build/dependency

FROM adoptopenjdk/openjdk11:jre-11.0.3_7-alpine

# Change the default JVM DNS cache
ARG JAVA_DNS_TTL_SECONDS=60
RUN echo -e "\nnetworkaddress.cache.ttl=${JAVA_DNS_TTL_SECONDS}\n" >> ${JAVA_HOME}/conf/security/java.security

RUN addgroup -g 1000 appuser && \
    adduser -D -u 1000 -G appuser appuser
USER appuser
WORKDIR /app
ARG DEPENDENCY=/home/gradle/app/build/dependency
COPY --from=build ${DEPENDENCY}/BOOT-INF/lib /app/lib
COPY --from=build ${DEPENDENCY}/META-INF /app/META-INF
COPY --from=build ${DEPENDENCY}/BOOT-INF/classes /app/classes

ENTRYPOINT ["java","-cp","lib/*:classes","com.blah.MyApplication"]
EXPOSE 8080 9080

Describe the results you expected:
Builds to succeed without building corrupted images.

Additional information you deem important (e.g. issue happens only occasionally):
Only appears occasionally, but reasonably frequently for us (maybe 1 in 20 builds?).

Output of docker version:

Client:
Version:           18.09.1
API version:       1.39
Go version:        go1.10.6
Git commit:        4c52b90
Built:             Wed Jan  9 19:35:23 2019
OS/Arch:           linux/amd64
Experimental:      false

Server: Docker Engine - Community
Engine:
Version:          18.09.6
API version:      1.39 (minimum version 1.12)
Go version:       go1.10.8
Git commit:       481bc77
Built:            Sat May  4 01:59:36 2019
OS/Arch:          linux/amd64
Experimental:     false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 3
WARNING: API is accessible on http://0.0.0.0:2375 without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
WARNING: No swap limit support
Server Version: 18.09.6
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1075-aws
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858GiB
Name: ip-10-0-189-69
ID: H3KR:VJMM:SQZP:INOK:NNTM:HOFJ:WQ5V:ZHXF:T2AC:HBIL:MFQC:6VAY
Docker Root Dir: /mnt/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS instances

tonistiigi (Member) commented:

One of the differences is that BuildKit can ask for credentials concurrently. This proved problematic for some credential helpers (with a similar race), so the requests were patched to be sequential in moby/buildkit#982. Note that this fix is on the client side.
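The serialization approach in that patch can be illustrated with a minimal Go sketch (hypothetical names, not BuildKit's actual code): a mutex ensures only one credential-helper invocation runs at a time, even when several image resolutions start in parallel.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// credentialClient serializes lookups so that a racy external helper
// (e.g. docker-credential-pass talking to a gpg-agent) is never invoked
// concurrently. This is an illustrative sketch only.
type credentialClient struct {
	mu       sync.Mutex
	inFlight int32 // counts concurrent helper calls, for demonstration
}

func (c *credentialClient) get(registry string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	// With the mutex held, this can never exceed 1.
	if atomic.AddInt32(&c.inFlight, 1) > 1 {
		return "", fmt.Errorf("concurrent helper invocation for %s", registry)
	}
	defer atomic.AddInt32(&c.inFlight, -1)

	// Stand-in for exec'ing the credential helper binary.
	return "token-for-" + registry, nil
}

func main() {
	c := &credentialClient{}
	var wg sync.WaitGroup
	// Simulate parallel "load metadata" steps all needing credentials.
	for _, reg := range []string{"docker.io", "docker.io", "gcr.io"} {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			tok, err := c.get(r)
			fmt.Println(tok, err)
		}(reg)
	}
	wg.Wait()
}
```

Without the mutex, the two parallel docker.io lookups would hit the helper at the same time, which is the shape of race the patch avoids.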

chadlwilson (Author) commented:

Thanks @tonistiigi! Apologies if this should have been raised in buildkit itself; I wasn't clear on that aspect. Do you feel this sequential patch is likely to address the issue, or at least worth testing? If so, can you help me understand which buildkit/Docker Engine version the fix might be part of, so I can keep an eye out?

Any comment on the observed behavior that buildkit seems to just blindly continue despite the errors it has experienced on earlier layers?

tonistiigi (Member) commented:

I think it's worth testing. All 19.03 versions should have it.

Any comment on the observed behavior that buildkit seems to just blindly continue despite the errors it has experienced on earlier layers?

Yes, it probably could be improved. The issue is that there are two separate components for accessing the image. The first loads only the metadata (in the Dockerfile frontend) and the second does the actual pull. A metadata error is not considered fatal because, in the case of a real error, the pull should fail anyway. But in this case it seems to be flaky, and the pull doesn't actually fail.
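The control flow described above, where a metadata-resolution error is logged but not treated as fatal because the subsequent pull is expected to surface any real failure, can be sketched roughly like this (hypothetical function names, not BuildKit's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// resolveMetadata and pullImage are stand-ins for the two separate
// components described above: the Dockerfile frontend's metadata load,
// and the actual image pull.
func resolveMetadata(ref string) error {
	// May fail transiently, e.g. "pass store is uninitialized".
	return errors.New("error getting credentials")
}

func pullImage(ref string) error {
	// Expected to also fail if credentials are genuinely broken --
	// but in the flaky case it succeeds, masking the earlier error.
	return nil
}

func build(ref string) error {
	if err := resolveMetadata(ref); err != nil {
		// Non-fatal: log and continue; the pull should catch real errors.
		fmt.Printf("metadata error (continuing): %v\n", err)
	}
	return pullImage(ref) // flaky case: succeeds despite the metadata error
}

func main() {
	fmt.Println(build("docker.io/library/gradle:5.4.1-jre11-slim"))
}
```

This shape explains the log above: step #4 records the credential error, yet the build proceeds and only fails later in #7 when the corrupt layer is actually used.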
