
CI-Flake: bud-multiple-platform-no-run fails: error changing to intended-new-root directory #3710

Closed
cevich opened this issue Jan 12, 2022 · 44 comments

Comments

@cevich
Member

cevich commented Jan 12, 2022

Description

This test fails with fairly high frequency, though it sometimes passes. It is affecting CI for all PRs against main and for branch-testing itself.

Steps to reproduce the issue:

  1. Run CI on a PR or main branch
  2. bud-multiple-platform-no-run test fails (example from PR 3706)
  3. Re-run the task, all tests pass (sometimes)

Describe the results you received:


[+0853s] # [linux/amd64] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/s390x] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/ppc64le] [2/2] COMMIT
[+0853s] # Writing manifest to image destination
...cut...
[+0853s] # Writing manifest to image destination
[+0853s] # Storing signatures
[+0853s] # 6df1a5aaabbbb36097fefb40bff149c3623dc9008c8f3708fcc1da7dd564bd58
[+0853s] # --> da98c4525b6
[+0853s] # da98c4525b600114fc89b8ced5fa3c4fb7f681adc65f9cfc31dc7e67640d839c
[+0853s] # error building at STEP "COPY --from=0 /root/Dockerfile.no-run /root/": checking on sources under "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": error in copier subprocess: error changing to intended-new-root directory "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": chdir /var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged: no such file or directory
[+0853s] # [ rc=125 (** EXPECTED 0 **) ]
[+0853s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0853s] # #| FAIL: exit code is 125; expected 0
[+0853s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[+0853s] # /var/tmp/go/src/github.com/containers/buildah/tests

Describe the results you expected:

[+0853s] ok 282 bud-multiple-platform-no-run

Output of rpm -q buildah or apt list buildah:

Buildah compiled at runtime for PR #3706

conmon-2.0.30-2.fc34-x86_64
containernetworking-plugins-1.0.1-1.fc34-x86_64
containers-common-1-21.fc34-noarch
container-selinux-2.170.0-2.fc34-noarch
crun-1.4-1.fc34-x86_64
libseccomp-2.5.3-1.fc34-x86_64
package cri-o-runc is not installed
package libseccomp2 is not installed
podman-3.4.2-1.fc34-x86_64
runc-1.0.2-2.fc34-x86_64
skopeo-1.5.2-1.fc34-x86_64
slirp4netns-1.1.12-2.fc34-x86_64

Output of buildah version:

Version:         1.24.0-dev
Go Version:      go1.16.12
Image Spec:      1.0.2-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        1.0.0
libcni Version:  v1.0.1
image Version:   5.18.0
Git Commit:      0e6980f
Built:           Wed Jan 12 10:36:14 2022
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

Output of podman version if reporting a podman build issue:

N/A

Output of cat /etc/*release:

Fedora 35

Output of uname -a:

Fedora 35

Output of cat /etc/containers/storage.conf:

Defaults on Fedora 35

@vrothberg
Member

@edsantiago, Mr. Holmes, do you have data points that would suggest when the flake started?

@edsantiago
Collaborator

My tools for analyzing buildah flakes aren't as polished as those for podman, sorry. This is the best I can do on short notice.

@cevich
Member Author

cevich commented Jan 13, 2022

Oh wow, this has been around for quite a while. I thought it was a new thing. Thanks for the data, Ed.

@cevich
Member Author

cevich commented Jan 13, 2022

@edsantiago
Collaborator

Just now in Integration fedora-34 w/ overlay in #3761

@cevich
Member Author

cevich commented Feb 1, 2022

Last night on 'main' in cirrus-cron - also F34 + overlay.

@edsantiago
Collaborator

Seeing a persistent flake in the podman build-under-bud tests, same test, different error message:

....
         # Storing signatures
         # 674eef7a6d059cb2477b713f2935a25c6bcc4dc4456352b96fc2b56f93db560d
         # --> dceea6611a9
         # dceea6611a9eee252a60342b9ecbab67c9d4fc65180dd85a1383a9c1acb34ada
         # Error: error creating build container: writing blob: adding layer with blob "sha256:25be9552a8196e51e8dbb75dae1cfe46cad31f01b456c6569017abd31ee1f9b9": ApplyLayer exit status 1 stdout:  stderr: Error after fallback to chroot: no such file or directory
         # [ rc=125 (** EXPECTED 0 **) ]

[bud] 248 bud-multiple-platform-no-run

@cevich
Member Author

cevich commented Feb 17, 2022

This continues to flake fairly regularly in daily branch runs. Latest example.

@flouthoc
Collaborator

I think there is a race in c/storage where a layer gets removed by another parallel build. We could remove the flakes if we made multi-arch builds run serially. @nalind, could we make multi-arch builds serial until the race is fixed upstream? I believe one of the SetNames PRs should help, but it could be a different race as well.

diff --git a/imagebuildah/build.go b/imagebuildah/build.go
index 77d8b6d5..a71e58aa 100644
--- a/imagebuildah/build.go
+++ b/imagebuildah/build.go
@@ -243,12 +243,12 @@ func BuildDockerfiles(ctx context.Context, store storage.Store, options define.B
                        logPrefix = "[" + platforms.Format(platformSpec) + "] "
                }
                builds.Go(func() error {
+                       instancesLock.Lock()
                        thisID, thisRef, err := buildDockerfilesOnce(ctx, store, logger, logPrefix, platformOptions, paths, files)
                        if err != nil {
                                return err
                        }
                        id, ref = thisID, thisRef
-                       instancesLock.Lock()
                        instances = append(instances, instance{
                                ID:       thisID,
                                Platform: platformSpec,

@nalind
Member

nalind commented Feb 23, 2022

I'm pretty sure the race here happens when we need to cancel a build because one stage failed, but others are still running. The Executor Delete()s the affected StageExecutor, which deletes its builder, while its Execute() method is still running.
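
Not buildah's actual code, just a minimal, self-contained sketch of the pattern described above: one goroutine is still executing a stage while another tears down the stage's working directory, producing the same class of "no such file or directory" failure seen in the logs.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// stageExecutor stands in for buildah's StageExecutor; root stands in for
// the builder's mounted rootfs.
type stageExecutor struct {
	root string
}

// Execute simulates the copier reading sources under the stage's root.
func (s *stageExecutor) Execute() error {
	_, err := os.ReadFile(filepath.Join(s.root, "Dockerfile.no-run"))
	return err
}

// Delete simulates the Executor tearing the stage down after a sibling
// stage has failed.
func (s *stageExecutor) Delete() error {
	return os.RemoveAll(s.root)
}

func main() {
	root, err := os.MkdirTemp("", "stage")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(root, "Dockerfile.no-run"), []byte("FROM scratch\n"), 0o600); err != nil {
		panic(err)
	}
	s := &stageExecutor{root: root}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); fmt.Println("Execute:", s.Execute()) }() // may lose the race...
	go func() { defer wg.Done(); fmt.Println("Delete:", s.Delete()) }()  // ...against the cleanup
	wg.Wait()
	// The fix described above amounts to making the cleanup wait until
	// Execute has returned (i.e. join the build goroutines before Delete).
}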

@flouthoc
Copy link
Collaborator

@nalind I'm not sure about that; for me it happens even when the build is expected to pass, and in an unpredictable manner. Some of my failures were also image not known, which is thrown specifically from c/storage.

I also hope the issue is at the buildah layer; then it would be much easier to get a permanent fix.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Collaborator

Very funny, github-actions. This is still an enormous annoyance.

@flouthoc
Collaborator

@edsantiago Thanks for the reminder; I'll start looking at this again, and hopefully I'll catch the race this time.

@flouthoc
Collaborator

flouthoc commented Mar 29, 2022

@nalind I think we have three different races happening here:

  • (Happens in both rootless and rootful): I believe the race happens in the pull API, because c/storage does not map arch and name together, so a pull for one arch overrides another pull, causing error creating build container: error locating pulled image "registry.access.redhat.com/ubi8-micro:latest" name in containers storage: registry.access.redhat.com/ubi8-micro:latest: image not known

This typically happens because, in build, most of our pulled-image resolution happens by name:

- `Build A` pulls `alpine (arm64)`
- `Build A` writes pulled image to storage.

- `Build B` pulls `alpine (amd64)`
- `Build B` writes pulled image to storage.

- `Build A` tries to access the pulled image via `LookupImage` by name, but the name has been overridden by `Build B` to point to a different `arch`.

I think it mostly happens between https://github.com/containers/common/blob/main/libimage/pull.go#L54 and https://github.com/containers/common/blob/main/libimage/pull.go#L164; we can have a global runtime lock to prevent this (sketched below).

All three of these race scenarios look unrelated to each other at first glance.
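
As a minimal sketch of that lock idea (the type and function names here are hypothetical, not the libimage API): hold one lock across the pull and the subsequent lookup-by-name, so a concurrent pull of the same name for a different arch cannot re-point the name in between.

package main

import (
	"fmt"
	"sync"
)

// runtime is a stand-in for an image runtime; byName models "last writer
// wins" name resolution, which is what the race above exploits.
type runtime struct {
	mu     sync.Mutex
	byName map[string]string // image name -> image ID
}

// pull simulates writing a freshly pulled per-arch image into storage under
// its name.
func (r *runtime) pull(name, arch string) {
	r.byName[name] = name + "@" + arch // stand-in for the pulled image ID
}

// pullAndLookup holds the lock across pull + lookup, so the caller always
// resolves the image it just pulled rather than whatever another build
// wrote last.
func (r *runtime) pullAndLookup(name, arch string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pull(name, arch)
	return r.byName[name]
}

func main() {
	rt := &runtime{byName: map[string]string{}}
	var wg sync.WaitGroup
	for _, arch := range []string{"arm64", "amd64"} {
		arch := arch
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Without the lock, this lookup could return the other arch's image.
			fmt.Println(arch, "->", rt.pullAndLookup("alpine", arch))
		}()
	}
	wg.Wait()
}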

@flouthoc
Collaborator

Although these races are legit, it should not flake in CI if we are already expecting things to be sequential in CI with --jobs=0 here: https://github.com/containers/buildah/blob/main/tests/bud.bats#L3611. The above PR should address that.

@cevich
Member Author

cevich commented Mar 29, 2022

> we can have a global runtime lock to prevent this.

I don't know the code, but that sounds like it might prevent parallel pulling, no? IMHO, pulling in parallel is almost universally desirable since bandwidth is vastly cheaper than engineer-time 😀

@flouthoc
Collaborator

@cevich We can make it granular so that it only locks pulls with the same name; it would still allow other pulls to run in parallel (see the sketch below). Anyway, the problem could be entirely different.
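
A sketch of that granular variant, again with hypothetical names rather than the c/common API: a per-name lock, so only pulls of the same image name serialize while pulls of different names still proceed in parallel.

package main

import (
	"fmt"
	"sync"
)

// nameLocks hands out one mutex per image name.
type nameLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lockFor returns the mutex dedicated to a given image name, creating it on
// first use.
func (n *nameLocks) lockFor(name string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks == nil {
		n.locks = make(map[string]*sync.Mutex)
	}
	l, ok := n.locks[name]
	if !ok {
		l = &sync.Mutex{}
		n.locks[name] = l
	}
	return l
}

// pullSerialized runs doPull while holding the lock for that name: two pulls
// of "alpine" serialize, but "alpine" and "ubi8-micro" still run concurrently.
func pullSerialized(n *nameLocks, name string, doPull func() error) error {
	l := n.lockFor(name)
	l.Lock()
	defer l.Unlock()
	return doPull()
}

func main() {
	var n nameLocks
	var wg sync.WaitGroup
	for _, name := range []string{"alpine", "alpine", "ubi8-micro"} {
		name := name
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = pullSerialized(&n, name, func() error {
				fmt.Println("pulling", name)
				return nil
			})
		}()
	}
	wg.Wait()
}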

@edsantiago @cevich I have a small question: have we had this flake since the inception of bud-multiple-platform-no-run, or was it introduced somewhere along the way? The answer would help me diagnose whether a later commit in c/storage, c/image, or buildah introduced these race conditions.

@edsantiago
Collaborator

The first instance I see was 2021-10-08, but unfortunately those logs are gone.

cevich added a commit to cevich/buildah that referenced this issue Mar 30, 2022
Ref: containers#3710

Signed-off-by: Chris Evich <cevich@redhat.com>
edsantiago added a commit to edsantiago/libpod that referenced this issue Mar 30, 2022
The bud-multiple-platform-no-run test is flaking way too much.
Disable it. See containers/buildah#3710

Signed-off-by: Ed Santiago <santiago@redhat.com>
@cevich
Member Author

cevich commented Jun 10, 2022

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Collaborator

Dum de dum, still happening

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@cevich
Member Author

cevich commented Aug 22, 2022

Oddly, I haven't seen this pop up recently. @edsantiago and @lsm5, do you remember anything in podman-monitor for August?

@edsantiago
Collaborator

On podman, last seen Aug 2

@cevich
Member Author

cevich commented Aug 23, 2022

Interesting, I remember this issue reproducing a lot more frequently. We saw it all the time (several times per week) in Buildah and Podman CI. Still, I s'pose timings could have changed, causing it to flake less 😞. In any case, I also don't recall any recent activity on a fix. @flouthoc, where are we at with this? Do you recall if anything else was done?

@flouthoc
Collaborator

@cevich I don't think this is fixed at all; we don't see it occurring as frequently now because we tweaked the concurrency in the multi-platform build tests, so it's just suppressed. Last time I worked on this I was unable to catch the race; maybe I'll give it a shot again.

@cevich
Member Author

cevich commented Aug 23, 2022

> maybe I'll give it a shot again.

Seems even more difficult now, at least in "wild-CI". Though generally I agree that some kind of reproducer would be helpful here. Do you have a notion of where the problem comes from, and could that code perhaps be instrumented to force it to occur more reliably/frequently?

@cevich
Member Author

cevich commented Aug 23, 2022

E.g., if you want to spawn a VM with more CPUs from hack/get_ci_vm.sh, you can edit the script locally and change the GCLOUD_CPUS value as needed.

@edsantiago
Collaborator

New failure (yesterday), in actual buildah CI, not podman cron:

Writing manifest to image destination
Storing signatures
--> 89040f772e5
89040f772e587c7b96afb6cdce3c406adcc4c4c8db898f57aae968e166a24c70
error building at STEP "COPY --from=0 /root/Dockerfile.no-run /root/": checking on sources under "/var/tmp/buildah_tests.vvt47q/root/overlay/30e0274533d907207a6ead3e729247f61d1e86e8f10f18c0377790369041f200/merged": error in copier subprocess: error changing to intended-new-root directory "/var/tmp/buildah_tests.vvt47q/root/overlay/30e0274533d907207a6ead3e729247f61d1e86e8f10f18c0377790369041f200/merged": chdir /var/tmp/buildah_tests.vvt47q/root/overlay/30e0274533d907207a6ead3e729247f61d1e86e8f10f18c0377790369041f200/merged: no such file or directory
[ rc=125 (** EXPECTED 0 **) ]

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@cevich
Member Author

cevich commented Jan 19, 2023

IMHO we can close this. I haven't seen any occurrences in the daily jobs for quite a while.

@cevich cevich closed this as completed Jan 19, 2023
@edsantiago
Collaborator

I wonder if this might have something to do with why we're not seeing the flake any more?

buildah/tests/bud.bats

Lines 5285 to 5288 in 4f8706b

# Note: [This is a bug] jobs=1 is intentionally set here since --jobs=0 sets
# concurrency to maximum which uncovers all sorts of race condition causing
# flakes in CI. Please put this back to --jobs=0 when https://github.com/containers/buildah/issues/3710
# is resolved.
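
Illustrative only (this is not buildah's actual code): a sketch of how a --jobs style knob typically maps onto build concurrency, with 0 meaning "no limit", using golang.org/x/sync/errgroup. It shows why --jobs=0 maximizes concurrency and why pinning jobs=1 hides the races rather than fixing them.

package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

func buildAllPlatforms(platforms []string, jobs int) error {
	var group errgroup.Group
	if jobs > 0 {
		group.SetLimit(jobs) // --jobs=1 serializes the per-platform builds
	} // jobs == 0: leave the group unlimited, i.e. maximum concurrency
	for _, p := range platforms {
		p := p
		group.Go(func() error {
			fmt.Println("building", p) // stand-in for one per-platform build
			return nil
		})
	}
	return group.Wait()
}

func main() {
	// With jobs=1 the three builds run one at a time; with jobs=0 they all
	// run at once, which is the mode that shakes the races loose in CI.
	_ = buildAllPlatforms([]string{"linux/amd64", "linux/s390x", "linux/ppc64le"}, 1)
}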

@cevich
Member Author

cevich commented Jan 25, 2023

Damn, I bet that's exactly the reason 😞. I would be in favor of removing that, or possibly making it conditional on an env-var or something.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2023