
CI-Flake: bud-multiple-platform-no-run fails: error changing to intended-new-root directory #3710

Closed
cevich opened this issue Jan 12, 2022 · 44 comments

Comments

@cevich
Member

cevich commented Jan 12, 2022

Description

This test fails with fairly high frequency, though it sometimes passes. It is affecting CI for all PRs against main and for branch-testing itself.

Steps to reproduce the issue:

  1. Run CI on a PR or main branch
  2. bud-multiple-platform-no-run test fails (example from PR 3706)
  3. Re-run the task, all tests pass (sometimes)

Describe the results you received:


[+0853s] # [linux/amd64] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/s390x] [2/2] COMMIT
[+0853s] # Getting image source signatures
...cut...
[+0853s] # [linux/ppc64le] [2/2] COMMIT
[+0853s] # Writing manifest to image destination
...cut...
[+0853s] # Writing manifest to image destination
[+0853s] # Storing signatures
[+0853s] # 6df1a5aaabbbb36097fefb40bff149c3623dc9008c8f3708fcc1da7dd564bd58
[+0853s] # --> da98c4525b6
[+0853s] # da98c4525b600114fc89b8ced5fa3c4fb7f681adc65f9cfc31dc7e67640d839c
[+0853s] # error building at STEP "COPY --from=0 /root/Dockerfile.no-run /root/": checking on sources under "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": error in copier subprocess: error changing to intended-new-root directory "/var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged": chdir /var/tmp/buildah_tests.gbywtb/root/overlay/c0299fe974ccf3e964e1b623e5ee121afc0e63b08dfe9ad37da2b932f75cf0df/merged: no such file or directory
[+0853s] # [ rc=125 (** EXPECTED 0 **) ]
[+0853s] # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
[+0853s] # #| FAIL: exit code is 125; expected 0
[+0853s] # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[+0853s] # /var/tmp/go/src/github.com/containers/buildah/tests

Describe the results you expected:

[+0853s] ok 282 bud-multiple-platform-no-run

Output of rpm -q buildah or apt list buildah:

Buildah compiled at runtime for PR #3706

conmon-2.0.30-2.fc34-x86_64
containernetworking-plugins-1.0.1-1.fc34-x86_64
containers-common-1-21.fc34-noarch
container-selinux-2.170.0-2.fc34-noarch
crun-1.4-1.fc34-x86_64
libseccomp-2.5.3-1.fc34-x86_64
package cri-o-runc is not installed
package libseccomp2 is not installed
podman-3.4.2-1.fc34-x86_64
runc-1.0.2-2.fc34-x86_64
skopeo-1.5.2-1.fc34-x86_64
slirp4netns-1.1.12-2.fc34-x86_64

Output of buildah version:

Version:         1.24.0-dev
Go Version:      go1.16.12
Image Spec:      1.0.2-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        1.0.0
libcni Version:  v1.0.1
image Version:   5.18.0
Git Commit:      0e6980f
Built:           Wed Jan 12 10:36:14 2022
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

Output of podman version if reporting a podman build issue:

N/A

Output of cat /etc/*release:

Fedora 35

Output of uname -a:

Fedora 35

Output of cat /etc/containers/storage.conf:

Defaults on Fedora 35

@vrothberg
Member

@edsantiago, Mr. Holmes, do you have data points that would suggest when the flake started?

@edsantiago
Collaborator

My tools for analyzing buildah flakes aren't as polished as those for podman, sorry. This is the best I can do on short notice.

@cevich
Member Author

cevich commented Jan 13, 2022

Oh wow, this has been around for quite a while. I thought it was a new thing. Thanks for the data, Ed.

@cevich
Member Author

cevich commented Jan 13, 2022

@edsantiago
Collaborator

Just now in Integration fedora-34 w/ overlay in #3761

@cevich
Member Author

cevich commented Feb 1, 2022

Last night on 'main' in cirrus-cron - also F34 + overlay.

@edsantiago
Collaborator

Seeing a persistent flake in the podman build-under-bud tests, same test, different error message:

....
         # Storing signatures
         # 674eef7a6d059cb2477b713f2935a25c6bcc4dc4456352b96fc2b56f93db560d
         # --> dceea6611a9
         # dceea6611a9eee252a60342b9ecbab67c9d4fc65180dd85a1383a9c1acb34ada
         # Error: error creating build container: writing blob: adding layer with blob "sha256:25be9552a8196e51e8dbb75dae1cfe46cad31f01b456c6569017abd31ee1f9b9": ApplyLayer exit status 1 stdout:  stderr: Error after fallback to chroot: no such file or directory
         # [ rc=125 (** EXPECTED 0 **) ]

[bud] 248 bud-multiple-platform-no-run

@cevich
Member Author

cevich commented Feb 17, 2022

This continues to flake fairly regularly in daily branch runs. Latest example.

@flouthoc
Collaborator

I think there is a race in c/storage where a layer gets removed by another parallel build. We could remove the flakes if we made multi-arch builds run serially. @nalind, could we make multi-arch builds serial until the race is fixed upstream? I believe one of the SetNames PRs should help, but it could be a different race as well.

diff --git a/imagebuildah/build.go b/imagebuildah/build.go
index 77d8b6d5..a71e58aa 100644
--- a/imagebuildah/build.go
+++ b/imagebuildah/build.go
@@ -243,12 +243,12 @@ func BuildDockerfiles(ctx context.Context, store storage.Store, options define.B
                        logPrefix = "[" + platforms.Format(platformSpec) + "] "
                }
                builds.Go(func() error {
+                       instancesLock.Lock()
                        thisID, thisRef, err := buildDockerfilesOnce(ctx, store, logger, logPrefix, platformOptions, paths, files)
                        if err != nil {
                                return err
                        }
                        id, ref = thisID, thisRef
-                       instancesLock.Lock()
                        instances = append(instances, instance{
                                ID:       thisID,
                                Platform: platformSpec,

@nalind
Member

nalind commented Feb 23, 2022

I'm pretty sure the race here happens when we need to cancel a build because one stage failed, but others are still running. The Executor Delete()s the affected StageExecutor, which deletes its builder, while its Execute() method is still running.
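
Not buildah's actual code, just a minimal, self-contained sketch of the pattern described above: one goroutine is still executing a stage while another tears down the stage's working directory, producing the same class of "no such file or directory" failure seen in the logs.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// stageExecutor stands in for buildah's StageExecutor; root stands in for
// the builder's mounted rootfs.
type stageExecutor struct {
	root string
}

// Execute simulates the copier reading sources under the stage's root.
func (s *stageExecutor) Execute() error {
	_, err := os.ReadFile(filepath.Join(s.root, "Dockerfile.no-run"))
	return err
}

// Delete simulates the Executor tearing the stage down after a sibling
// stage has failed.
func (s *stageExecutor) Delete() error {
	return os.RemoveAll(s.root)
}

func main() {
	root, err := os.MkdirTemp("", "stage")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(root, "Dockerfile.no-run"), []byte("FROM scratch\n"), 0o600); err != nil {
		panic(err)
	}
	s := &stageExecutor{root: root}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); fmt.Println("Execute:", s.Execute()) }() // may lose the race...
	go func() { defer wg.Done(); fmt.Println("Delete:", s.Delete()) }()  // ...against the cleanup
	wg.Wait()
	// The fix described above amounts to making the cleanup wait until
	// Execute has returned (i.e. join the build goroutines before Delete).
}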

@flouthoc
Copy link
Collaborator

@nalind I'm not sure about that; for me it happens even when the build is expected to pass, and in an unpredictable manner. Some of my failures were also image not known, which is thrown specifically from c/storage.

I also hope the issue is at the buildah layer; then it would be much easier to get a permanent fix.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Collaborator

Very funny, github-actions. This is still an enormous annoyance.

@flouthoc
Collaborator

@edsantiago Thanks for the reminder; I'll start looking at this again, and hopefully I'll catch the race this time.

@flouthoc
Collaborator

flouthoc commented Mar 29, 2022

@nalind I think we have three different races happening here:

  • (Happens in both rootless and rootful): I believe the race happens in the pull API, because c/storage does not map arch and name together, so a pull for one arch overrides another pull, causing error creating build container: error locating pulled image "registry.access.redhat.com/ubi8-micro:latest" name in containers storage: registry.access.redhat.com/ubi8-micro:latest: image not known

This typically happens because, in build, most of our pulled-image resolution happens by name:

- `Build A` pulls `alpine (arm64)`
- `Build A` writes pulled image to storage.

- `Build B` pulls `alpine (amd64)`
- `Build B` writes pulled image to storage.

- `Build A` tries to access the pulled image via `LookupImage` by name, but the name has been overridden by `Build B` to point to a different `arch`.

I think it mostly happens between https://github.com/containers/common/blob/main/libimage/pull.go#L54 and https://github.com/containers/common/blob/main/libimage/pull.go#L164; we can have a global runtime lock to prevent this (sketched below).

All three of these race scenarios look unrelated to each other at first glance.
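
As a minimal sketch of that lock idea (the type and function names here are hypothetical, not the libimage API): hold one lock across the pull and the subsequent lookup-by-name, so a concurrent pull of the same name for a different arch cannot re-point the name in between.

package main

import (
	"fmt"
	"sync"
)

// runtime is a stand-in for an image runtime; byName models "last writer
// wins" name resolution, which is what the race above exploits.
type runtime struct {
	mu     sync.Mutex
	byName map[string]string // image name -> image ID
}

// pull simulates writing a freshly pulled per-arch image into storage under
// its name.
func (r *runtime) pull(name, arch string) {
	r.byName[name] = name + "@" + arch // stand-in for the pulled image ID
}

// pullAndLookup holds the lock across pull + lookup, so the caller always
// resolves the image it just pulled rather than whatever another build
// wrote last.
func (r *runtime) pullAndLookup(name, arch string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pull(name, arch)
	return r.byName[name]
}

func main() {
	rt := &runtime{byName: map[string]string{}}
	var wg sync.WaitGroup
	for _, arch := range []string{"arm64", "amd64"} {
		arch := arch
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Without the lock, this lookup could return the other arch's image.
			fmt.Println(arch, "->", rt.pullAndLookup("alpine", arch))
		}()
	}
	wg.Wait()
}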

@flouthoc
Collaborator

Although these races are legit, it should not flake in CI if we are already expecting things to be sequential in CI with --jobs=0 here: https://github.com/containers/buildah/blob/main/tests/bud.bats#L3611. The above PR should address that.

@cevich
Member Author

cevich commented Mar 29, 2022

> we can have a global runtime lock to prevent this.

I don't know the code, but that sounds like it might prevent parallel pulling, no? IMHO, pulling in parallel is almost universally desirable since bandwidth is vastly cheaper than engineer-time 😀

@flouthoc
Collaborator

@cevich We can make it granular so that it only locks pulls with the same name; it would still allow other pulls to run in parallel (see the sketch below). Anyway, the problem could be entirely different.
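
A sketch of that granular variant, again with hypothetical names rather than the c/common API: a per-name lock, so only pulls of the same image name serialize while pulls of different names still proceed in parallel.

package main

import (
	"fmt"
	"sync"
)

// nameLocks hands out one mutex per image name.
type nameLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lockFor returns the mutex dedicated to a given image name, creating it on
// first use.
func (n *nameLocks) lockFor(name string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks == nil {
		n.locks = make(map[string]*sync.Mutex)
	}
	l, ok := n.locks[name]
	if !ok {
		l = &sync.Mutex{}
		n.locks[name] = l
	}
	return l
}

// pullSerialized runs doPull while holding the lock for that name: two pulls
// of "alpine" serialize, but "alpine" and "ubi8-micro" still run concurrently.
func pullSerialized(n *nameLocks, name string, doPull func() error) error {
	l := n.lockFor(name)
	l.Lock()
	defer l.Unlock()
	return doPull()
}

func main() {
	var n nameLocks
	var wg sync.WaitGroup
	for _, name := range []string{"alpine", "alpine", "ubi8-micro"} {
		name := name
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = pullSerialized(&n, name, func() error {
				fmt.Println("pulling", name)
				return nil
			})
		}()
	}
	wg.Wait()
}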

@edsantiago @cevich I have a small question: have we had this flake since the inception of bud-multiple-platform-no-run, or was it introduced somewhere along the way? The answer would help me diagnose whether a later commit in c/storage, c/image, or buildah introduced these race conditions.

@edsantiago
Collaborator

The first instance I see was 2021-10-08, but unfortunately those logs are gone.

cevich added a commit to cevich/buildah that referenced this issue Mar 30, 2022
Ref: containers#3710

Signed-off-by: Chris Evich <cevich@redhat.com>
edsantiago added a commit to edsantiago/libpod that referenced this issue Mar 30, 2022
The bud-multiple-platform-no-run test is flaking way too much.
Disable it. See containers/buildah#3710

Signed-off-by: Ed Santiago <santiago@redhat.com>
@cevich
Member Author

cevich commented Jun 10, 2022

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Collaborator

Dum de dum, still happening

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@cevich
Member Author

cevich commented Aug 22, 2022

Oddly, I haven't seen this pop up recently. @edsantiago and @lsm5, do you remember anything in podman-monitor for August?

@edsantiago
Collaborator

On podman, last seen Aug 2

@cevich
Member Author

cevich commented Aug 23, 2022

Interesting, I remember this issue reproducing a lot more frequently. We saw it all the time (several times per week) in Buildah and Podman CI. Still, I s'pose timings could have changed, causing it to flake less 😞. In any case, I also don't recall any recent activity on a fix. @flouthoc, where are we at with this? Do you recall if anything else was done?

@flouthoc
Collaborator

@cevich I don't think this is fixed at all; we don't see it occurring as frequently now because we tweaked the concurrency in the multi-platform build tests, so it's just suppressed. Last time I worked on this I was unable to catch the race; maybe I'll give it a shot again.

@cevich
Member Author

cevich commented Aug 23, 2022

> maybe I'll give it a shot again.

Seems even more difficult now, at least in "wild-CI". Though generally I agree that some kind of reproducer would be helpful here. Do you have a notion of where the problem comes from, and could that code perhaps be instrumented to force it to occur more reliably/frequently?

@cevich
Member Author

cevich commented Aug 23, 2022

E.g., if you want to spawn a VM with more CPUs from hack/get_ci_vm.sh, you can edit the script locally and change the GCLOUD_CPUS value as needed.

@edsantiago
Collaborator

New failure (yesterday), in actual buildah CI, not podman cron:

Writing manifest to image destination
Storing signatures
--> 89040f772e5
89040f772e587c7b96afb6cdce3c406adcc4c4c8db898f57aae968e166a24c70
error building at STEP "COPY --from=0 /root/Dockerfile.no-run /root/": checking on sources under "/var/tmp/buildah_tests.vvt47q/root/overlay/30e0274533d907207a6ead3e729247f61d1e86e8f10f18c0377790369041f200/merged": error in copier subprocess: error changing to intended-new-root directory "/var/tmp/buildah_tests.vvt47q/root/overlay/30e0274533d907207a6ead3e729247f61d1e86e8f10f18c0377790369041f200/merged": chdir /var/tmp/buildah_tests.vvt47q/root/overlay/30e0274533d907207a6ead3e729247f61d1e86e8f10f18c0377790369041f200/merged: no such file or directory
[ rc=125 (** EXPECTED 0 **) ]

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@cevich
Member Author

cevich commented Jan 19, 2023

IMHO we can close this. I haven't seen any occurrences in the daily jobs for quite a while.

@cevich cevich closed this as completed Jan 19, 2023
@edsantiago
Collaborator

I wonder if this might have something to do with why we're not seeing the flake any more?

buildah/tests/bud.bats

Lines 5285 to 5288 in 4f8706b

# Note: [This is a bug] jobs=1 is intentionally set here since --jobs=0 sets
# concurrency to maximum which uncovers all sorts of race condition causing
# flakes in CI. Please put this back to --jobs=0 when https://github.com/containers/buildah/issues/3710
# is resolved.
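
Illustrative only (this is not buildah's actual code): a sketch of how a --jobs style knob typically maps onto build concurrency, with 0 meaning "no limit", using golang.org/x/sync/errgroup. It shows why --jobs=0 maximizes concurrency and why pinning jobs=1 hides the races rather than fixing them.

package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

func buildAllPlatforms(platforms []string, jobs int) error {
	var group errgroup.Group
	if jobs > 0 {
		group.SetLimit(jobs) // --jobs=1 serializes the per-platform builds
	} // jobs == 0: leave the group unlimited, i.e. maximum concurrency
	for _, p := range platforms {
		p := p
		group.Go(func() error {
			fmt.Println("building", p) // stand-in for one per-platform build
			return nil
		})
	}
	return group.Wait()
}

func main() {
	// With jobs=1 the three builds run one at a time; with jobs=0 they all
	// run at once, which is the mode that shakes the races loose in CI.
	_ = buildAllPlatforms([]string{"linux/amd64", "linux/s390x", "linux/ppc64le"}, 1)
}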

@cevich
Member Author

cevich commented Jan 25, 2023

Damn, I bet that's exactly the reason 😞. I would be in favor of removing that, or possibly making it conditional on an env-var or something.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2023