podman cannot start containers using 36.20220906.3.2, but can with 36.20220820.3.0 #1305
Comments
How do you overlay […]?
I overlayed […]. An edited butane config is below. It is missing many containers and some systemd timers, but this one should be mostly self-contained.
My […]
I was able to reproduce with the butane configuration provided. One thing to note is that you must run […]. This doesn't seem to be FCOS specific, as I was able to reproduce on Fedora 36. I've isolated the issue down to the kernel bump from 5.18 to 5.19. Using FCOS […]. Here is the same error from Fedora 36: […]
@mheon @alexlarsson can you take a look at this from the podman / quadlet side of things? It looks like something in the 5.19 kernel is causing issues with containers run with quadlet / podman. Kernel 5.18 works with no issues. We are seeing overlay error messages and the containers are not starting.
Reproducer on Fedora 36 with kernel 5.19 or later: […]
Never seen this before, and 5.19 has been out for a long enough time that I suspect we would've heard about serious incompatibilities already in Podman upstream (I'm running the same kernel minor version right now without issues, albeit on Fedora not FCOS). Still, that error does seem to be coming straight out of c/storage's overlayfs code. @nalind Any thoughts?
In #1305 (comment) @mike-nguyen shows a reproducer on a Fedora VM (not FCOS). So if you have one of those handy you should be able to try it yourself.
Thanks for looking into it. Note that it can be reproduced with […]
Fedora CoreOS is Fedora. I will continue to fight against this terminology. Alternative terms are e.g. "dnf-based Fedora" or "traditional Fedora" or so.
Bigger picture, a problem we have now is too many of our tests are highly synthetic - we have an upgrade test, but that upgrade test only validates we can get from one version of the OS to another, it doesn't verify containers still work. This is a big gap. (Also true of podman's upstream CI as far as I know)
We do have upgrade testing, but entirely within Podman versions. We do no validation as to whether OS or kernel updates break us in our upstream CI (honestly, this hasn't come up often)
See containers/storage#1308; maybe it's related.
containers/storage#1308 seems to be about running multiple containers using the same image, but I don't do that and still encounter this issue with […]
Affects rootful containers only; rootless containers work just fine. Anyway, it still might be the same issue, just a different trigger. Too bad we don't have the patch in FCOS yet. Just for the record, the issue isn't related to quadlet for sure: I neither use quadlet, nor any other layered package. Since we already have a reproducer, I refrain from providing a reproducer myself. Once again rolling back to […]
Any updates on this? Still not working with […]
@giuseppe - any ideas what could be the underlying issue here?
Could you try running […]?
With idmapped mounts disabled, all containers spin up just fine on FCOS 37.20221211.3.0 👍 Thank you @giuseppe so far 👍 This should allow me to finally re-enable upgrades (it was disabled waaaay too long...). Is there a way to make this config persistent? Since you asked this elsewhere @giuseppe: all my containers are stored on btrfs filesystems (one subvolume per container, to be more precise). This probably is the only major difference from "regular" FCOS (at least as far as I can think of right now...); I neither use layered packages, nor do I modify anything in […]. Here's one of the affected systemd services:

```ini
[Unit]
Description=Podman container 'bind'
Wants=network-online.target container-network-bind.service
After=network-online.target container-network-bind.service
RequiresMountsFor=%t/containers
RequiresMountsFor=/srv/containers/bind

[Service]
Type=notify
NotifyAccess=all
Environment=PODMAN_SYSTEMD_UNIT=%n
ExecStartPre=/bin/rm -f %t/%n.ctr-id
ExecStart=/usr/bin/podman run --cidfile=%t/%n.ctr-id --sdnotify=conmon --cgroups=no-conmon --replace -dt --name bind --label io.containers.autoupdate=registry --subuidname bind --uidmap 65536:100000007:1 --uidmap 65537:100000002:1 --subgidname bind --gidmap 65536:100000007:1 --gidmap 65537:100000002:1 --mount type=bind,src=/srv/containers/bind/config/local-zones,dst=/etc/named/local-zones,ro=true --mount type=bind,src=/srv/containers/acme/data/live/dot.example.com,dst=/etc/named/ssl/dns-over-tls,ro=true --mount type=bind,src=/srv/containers/bind/config/ssl/dhparams.pem,dst=/etc/named/ssl/dhparams.pem,ro=true --mount type=bind,src=/srv/containers/bind/data,dst=/var/named --net bind --hostname ns.example.com -p 192.0.2.1:53:53/tcp -p 192.0.2.1:53:53/udp -p 192.0.2.1:853:853/tcp -p [2001:db8::1]:53:53/udp -p [2001:db8::1]:53:53/tcp -p [2001:db8::1]:853:853/tcp ghcr.io/sgsgermany/bind:latest
ExecStop=/usr/bin/podman stop --ignore --cidfile=%t/%n.ctr-id
TimeoutStopSec=70
Restart=on-failure

[Install]
WantedBy=default.target
```

I've masked the hostname, global IPv4 and IPv6. The container's sources can be found here: https://github.com/SGSGermany/bind

My subuid scheme looks like the following: […]
And the corresponding entries in […]
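The actual entries did not survive this transcript. Purely for illustration (the user name and ID range below are hypothetical, not the poster's real scheme), entries in `/etc/subuid` and `/etc/subgid` use the same `name:start:count` format:

```
# /etc/subuid (and /etc/subgid, which uses the identical format):
#   <user or group name>:<first subordinate ID>:<count>
bind:100000000:65536
```

An entry like this is what `--subuidname bind` in the unit above resolves against when podman builds the container's user namespace mapping.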
I'm no big fan of pinging issues, but since it's a major issue for my systems: if fixing this will take some time, I'd like to ask again how to persist @giuseppe's workaround. I don't see any difference with idmapped mounts disabled, so making the workaround persistent is totally fine for now and would allow me to upgrade FCOS. Unfortunately I didn't find any config option, nor anything online or in the man pages. Thanks! 👍
There is no way to make it persistent. It is lost on reboots. Maybe you could temporarily use a systemd oneshot service to create it?
That's unfortunate, but yeah, good idea, a systemd oneshot service should do until this is fixed. Thanks @giuseppe, looking forward to an actual fix 👍 Should we create a new issue in […]? Just for the record and for others with the same issue, here's the systemd service I came up with. I was having a hard time figuring out how to trigger the creation of […]:

```ini
[Unit]
Description=Disable idmapped overlayfs mounts of Podman containers (bugfix)
Before=container-bind.service
RequiresMountsFor=%t/containers

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'if [ -f /run/containers/storage/overlay/idmapped-lower-dir-true ]; then mv /run/containers/storage/overlay/idmapped-lower-dir-true /run/containers/storage/overlay/idmapped-lower-dir-false; else if [ ! -d /run/containers/storage/overlay ]; then mkdir -p -m 700 /run/containers/storage/overlay; fi; touch /run/containers/storage/overlay/idmapped-lower-dir-false; fi'
RemainAfterExit=true

[Install]
RequiredBy=container-bind.service
```
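The `ExecStart=` one-liner above can also be read as a standalone script. Here is a minimal equivalent sketch; the overlay runtime directory is parameterized only so the sketch can be exercised without root (on a real host it is `/run/containers/storage/overlay`, as in the unit above):

```shell
#!/bin/sh
# Minimal equivalent of the oneshot ExecStart above: ensure the marker file
# idmapped-lower-dir-false exists, so containers/storage skips idmapped
# overlay lower directories. On a real FCOS host set
#   OVERLAY_RUN=/run/containers/storage/overlay
# The default below is a demo path so the sketch runs anywhere.
OVERLAY_RUN="${OVERLAY_RUN:-${TMPDIR:-/tmp}/overlay-demo}"

mkdir -p -m 700 "$OVERLAY_RUN"
# The original unit renames an existing "true" marker; since both markers are
# empty flag files, removing "true" and creating "false" is equivalent.
rm -f "$OVERLAY_RUN/idmapped-lower-dir-true"
touch "$OVERLAY_RUN/idmapped-lower-dir-false"
```

Unlike the unit's one-liner, this version is idempotent in a single pass (no `if` chain), but the effect is the same flag file either way.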
I'm interested in what the actual problem is here and what the fix would be too. @giuseppe can you help guide us on that front?
@dustymabe I am trying Fedora CoreOS 37.20221211.3.0 with Podman 4.4.0-dev, but still getting errors similar to containers/storage#1308 (comment). From today: […]
@giuseppe is it related or what?
@dustymabe filed an issue, containers/podman#17171, not sure if it's related tho
I am also seeing this on podman 4.4.1 / kernel 6.1.13-100.fc36.x86_64 / Fedora 36, along with the same dmesg output.
@ykuksenko I've written a small systemd service to persist the workaround, see #1305 (comment); you just have to add the container's systemd service to the […]
3. Here is my systemd service unit:

```ini
# /etc/systemd/system/container-speedtest-exporter.service
# container-speedtest-exporter.service
[…]
```

I am not sure, but based on the export/reimport workaround it seems the issue has something to do with on-disk image configuration. I am not sure how to check that, though. I will keep this container in the broken state for a while in case there are other questions. (edit: fixed numbering)
I upgraded another system from Fedora 34 to Fedora 36, skipping 35, and had the same issue happen there too. Only 1 of 2 containers was affected there. The dmesg output is slightly different, namely the code at the end. On that DigitalOcean system: […]
Using the export, delete, import image approach worked. I had not noticed before, but container image tags are lost in that process. They are regained from my registry when I restart the container. I do not have a way to go back on this system.
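The export/delete/import approach mentioned above can be sketched with `podman save`, `podman rmi`, and `podman load`. The image name below is just the example image from earlier in this thread, and the sketch prints the commands instead of executing them so it is safe to copy anywhere:

```shell
#!/bin/sh
# Sketch of the export/delete/import recovery: serialize the image to a tar
# archive, delete the (possibly corrupted) copy in containers/storage, then
# re-import it. As noted above, tags may be lost on load until re-pulled.
IMG="ghcr.io/sgsgermany/bind:latest"        # example image from this thread
ARCHIVE="${TMPDIR:-/tmp}/image-backup.tar"

SAVE="podman save --output $ARCHIVE $IMG"
DELETE="podman rmi $IMG"
IMPORT="podman load --input $ARCHIVE"

# Printed rather than executed so this sketch has no side effects; run the
# commands manually (or drop the echo wrappers) on an affected host.
echo "$SAVE"
echo "$DELETE"
echo "$IMPORT"
```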
I'm still not sure if there is something actionable in this ticket. Are the issues with podman resolved? If not, can we open new bugs against https://github.com/containers/podman?
Not sure either... @giuseppe provided a workaround that works fine in production, but the issue persists. I've just opened a reference issue against containers/podman, see containers/podman#18435
Some feedback from containers/podman#18435: this issue was fixed with Podman 4.5.0, which just landed in stable FCOS. I guess we can close this now. Thanks everyone! 👍
Thanks for the feedback @PhrozenByte!
Describe the bug

I am using an overlayed `quadlet` to generate systemd units, but I cannot start the container with `podman` alone.

Reproduction steps

Steps to reproduce the behavior:
[…] `36.20220906.3.2` and reboot
[…]

Expected behavior

[…]

`dmesg` shows (maybe the id is wrong): […]

Note that also removing all images (`podman images -q | xargs podman rmi`) does not resolve the situation. The systemd file starting mariadb is the following: […]