Podman hangs, possible deadlock ? #16062

Closed
lukasmrtvy opened this issue Oct 5, 2022 · 22 comments
Labels
kind/bug · locked - please file new issue/PR · stale-issue

Comments

@lukasmrtvy

lukasmrtvy commented Oct 5, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description
It seems I am hitting the very same problem described in #1574 (4 years old), but with the latest Podman version (4.2.0).

My scenario is slightly different. I am running a "custom scheduler" in a container with the Podman socket (podman.sock) mounted into it, started as a systemd service (see the sketch of the unit after the steps below). The scheduler creates, manages, deletes, and waits for containers on our behalf (other operations such as attaching, streaming logs, and stats are also involved). When I restart this scheduler, Podman becomes unavailable. Containers created via the "custom scheduler" also use --userns=auto.

Steps to reproduce the issue:
(The scheduler's own business logic is not relevant here.)

  1. systemctl start scheduler.service ( podman run -v "/run/podman/podman.sock:/run/podman/podman.sock" myscheduler )

  2. systemctl restart scheduler.service
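
For context, a minimal sketch of the kind of unit described above. The unit name, dependency on podman.socket, image name, and restart policy are assumptions; only the socket mount and image from step 1 are taken from the report:

# /etc/systemd/system/scheduler.service (illustrative sketch)
[Unit]
Description=custom scheduler talking to the Podman API socket
After=podman.socket
Requires=podman.socket

[Service]
ExecStart=/usr/bin/podman run --rm --name scheduler \
    -v /run/podman/podman.sock:/run/podman/podman.sock \
    myscheduler
Restart=on-failure

[Install]
WantedBy=multi-user.target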

Describe the results you received:
Podman is not accessible via either the CLI or the API.

Describe the results you expected:
Podman remains accessible via both the CLI and the API.

Additional information you deem important (e.g. issue happens only occasionally):

Related logs:
journalctl -u podman --no-pager | grep 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944

...
Oct 05 16:50:32 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:27 +0000] "POST /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/start HTTP/1.1" 204 0 "" ""
Oct 05 16:50:37 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:32 +0000] "GET /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/json HTTP/1.1" 200 9033 "" ""
Oct 05 17:00:32 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:22 +0000] "POST /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/attach?stream=true&stdout=true&stderr=true HTTP/1.1" 200 0 "" ""
Oct 05 17:00:32 ip-10-2-22-82 podman[5526]: 2022-10-05 17:00:32.984086144 +0000 UTC m=+1448.094378702 container died 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944 (image=XXXX.dkr.ecr.us-east-1.amazonaws.com/XXXX:0NkyJzxqJQWVq2aet, name=lucid_mclaren, health_status=)
Oct 05 17:00:52 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:32 +0000] "POST /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/wait HTTP/1.1" 200 30 "" "
...

ps -ef | grep 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944

...
root       13829       1  0 16:50 ?        00:00:00 /usr/bin/conmon --api-version 1 -c 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944 -u 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata -p /run/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/pidfile -n lucid_mclaren --exit-dir /run/libpod/exits --full-attach -s -l k8s-file:/var/lib/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/ctr.log --log-level info --log-size-max 10485760 --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/oci-log --conmon-pidfile /run/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg info --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944
root       44396   13829  0 17:00 ?        00:00:00 /usr/bin/podman --root /var/lib/containers/storage --runroot /run/containers/storage --log-level info --cgroup-manager systemd --tmpdir /run/libpod --network-config-dir  --network-backend netavark --volumepath /var/lib/containers/storage/volumes --runtime crun --storage-driver overlay --storage-opt overlay.mountopt=nodev,metacopy=on --events-backend journald container cleanup 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944
...

lslocks | grep 44396

...
podman   5482 POSIX      WRITE  0     0   0 /var/lib/containers/storage/userns.lock
podman  44396 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman   5482 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/storage.lock
podman   5482 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-layers/layers.lock
podman   5482 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-images/images.lock
...

ps -ef | grep 5482

root        5482       1 81 19:09 ?        00:17:37 /usr/bin/podman --log-level=info system service

strace podman ps

...
futex(0x56081b801868, FUTEX_WAIT_PRIVATE, 0, NULL

Output of podman version:

Client:       Podman Engine
Version:      4.2.0
API Version:  4.2.0
Go Version:   go1.18.4
Built:        Thu Aug 11 14:42:17 2022
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.27.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpuUtilization:
    idlePercent: 92.99
    systemPercent: 1.91
    userPercent: 5.09
  cpus: 16
  distribution:
    distribution: fedora
    variant: coreos
    version: "36"
  eventLogger: journald
  hostname: ip-10-2-22-82
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.18.18-200.fc36.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 123904901120
  memTotal: 133529747456
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.5-1.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.5
      commit: 54ebb8ca8bf7e6ddae2eb919f5b82d1d96863dea
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 1h 25m 45.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 34
    paused: 0
    running: 0
    stopped: 34
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 1098974756864
  graphRootUsed: 12981538816
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 11
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.2.0
  Built: 1660228937
  BuiltTime: Thu Aug 11 14:42:17 2022
  GitCommit: ""
  GoVersion: go1.18.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.2.0

Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):

podman-4.2.0-2.fc36.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS EC2 instance running Fedora CoreOS 36.20220820.3.0, with /var/lib/containers/storage/ mounted from an EBS volume

openshift-ci bot added the kind/bug label Oct 5, 2022
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Nov 11, 2022

@mheon PTAL

@lukasmrtvy
Author

lukasmrtvy commented Nov 22, 2022

Adding the output from a SIGABRT signal: https://pastebin.com/raw/wTuJ3jND (large output).
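
For anyone who needs to capture a similar dump: a Go binary exits with a full goroutine stack dump when it receives SIGABRT (or SIGQUIT), so something along these lines should work against the hung API service. The unit name podman.service and the PID lookup below are assumptions about this setup:

# find the hung API service process (PID 5482 in the logs above)
pgrep -af 'podman.*system service'

# ask the Go runtime to dump all goroutine stacks; note that the process exits afterwards
kill -ABRT <pid>

# the dump goes wherever the service's stderr goes, typically the journal
journalctl -u podman.service --no-pager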

@mheon
Member

mheon commented Nov 22, 2022

github.com/containers/podman/vendor/github.com/containers/storage.(*store).GraphDriver(0xc000520000)

Seems like a c/storage deadlock, given this.

@mheon
Member

mheon commented Nov 22, 2022

@nalind Any thoughts?

@nalind
Member

nalind commented Nov 28, 2022

Well, the backtrace includes one goroutine (1) waiting for a file lock, and no other goroutines on a path that would obviously have obtained that lock. The lslocks output lists two processes, so the backtrace is from the one that is waiting for the lock. The lock is currently held by the other process, so if something is acquiring it and just holding it (either because of a bug, or because it is doing something time-consuming while holding it), that is happening in the other process.

@lukasmrtvy
Author

Adding the SIGABRT output from the second process (podman system service): https://transfer.sh/SrjUkf/pdm.dump

@lukasmrtvy
Author

@nalind By the way, none of this happens with Docker (with Podman we are using --userns=auto and a few libpod endpoints instead of the compat ones, and there is of course the cgroup v1 vs. cgroup v2 difference).

@nalind
Member

nalind commented Dec 1, 2022

Confirmed: it's a locking bug. In goroutine 57243, the getMaxSizeFromImage() function calls LayerStore() before iterating through the list of known layers, and LayerStore() attempts to lock the layers.json file while reading it. The problem is that getMaxSizeFromImage() is being called from getAutoUserNS(), which is being called from CreateContainer(), which is already holding a write lock on that very file.
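
For readers following the analysis: the shape of the bug is a non-reentrant lock being re-acquired from a nested call. A minimal Go sketch of that pattern follows; the type and function names are illustrative stand-ins, not the actual c/storage code, which uses file locks rather than a sync.Mutex:

package main

import "sync"

// store stands in for the c/storage store object.
type store struct {
	layersLock sync.Mutex // stands in for overlay-layers/layers.lock
}

// layerStore plays the role of LayerStore(): it locks the layer list before reading it.
func (s *store) layerStore() {
	s.layersLock.Lock() // blocks forever if the caller already holds the lock
	defer s.layersLock.Unlock()
	// ... read layers.json ...
}

// createContainer plays the role of CreateContainer(): it takes the layers lock and then,
// via getAutoUserNS() -> getMaxSizeFromImage(), ends up calling layerStore() again.
func (s *store) createContainer() {
	s.layersLock.Lock()
	defer s.layersLock.Unlock()
	s.layerStore() // second Lock() on a lock this goroutine already holds: deadlock
}

func main() {
	(&store{}).createContainer() // hangs; the Go runtime aborts with "all goroutines are asleep - deadlock!"
}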

@nalind
Member

nalind commented Dec 1, 2022

... and I'm pretty sure @mtrmac fixed it in containers/storage#1387.

@mtrmac
Collaborator

mtrmac commented Dec 2, 2022

getMaxSizeFromImage trying to acquire an already-held layers.lock could happen if storage.lock triggered a reload.

In this log, in goroutine 57243, isn’t the blocking lock 0xc00011ecd0 store.graphLock = storage.lock, not layers.lock?

Either way, that getMaxSizeFromImage → LayerStore call is a problem, because it nests storage.lock inside layers.lock (held by CreateContainer), while the rest of store nests layers.lock inside storage.lock. In this case, it could be goroutine 62699, which holds store.lock but blocks on layers.lock.

It does seem that containers/storage#1387 fixes this … without even noticing that it was broken. Oops. Does anyone have suggestions for containers/storage#1389?

I’ll, at least, update containers/storage#1438 to document the graphLock vs. C/I/L lock-ordering rules.
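
The cross-goroutine variant described here is the classic lock-ordering (AB-BA) inversion: one path takes layers.lock and then storage.lock, the other takes storage.lock and then layers.lock, and each can end up waiting on the lock the other holds. A hedged sketch of that pattern, again with illustrative names rather than the real c/storage types:

package main

import "sync"

var (
	storageLock sync.Mutex // stands in for storage.lock (graphLock)
	layersLock  sync.Mutex // stands in for overlay-layers/layers.lock
)

// pathA mimics CreateContainer -> getAutoUserNS -> getMaxSizeFromImage:
// layers.lock is already held when storage.lock is requested.
func pathA(done chan<- struct{}) {
	layersLock.Lock()
	defer layersLock.Unlock()
	storageLock.Lock() // waits forever if pathB already holds storage.lock
	defer storageLock.Unlock()
	done <- struct{}{}
}

// pathB mimics the rest of store: storage.lock first, then layers.lock.
func pathB(done chan<- struct{}) {
	storageLock.Lock()
	defer storageLock.Unlock()
	layersLock.Lock() // waits forever if pathA already holds layers.lock
	defer layersLock.Unlock()
	done <- struct{}{}
}

func main() {
	done := make(chan struct{})
	go pathA(done)
	go pathB(done)
	<-done
	<-done // with unlucky scheduling this never completes: an AB-BA deadlock
}

The usual remedy is a single documented acquisition order that every code path follows, which is what the graphLock vs. C/I/L ordering rules mentioned above are about.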

@github-actions

github-actions bot commented Jan 3, 2023

A friendly reminder that this issue had no activity for 30 days.

@mtrmac
Collaborator

mtrmac commented Jan 3, 2023

containers/storage@8d8d6be is now included in the Podman main branch.

@mtrmac mtrmac closed this as completed Jan 3, 2023
@lukasmrtvy
Author

lukasmrtvy commented Jan 13, 2023

@mtrmac Hey, just tried https://download.copr.fedorainfracloud.org/results/rhcontainerbot/podman-next/fedora-37-x86_64/05226098-podman/podman-0.0.git.17669.93118464-1.fc37.x86_64.rpm on FCOS, installed via this snippet,

variant: fcos
version: 1.4.0
storage:
  files:
    - path: /etc/pki/rpm-gpg/rhcontainerbot-podman-next-fedora.gpg
      contents:
        inline: |
            -----BEGIN PGP PUBLIC KEY BLOCK-----

            mQENBGGpQ8EBCAC3a5QD6FsFzFPVsNDrIXNXz/yDOdh0PdUat6Fma26pB2ivar1K
            H03VrB8/zEmiM7qjgnR/Z3h0b6jkiOiqR39/+hQVKm2Hs222cnC8Tcj36FASpHDq
            GxVVceqs/9zMc1oJTcCkgfrLtOxsQxsLcln43z9e5TF/oSkFFoyhCtCyl8SBWrzb
            qk7ihoZVA/zn+CjyauniUwN+ezTSJj821AfeymF9mNzkqkh9HcPr2aze+0Mct/FV
            eUXfGJ2LQufvhgNHu/eg958Oz2mBJS2JUxDjoQSAYYXzWSmTTQ4ft2+aA9rlg16E
            88G1HnRsYXqail7UQM9Bspqnq5cd+JehEgodABEBAAG0VHJoY29udGFpbmVyYm90
            X3BvZG1hbi1uZXh0IChOb25lKSA8cmhjb250YWluZXJib3QjcG9kbWFuLW5leHRA
            Y29wci5mZWRvcmFob3N0ZWQub3JnPokBWAQTAQgAQhYhBEk3txShalNb9LOwGI5U
            Q5nYfes5BQJhqUPBAhsvBQkJZgGABQsJCAcCAyICAQYVCgkICwIEFgIDAQIeBwIX
            gAAKCRCOVEOZ2H3rOfdcB/4irNK45SoOjhMZpub+UjxqB6t7SSj1LucJVNJhLG7c
            80nTapczhoghk8zN5txdnPGKbwOzfYReB9NEb7w231yVu+JhIU/5kY2UTS3QA/uN
            UG0BBPZ+0yRuOYK4A/mX5BCceTzBkwRkPG7AM0tT7G5tJBt1Z6sQJMfj2drUKutn
            +x1wJDsK4Ga0rDlpXsn81aQ7gZAUKiUiKXbujGDLyjPYwvKkjNA74CgAqz60RlXJ
            /LtcbS2pQV1ZuIpaHXyO+82IOmVJmXNL1YNfni5M60f+3/A9pyiiw3c1C+iho77N
            SfsgYLo0HKX7m1i4Q181F+RhSRgKmWP9PVRJ1H0xhDNO
            =P6WX
            -----END PGP PUBLIC KEY BLOCK-----
    - path: /etc/yum.repos.d/podman-next.repo
      contents:
        inline: |
            [copr:copr.fedorainfracloud.org:rhcontainerbot:podman-next]
            name=Copr repo for podman-next owned by rhcontainerbot
            baseurl=https://download.copr.fedorainfracloud.org/results/rhcontainerbot/podman-next/fedora-$releasever-$basearch/
            type=rpm-md
            skip_if_unavailable=True
            gpgcheck=1
            gpgkey=file:///etc/pki/rpm-gpg/rhcontainerbot-podman-next-fedora.gpg
            repo_gpgcheck=0
            enabled=1
            enabled_metadata=1
systemd:
  units:
    - name: upgrade-podman.service
      enabled: true
      contents: |
        [Unit]
        Description=upgrade-podman
        After=network-online.target
        Requires=network-online.target
        Before=zincati.service
        ConditionPathExists=!/var/lib/%N.stamp

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStartPre=rpm-ostree refresh-md
        ExecStart=rpm-ostree override replace --experimental --freeze \
                    --from repo="copr:copr.fedorainfracloud.org:rhcontainerbot:podman-next" \
                    aardvark-dns conmon crun netavark podman containers-common containers-common-extra
        ExecStart=/usr/bin/rpm-ostree apply-live --allow-replacement
        ExecStart=/bin/touch /var/lib/%N.stamp
        Restart=on-failure
        RestartSec=1s

        [Install]
        WantedBy=multi-user.target

but there is still some deadlock.

@mtrmac
Collaborator

mtrmac commented Jan 13, 2023

@lukasmrtvy It’s hard to act upon “some deadlock”. Please file a new issue with the same kind of data as before (lslocks, a backtrace of all goroutines).

@lukasmrtvy
Author

lukasmrtvy commented Jan 13, 2023

[root@ip-10-1-38-116 core]# lslocks
COMMAND    PID  TYPE SIZE MODE   M START END PATH
podman    1643 POSIX      WRITE  0     0   0 /var/lib/containers/storage/userns.lock
podman  142736 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman  142155 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman  135246 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman  127802 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman    1643 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/storage.lock
podman    1643 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-layers/layers.lock
podman    1643 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-images/images.lock

Dump from the podman system service that runs with a custom socket for our application: https://transfer.sh/bgSNsd/pdm.dump (there is probably another podman system service in the OS, triggered via socket activation).
This time with a package built from ceeeea8.

@mtrmac
Collaborator

mtrmac commented Jan 13, 2023

This time with a package built from ceeeea8

I’m afraid that doesn’t match the backtrace; that backtrace seems to be consistent with this being unmodified Podman 4.3.1.

@lukasmrtvy
Author

lukasmrtvy commented Jan 13, 2023

podman info reports 4.4.0-dev. I guess I need to wait for a stable release then.

@mtrmac
Collaborator

mtrmac commented Jan 13, 2023

https://github.com/containers/podman/releases/tag/v4.4.0-rc1? But if there were some deeper reason for the mismatch between what crashes and what you have upgraded to, just updating to that might not make a difference.

@oseiberts11

In #16062 (comment) this issue is considered a locking bug inside Podman. We seem to have a deadlock in the XFS file system, triggered by Podman, and I am wondering whether it has the same explanation or not.

The issue occurs when starting a container for RabbitMQ with --user rabbitmq --userns=keep-id, on newer kernels when the native overlayfs is used and not fuse-overlayfs.

One thing that is sub-optimal is that RabbitMQ needs its home directory (/var/lib/rabbitmq) mounted in the container, but this is also where Podman stores all the container files; so effectively the container files are mounted into the container.

The kernel warns about processes being stuck; here is one of them:

2023-05-10 15:37:39 in4 kernel[-] warning: [   69.953197] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/compat278498705/lower1' does not support file handles, falling back to xino=off.
2023-05-10 15:37:40 in4 kernel[-] warning: [   69.975395] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/metacopy-check184881131/l1' does not support file handles, falling back to xino=off.
2023-05-10 15:37:40 in4 kernel[-] warning: [   69.980772] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/opaque-bug-check964911957/l2' does not support file handles, falling back to xino=off.
2023-05-10 15:37:44 in4 kernel[-] warning: [   74.073926] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/l/PJRXH6F3G7CISWKDG472AW6PUO' does not support file handles, falling back to xino=off.
2023-05-10 15:40:35 in4 kernel[-] err: [  245.356952] INFO: task kworker/9:0:73 blocked for more than 120 seconds.
2023-05-10 15:40:35 in4 kernel[-] err: [  245.363676]       Not tainted 5.15.0-71-generic #78~20.04.1-Ubuntu
2023-05-10 15:40:35 in4 kernel[-] err: [  245.369880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377744] task:kworker/9:0     state:D stack:    0 pid:   73 ppid:     2 flags:0x00004000
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377748] Workqueue: xfs-conv/dm-1 xfs_end_io [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377841] Call Trace:
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377844]  <TASK>
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377846]  __schedule+0x2cd/0x890
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377850]  ? xfs_buf_find.isra.0+0x331/0x780 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377915]  schedule+0x69/0x110
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377917]  schedule_timeout+0x206/0x2d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377919]  ? xfs_buf_get_map+0x52/0x420 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377983]  __down+0x84/0xf0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377985]  down+0x53/0x70
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377987]  xfs_buf_lock+0x32/0xc0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378051]  xfs_buf_find.isra.0+0x34a/0x780 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378116]  xfs_buf_get_map+0x52/0x420 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378179]  xfs_buf_read_map+0x53/0x350 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378243]  ? xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378295]  xfs_trans_read_buf_map+0x130/0x2a0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378368]  ? xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378421]  xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378473]  xfs_alloc_read_agf+0x39/0x1b0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378523]  ? xfs_perag_get+0x45/0xa0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378574]  xfs_refcount_finish_one+0xee/0x350 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378636]  xfs_refcount_update_finish_item+0x4a/0xb0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378706]  ? xfs_trans_get_cud+0x62/0x70 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378777]  xfs_defer_finish_noroll+0x1f4/0x600 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378835]  __xfs_trans_commit+0x17d/0x3b0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378905]  xfs_trans_commit+0x10/0x20 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378974]  xfs_reflink_end_cow_extent+0x230/0x2a0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379043]  xfs_reflink_end_cow+0x7c/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379112]  xfs_end_ioend+0xfc/0x190 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379185]  xfs_end_io+0xb0/0xe0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379244]  process_one_work+0x228/0x3d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379246]  worker_thread+0x4d/0x3f0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379248]  ? process_one_work+0x3d0/0x3d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379250]  kthread+0x127/0x150
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379252]  ? set_kthread_struct+0x50/0x50
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379254]  ret_from_fork+0x1f/0x30
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379258]  </TASK>

After this, more and more processes get blocked as they try to access the file system.

We are currently working around it by forcing the use of fuse-overlayfs instead of the native overlayfs.
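
For reference, forcing fuse-overlayfs is typically done by pointing the overlay driver at the fuse-overlayfs binary in storage.conf; a sketch of the relevant stanza (paths assume a standard package install, and fuse-overlayfs must be installed):

# /etc/containers/storage.conf (system-wide) or ~/.config/containers/storage.conf (rootless)
[storage]
driver = "overlay"

[storage.options.overlay]
# use the FUSE implementation instead of the kernel's native overlayfs
mount_program = "/usr/bin/fuse-overlayfs"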

The version of Podman we have is fairly old (3.4.2), but Ubuntu does not seem to have packaged a newer version for Ubuntu 20.04, so we could not check whether this fix also works for us.

@mheon
Member

mheon commented May 16, 2023

This sounds completely unrelated to the original bug (and, honestly, is most likely a kernel bug, not a Podman one). Please open a fresh issue (probably against Podman first, so we can be sure this is actually the kernel).

@oseiberts11

@mheon Yes, I will do that, and I will add some extra information that may help. I also think it is more likely a kernel bug (after all, userland programs should not be able to mess up the kernel), but I was not sure where else to report it.

github-actions bot added the locked - please file new issue/PR label Aug 23, 2023
github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023