Podman hangs, possible deadlock ? #16062

Closed
lukasmrtvy opened this issue Oct 5, 2022 · 22 comments
Labels
kind/bug · locked - please file new issue/PR · stale-issue

Comments

@lukasmrtvy

lukasmrtvy commented Oct 5, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description
It seems I am hitting the very same problem described in #1574 (4 years old), but with the latest Podman version (4.2.0).

My scenario is slightly different. I am running a "custom scheduler" in a container with the Podman socket (podman.sock) mounted into it, started as a systemd service (see the sketch of the unit after the steps below). The scheduler creates, manages, deletes, and waits for containers on our behalf (other operations such as attaching, streaming logs, and stats are also involved). When I restart this scheduler, Podman becomes unavailable. Containers created via the "custom scheduler" also use --userns=auto.

Steps to reproduce the issue:
(The scheduler's own business logic is not relevant here.)

  1. systemctl start scheduler.service ( podman run -v "/run/podman/podman.sock:/run/podman/podman.sock" myscheduler )

  2. systemctl restart scheduler.service
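
For context, a minimal sketch of the kind of unit described above. The unit name, dependency on podman.socket, image name, and restart policy are assumptions; only the socket mount and image from step 1 are taken from the report:

# /etc/systemd/system/scheduler.service (illustrative sketch)
[Unit]
Description=custom scheduler talking to the Podman API socket
After=podman.socket
Requires=podman.socket

[Service]
ExecStart=/usr/bin/podman run --rm --name scheduler \
    -v /run/podman/podman.sock:/run/podman/podman.sock \
    myscheduler
Restart=on-failure

[Install]
WantedBy=multi-user.target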

Describe the results you received:
Podman is not accessible via either the CLI or the API.

Describe the results you expected:
Podman remains accessible via both the CLI and the API.

Additional information you deem important (e.g. issue happens only occasionally):

Related logs:
journalctl -u podman --no-pager | grep 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944

...
Oct 05 16:50:32 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:27 +0000] "POST /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/start HTTP/1.1" 204 0 "" ""
Oct 05 16:50:37 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:32 +0000] "GET /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/json HTTP/1.1" 200 9033 "" ""
Oct 05 17:00:32 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:22 +0000] "POST /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/attach?stream=true&stdout=true&stderr=true HTTP/1.1" 200 0 "" ""
Oct 05 17:00:32 ip-10-2-22-82 podman[5526]: 2022-10-05 17:00:32.984086144 +0000 UTC m=+1448.094378702 container died 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944 (image=XXXX.dkr.ecr.us-east-1.amazonaws.com/XXXX:0NkyJzxqJQWVq2aet, name=lucid_mclaren, health_status=)
Oct 05 17:00:52 ip-10-2-22-82 podman[5526]: @ - - [05/Oct/2022:16:50:32 +0000] "POST /containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/wait HTTP/1.1" 200 30 "" "
...

ps -ef | grep 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944

...
root       13829       1  0 16:50 ?        00:00:00 /usr/bin/conmon --api-version 1 -c 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944 -u 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata -p /run/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/pidfile -n lucid_mclaren --exit-dir /run/libpod/exits --full-attach -s -l k8s-file:/var/lib/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/ctr.log --log-level info --log-size-max 10485760 --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/oci-log --conmon-pidfile /run/containers/storage/overlay-containers/04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg info --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944
root       44396   13829  0 17:00 ?        00:00:00 /usr/bin/podman --root /var/lib/containers/storage --runroot /run/containers/storage --log-level info --cgroup-manager systemd --tmpdir /run/libpod --network-config-dir  --network-backend netavark --volumepath /var/lib/containers/storage/volumes --runtime crun --storage-driver overlay --storage-opt overlay.mountopt=nodev,metacopy=on --events-backend journald container cleanup 04b5091d21ad06885e48915d9539997c1b58745a52e3178ce12995cf6b82f944
...

lslocks | grep 44396

...
podman   5482 POSIX      WRITE  0     0   0 /var/lib/containers/storage/userns.lock
podman  44396 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman   5482 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/storage.lock
podman   5482 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-layers/layers.lock
podman   5482 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-images/images.lock
...

ps -ef | grep 5482

root        5482       1 81 19:09 ?        00:17:37 /usr/bin/podman --log-level=info system service

strace podman ps

...
futex(0x56081b801868, FUTEX_WAIT_PRIVATE, 0, NULL

Output of podman version:

Client:       Podman Engine
Version:      4.2.0
API Version:  4.2.0
Go Version:   go1.18.4
Built:        Thu Aug 11 14:42:17 2022
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.27.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.0-2.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.0, commit: '
  cpuUtilization:
    idlePercent: 92.99
    systemPercent: 1.91
    userPercent: 5.09
  cpus: 16
  distribution:
    distribution: fedora
    variant: coreos
    version: "36"
  eventLogger: journald
  hostname: ip-10-2-22-82
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.18.18-200.fc36.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 123904901120
  memTotal: 133529747456
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.5-1.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.5
      commit: 54ebb8ca8bf7e6ddae2eb919f5b82d1d96863dea
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 1h 25m 45.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 34
    paused: 0
    running: 0
    stopped: 34
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 1098974756864
  graphRootUsed: 12981538816
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 11
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.2.0
  Built: 1660228937
  BuiltTime: Thu Aug 11 14:42:17 2022
  GitCommit: ""
  GoVersion: go1.18.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.2.0

Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):

podman-4.2.0-2.fc36.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS EC2 instance running Fedora CoreOS 36.20220820.3.0, with /var/lib/containers/storage/ mounted from an EBS volume

openshift-ci bot added the kind/bug label Oct 5, 2022
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Nov 11, 2022

@mheon PTAL

@lukasmrtvy
Author

lukasmrtvy commented Nov 22, 2022

Adding the output from a SIGABRT signal: https://pastebin.com/raw/wTuJ3jND (large output).
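
For anyone who needs to capture a similar dump: a Go binary exits with a full goroutine stack dump when it receives SIGABRT (or SIGQUIT), so something along these lines should work against the hung API service. The unit name podman.service and the PID lookup below are assumptions about this setup:

# find the hung API service process (PID 5482 in the logs above)
pgrep -af 'podman.*system service'

# ask the Go runtime to dump all goroutine stacks; note that the process exits afterwards
kill -ABRT <pid>

# the dump goes wherever the service's stderr goes, typically the journal
journalctl -u podman.service --no-pager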

@mheon
Member

mheon commented Nov 22, 2022

github.com/containers/podman/vendor/github.com/containers/storage.(*store).GraphDriver(0xc000520000)

Seems like a c/storage deadlock, given this.

@mheon
Member

mheon commented Nov 22, 2022

@nalind Any thoughts?

@nalind
Member

nalind commented Nov 28, 2022

Well, the backtrace includes one goroutine (1) waiting for a file lock, and no other goroutines on a path that would obviously have obtained that lock. The lslocks output lists two processes, so the backtrace is from the one that is waiting for the lock. The lock is currently held by the other process, so if something is acquiring it and just holding it (either because of a bug, or because it is doing something time-consuming while holding it), that is happening in the other process.

@lukasmrtvy
Author

Adding the SIGABRT output from the second process (podman system service): https://transfer.sh/SrjUkf/pdm.dump

@lukasmrtvy
Author

@nalind By the way, none of this happens with Docker (with Podman we are using --userns=auto and a few libpod endpoints instead of the compat ones, and there is of course the cgroup v1 vs. cgroup v2 difference).

@nalind
Member

nalind commented Dec 1, 2022

Confirmed: it's a locking bug. In goroutine 57243, the getMaxSizeFromImage() function calls LayerStore() before iterating through the list of known layers, and LayerStore() attempts to lock the layers.json file while reading it. The problem is that getMaxSizeFromImage() is being called from getAutoUserNS(), which is being called from CreateContainer(), which is already holding a write lock on that very file.
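
For readers following the analysis: the shape of the bug is a non-reentrant lock being re-acquired from a nested call. A minimal Go sketch of that pattern follows; the type and function names are illustrative stand-ins, not the actual c/storage code, which uses file locks rather than a sync.Mutex:

package main

import "sync"

// store stands in for the c/storage store object.
type store struct {
	layersLock sync.Mutex // stands in for overlay-layers/layers.lock
}

// layerStore plays the role of LayerStore(): it locks the layer list before reading it.
func (s *store) layerStore() {
	s.layersLock.Lock() // blocks forever if the caller already holds the lock
	defer s.layersLock.Unlock()
	// ... read layers.json ...
}

// createContainer plays the role of CreateContainer(): it takes the layers lock and then,
// via getAutoUserNS() -> getMaxSizeFromImage(), ends up calling layerStore() again.
func (s *store) createContainer() {
	s.layersLock.Lock()
	defer s.layersLock.Unlock()
	s.layerStore() // second Lock() on a lock this goroutine already holds: deadlock
}

func main() {
	(&store{}).createContainer() // hangs; the Go runtime aborts with "all goroutines are asleep - deadlock!"
}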

@nalind
Member

nalind commented Dec 1, 2022

... and I'm pretty sure @mtrmac fixed it in containers/storage#1387.

@mtrmac
Collaborator

mtrmac commented Dec 2, 2022

getMaxSizeFromImage trying to acquire an already-held layers.lock could happen if storage.lock triggered a reload.

In this log, in goroutine 57243, isn’t the blocking lock 0xc00011ecd0 store.graphLock = storage.lock, not layers.lock?

Either way, that getMaxSizeFromImage → LayerStore call is a problem, because it nests storage.lock inside layers.lock (held by CreateContainer), while the rest of store nests layers.lock inside storage.lock. In this case, it could be goroutine 62699, which holds store.lock but blocks on layers.lock.

It does seem that containers/storage#1387 fixes this … without even noticing that it was broken. Oops. Does anyone have suggestions for containers/storage#1389?

I’ll, at least, update containers/storage#1438 to document the graphLock vs. C/I/L lock-ordering rules.
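
The cross-goroutine variant described here is the classic lock-ordering (AB-BA) inversion: one path takes layers.lock and then storage.lock, the other takes storage.lock and then layers.lock, and each can end up waiting on the lock the other holds. A hedged sketch of that pattern, again with illustrative names rather than the real c/storage types:

package main

import "sync"

var (
	storageLock sync.Mutex // stands in for storage.lock (graphLock)
	layersLock  sync.Mutex // stands in for overlay-layers/layers.lock
)

// pathA mimics CreateContainer -> getAutoUserNS -> getMaxSizeFromImage:
// layers.lock is already held when storage.lock is requested.
func pathA(done chan<- struct{}) {
	layersLock.Lock()
	defer layersLock.Unlock()
	storageLock.Lock() // waits forever if pathB already holds storage.lock
	defer storageLock.Unlock()
	done <- struct{}{}
}

// pathB mimics the rest of store: storage.lock first, then layers.lock.
func pathB(done chan<- struct{}) {
	storageLock.Lock()
	defer storageLock.Unlock()
	layersLock.Lock() // waits forever if pathA already holds layers.lock
	defer layersLock.Unlock()
	done <- struct{}{}
}

func main() {
	done := make(chan struct{})
	go pathA(done)
	go pathB(done)
	<-done
	<-done // with unlucky scheduling this never completes: an AB-BA deadlock
}

The usual remedy is a single documented acquisition order that every code path follows, which is what the graphLock vs. C/I/L ordering rules mentioned above are about.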

@github-actions

github-actions bot commented Jan 3, 2023

A friendly reminder that this issue had no activity for 30 days.

@mtrmac
Collaborator

mtrmac commented Jan 3, 2023

containers/storage@8d8d6be is now included in the Podman main branch.

@mtrmac mtrmac closed this as completed Jan 3, 2023
@lukasmrtvy
Author

lukasmrtvy commented Jan 13, 2023

@mtrmac Hey, just tried https://download.copr.fedorainfracloud.org/results/rhcontainerbot/podman-next/fedora-37-x86_64/05226098-podman/podman-0.0.git.17669.93118464-1.fc37.x86_64.rpm on FCOS, installed via this snippet,

variant: fcos
version: 1.4.0
storage:
  files:
    - path: /etc/pki/rpm-gpg/rhcontainerbot-podman-next-fedora.gpg
      contents:
        inline: |
            -----BEGIN PGP PUBLIC KEY BLOCK-----

            mQENBGGpQ8EBCAC3a5QD6FsFzFPVsNDrIXNXz/yDOdh0PdUat6Fma26pB2ivar1K
            H03VrB8/zEmiM7qjgnR/Z3h0b6jkiOiqR39/+hQVKm2Hs222cnC8Tcj36FASpHDq
            GxVVceqs/9zMc1oJTcCkgfrLtOxsQxsLcln43z9e5TF/oSkFFoyhCtCyl8SBWrzb
            qk7ihoZVA/zn+CjyauniUwN+ezTSJj821AfeymF9mNzkqkh9HcPr2aze+0Mct/FV
            eUXfGJ2LQufvhgNHu/eg958Oz2mBJS2JUxDjoQSAYYXzWSmTTQ4ft2+aA9rlg16E
            88G1HnRsYXqail7UQM9Bspqnq5cd+JehEgodABEBAAG0VHJoY29udGFpbmVyYm90
            X3BvZG1hbi1uZXh0IChOb25lKSA8cmhjb250YWluZXJib3QjcG9kbWFuLW5leHRA
            Y29wci5mZWRvcmFob3N0ZWQub3JnPokBWAQTAQgAQhYhBEk3txShalNb9LOwGI5U
            Q5nYfes5BQJhqUPBAhsvBQkJZgGABQsJCAcCAyICAQYVCgkICwIEFgIDAQIeBwIX
            gAAKCRCOVEOZ2H3rOfdcB/4irNK45SoOjhMZpub+UjxqB6t7SSj1LucJVNJhLG7c
            80nTapczhoghk8zN5txdnPGKbwOzfYReB9NEb7w231yVu+JhIU/5kY2UTS3QA/uN
            UG0BBPZ+0yRuOYK4A/mX5BCceTzBkwRkPG7AM0tT7G5tJBt1Z6sQJMfj2drUKutn
            +x1wJDsK4Ga0rDlpXsn81aQ7gZAUKiUiKXbujGDLyjPYwvKkjNA74CgAqz60RlXJ
            /LtcbS2pQV1ZuIpaHXyO+82IOmVJmXNL1YNfni5M60f+3/A9pyiiw3c1C+iho77N
            SfsgYLo0HKX7m1i4Q181F+RhSRgKmWP9PVRJ1H0xhDNO
            =P6WX
            -----END PGP PUBLIC KEY BLOCK-----
    - path: /etc/yum.repos.d/podman-next.repo
      contents:
        inline: |
            [copr:copr.fedorainfracloud.org:rhcontainerbot:podman-next]
            name=Copr repo for podman-next owned by rhcontainerbot
            baseurl=https://download.copr.fedorainfracloud.org/results/rhcontainerbot/podman-next/fedora-$releasever-$basearch/
            type=rpm-md
            skip_if_unavailable=True
            gpgcheck=1
            gpgkey=file:///etc/pki/rpm-gpg/rhcontainerbot-podman-next-fedora.gpg
            repo_gpgcheck=0
            enabled=1
            enabled_metadata=1
systemd:
  units:
    - name: upgrade-podman.service
      enabled: true
      contents: |
        [Unit]
        Description=upgrade-podman
        After=network-online.target
        Requires=network-online.target
        Before=zincati.service
        ConditionPathExists=!/var/lib/%N.stamp

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStartPre=rpm-ostree refresh-md
        ExecStart=rpm-ostree override replace --experimental --freeze \
                    --from repo="copr:copr.fedorainfracloud.org:rhcontainerbot:podman-next" \
                    aardvark-dns conmon crun netavark podman containers-common containers-common-extra
        ExecStart=/usr/bin/rpm-ostree apply-live --allow-replacement
        ExecStart=/bin/touch /var/lib/%N.stamp
        Restart=on-failure
        RestartSec=1s

        [Install]
        WantedBy=multi-user.target

but there is still some deadlock.

@mtrmac
Collaborator

mtrmac commented Jan 13, 2023

@lukasmrtvy It’s hard to act upon “some deadlock”. Please file a new issue with the same kind of data as before (lslocks, a backtrace of all goroutines).

@lukasmrtvy
Author

lukasmrtvy commented Jan 13, 2023

[root@ip-10-1-38-116 core]# lslocks
COMMAND    PID  TYPE SIZE MODE   M START END PATH
podman    1643 POSIX      WRITE  0     0   0 /var/lib/containers/storage/userns.lock
podman  142736 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman  142155 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman  135246 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman  127802 POSIX  64B WRITE* 0     0   0 /var/lib/containers/storage/storage.lock
podman    1643 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/storage.lock
podman    1643 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-layers/layers.lock
podman    1643 POSIX  64B WRITE  0     0   0 /var/lib/containers/storage/overlay-images/images.lock

Dump from the podman system service that runs with a custom socket for our application: https://transfer.sh/bgSNsd/pdm.dump (there is probably another podman system service in the OS, triggered via socket activation).
This time with a package built from ceeeea8.

@mtrmac
Collaborator

mtrmac commented Jan 13, 2023

This time with a package built from ceeeea8

I’m afraid that doesn’t match the backtrace; that backtrace seems to be consistent with this being unmodified Podman 4.3.1.

@lukasmrtvy
Author

lukasmrtvy commented Jan 13, 2023

podman info reports 4.4.0-dev. I guess I need to wait for a stable release then.

@mtrmac
Collaborator

mtrmac commented Jan 13, 2023

https://github.com/containers/podman/releases/tag/v4.4.0-rc1? But if there were some deeper reason for the mismatch between what crashes and what you have upgraded to, just updating to that might not make a difference.

@oseiberts11

In #16062 (comment) this issue is considered a locking bug inside Podman. We seem to have a deadlock in the XFS file system, triggered by Podman, and I am wondering whether it has the same explanation or not.

The issue occurs when starting a container for RabbitMQ with --user rabbitmq --userns=keep-id, on newer kernels when the native overlayfs is used and not fuse-overlayfs.

One thing that is sub-optimal is that RabbitMQ needs its home directory (/var/lib/rabbitmq) mounted in the container, but this is also where Podman stores all the container files; so effectively the container files are mounted into the container.

The kernel warns about processes being stuck; here is one of them:

2023-05-10 15:37:39 in4 kernel[-] warning: [   69.953197] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/compat278498705/lower1' does not support file handles, falling back to xino=off.
2023-05-10 15:37:40 in4 kernel[-] warning: [   69.975395] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/metacopy-check184881131/l1' does not support file handles, falling back to xino=off.
2023-05-10 15:37:40 in4 kernel[-] warning: [   69.980772] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/opaque-bug-check964911957/l2' does not support file handles, falling back to xino=off.
2023-05-10 15:37:44 in4 kernel[-] warning: [   74.073926] overlayfs: fs on '/var/lib/rabbitmq/.local/share/containers/storage/overlay/l/PJRXH6F3G7CISWKDG472AW6PUO' does not support file handles, falling back to xino=off.
2023-05-10 15:40:35 in4 kernel[-] err: [  245.356952] INFO: task kworker/9:0:73 blocked for more than 120 seconds.
2023-05-10 15:40:35 in4 kernel[-] err: [  245.363676]       Not tainted 5.15.0-71-generic #78~20.04.1-Ubuntu
2023-05-10 15:40:35 in4 kernel[-] err: [  245.369880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377744] task:kworker/9:0     state:D stack:    0 pid:   73 ppid:     2 flags:0x00004000
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377748] Workqueue: xfs-conv/dm-1 xfs_end_io [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377841] Call Trace:
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377844]  <TASK>
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377846]  __schedule+0x2cd/0x890
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377850]  ? xfs_buf_find.isra.0+0x331/0x780 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377915]  schedule+0x69/0x110
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377917]  schedule_timeout+0x206/0x2d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377919]  ? xfs_buf_get_map+0x52/0x420 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377983]  __down+0x84/0xf0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377985]  down+0x53/0x70
2023-05-10 15:40:35 in4 kernel[-] info: [  245.377987]  xfs_buf_lock+0x32/0xc0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378051]  xfs_buf_find.isra.0+0x34a/0x780 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378116]  xfs_buf_get_map+0x52/0x420 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378179]  xfs_buf_read_map+0x53/0x350 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378243]  ? xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378295]  xfs_trans_read_buf_map+0x130/0x2a0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378368]  ? xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378421]  xfs_read_agf+0xa6/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378473]  xfs_alloc_read_agf+0x39/0x1b0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378523]  ? xfs_perag_get+0x45/0xa0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378574]  xfs_refcount_finish_one+0xee/0x350 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378636]  xfs_refcount_update_finish_item+0x4a/0xb0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378706]  ? xfs_trans_get_cud+0x62/0x70 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378777]  xfs_defer_finish_noroll+0x1f4/0x600 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378835]  __xfs_trans_commit+0x17d/0x3b0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378905]  xfs_trans_commit+0x10/0x20 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.378974]  xfs_reflink_end_cow_extent+0x230/0x2a0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379043]  xfs_reflink_end_cow+0x7c/0x130 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379112]  xfs_end_ioend+0xfc/0x190 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379185]  xfs_end_io+0xb0/0xe0 [xfs]
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379244]  process_one_work+0x228/0x3d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379246]  worker_thread+0x4d/0x3f0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379248]  ? process_one_work+0x3d0/0x3d0
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379250]  kthread+0x127/0x150
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379252]  ? set_kthread_struct+0x50/0x50
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379254]  ret_from_fork+0x1f/0x30
2023-05-10 15:40:35 in4 kernel[-] info: [  245.379258]  </TASK>

After this, more and more processes get blocked as they try to access the file system.

We are currently working around it by forcing the use of fuse-overlayfs instead of the native overlayfs.
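
For reference, forcing fuse-overlayfs is typically done by pointing the overlay driver at the fuse-overlayfs binary in storage.conf; a sketch of the relevant stanza (paths assume a standard package install, and fuse-overlayfs must be installed):

# /etc/containers/storage.conf (system-wide) or ~/.config/containers/storage.conf (rootless)
[storage]
driver = "overlay"

[storage.options.overlay]
# use the FUSE implementation instead of the kernel's native overlayfs
mount_program = "/usr/bin/fuse-overlayfs"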

The version of Podman we have is fairly old (3.4.2), but Ubuntu does not seem to have packaged a newer version for Ubuntu 20.04, so we could not check whether this fix also works for us.

@mheon
Member

mheon commented May 16, 2023

This sounds completely unrelated to the original bug (and, honestly, is most likely a kernel bug, not a Podman one). Please open a fresh issue (probably against Podman first, so we can be sure this is actually the kernel).

@oseiberts11

@mheon Yes, I will do that, and I will add some extra information that may help. I also think it is more likely a kernel bug (after all, userland programs should not be able to mess up the kernel), but I was not sure where else to report it.

github-actions bot added the locked - please file new issue/PR label Aug 23, 2023
github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023