
OCI runtime error: crun: error executing hook using podman --userns keep-id #46

Closed
osiler opened this issue Jan 30, 2023 · 6 comments

osiler commented Jan 30, 2023

Versions:

    podman -v
    podman version 4.3.1

    buildah -v
    buildah version 1.28.0 (image-spec 1.0.2-dev, runtime-spec 1.0.2-dev)

    nvidia-container-toolkit -version
    NVIDIA Container Runtime Hook version 1.12.0-rc.3
    commit: 14e587d55f2a4dc2e047a88e9acc2be72cb45af8

I am attempting to run containers with GPU access using rootless podman and the --userns keep-id flag. My current steps include:

Generating the CDI spec via:

	nvidia-ctk cdi generate > nvidia.yaml && sudo mkdir /etc/cdi && sudo mv nvidia.yaml /etc/cdi
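
For reference, the device names that podman can then reference (for example gpu0 or all) can be read back out of the generated spec; this is only an illustrative check, and the exact names depend on the toolkit version:

    # list the CDI device names defined in the generated spec
    grep 'name:' /etc/cdi/nvidia.yaml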

Attempt 1: Fails

podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"

Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/controlD69": no such file or directory

I then removed references to the following from the devices section of /etc/cdi/nvidia.yaml:

    - path: /dev/dri/card5 
    - path: /dev/dri/controlD69
    - path: /dev/dri/renderD129

and removed the create-symlinks hooks from the devices section:

    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card5::/dev/dri/by-path/pci-0000:58:00.0-card
      - --link
      - ../renderD129::/dev/dri/by-path/pci-0000:58:00.0-render
      hookName: createContainer
      path: nvidia-ctk

Finally, I also removed the nvidia-ctk hook that changes the permissions of the /dev/dri path:

  - args:
    - nvidia-ctk
    - hook
    - chmod
    - --mode
    - "755"
    - --path
    - /dev/dri
    hookName: createContainer
    path: /usr/bin/nvidia-ctk

Attempt 2: Pass (missing selinux modules)

podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

I am not concerned about this error; I believe I just need to amend some SELinux policy modules as specified here.
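
For reference, a minimal way to confirm the failure is label related (only a sketch, assuming SELinux confinement is what blocks the device files) is to rerun the container with labelling disabled:

    # illustrative check: disable SELinux label separation for this container only
    podman run --rm --security-opt label=disable \
        --device nvidia.com/gpu=gpu0 \
        docker.io/pytorch/pytorch \
        python -c "import torch; print(torch.cuda.get_device_name(0))"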

However, if I attempt to run the above with the --userns keep-id flag, it fails:

Attempt 3: Fail

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)

I have also tried the different combinations of the load-kmods and no-cgroups flags in /etc/nvidia-container-runtime/config.toml.
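
For reference, both flags live under the [nvidia-container-cli] section of that file; the excerpt below is illustrative and the defaults vary by toolkit version:

    # show the relevant section of the legacy hook's configuration
    grep -A 20 '^\[nvidia-container-cli\]' /etc/nvidia-container-runtime/config.toml
    # keys of interest (values here are only an assumed example):
    #   load-kmods = true     # whether the hook may load kernel modules
    #   no-cgroups = false    # commonly set to true for rootless setups with the legacy hook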

A lot of this troubleshooting has been guided by the following links.

I am unsure about the lifecycle of the permissions when running these hooks; however, it looks like the first point where the mapped permissions may not add up is here.

elezar (Member) commented Jan 30, 2023

Thanks for the detailed report. The final error that I see:

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)

indicates that the original hook is still being detected and injected. When using CDI it's important that this is not the case. Please remove the installed hook and repeat the run.
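
For example, a leftover hook definition can usually be found in one of the OCI hook directories podman consults (illustrative paths; the exact file name may differ):

    # look for a legacy NVIDIA hook JSON such as oci-nvidia-hook.json
    ls /usr/share/containers/oci/hooks.d/ /etc/containers/oci/hooks.d/ 2>/dev/null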

With regards to the /dev/dri/ devices that cannot be found: do the devices exist on your system, and what are their permissions? The error you are seeing indicates that our detection logic around them is not as robust as it should be, and we will work on getting a fix out. Would you be willing to test a build that would address this -- assuming that we can get the container running with your modified CDI specification?
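
For example (an illustrative check only):

    # show the device nodes and their owners/permissions on the host
    ls -l /dev/dri/ /dev/dri/by-path/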

osiler (Author) commented Jan 30, 2023

Thanks, it looks like the hook was left behind during my attempts to get this working with various package versions. It would be a nice inclusion to remove this hook (or alert the user to the conflict) during an update.

Attempt 4: Pass (missing selinux modules)

cd /usr/share/containers/oci/hooks.d && sudo rm oci-nvidia-hook.json

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available


For /dev/dri I have various /dev/dri/cardX devices owned by root in the video group, and /dev/dri/renderDXXX devices owned by root in the render group. Everything under /dev/dri/by-path is owned by root.

I am happy to assist with release testing.

elezar (Member) commented Jan 30, 2023

I have looked into the issue with the /dev/dri nodes and noted that it should be addressed as of https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/260, which is available on main.

Would you be able to repeat your experiments with a CDI spec generated from the HEAD of main? It should then not be required to modify the spec at all, since only the /dev/dri/ device nodes actually present on your host will be added to the spec. Note that for crun in particular, the creation of the /dev/dri folder in the container may be required in this case; there was an issue with how "nested" device nodes such as this were being handled.
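
A rough sketch of regenerating the spec from a source build (this assumes the repository's standard Go layout and an installed Go toolchain; the project's own build scripts may differ):

    git clone https://gitlab.com/nvidia/container-toolkit/container-toolkit.git
    cd container-toolkit
    # build only the CLI; illustrative, not the official build procedure
    go build -o nvidia-ctk ./cmd/nvidia-ctk
    ./nvidia-ctk cdi generate > nvidia.yaml && sudo mv nvidia.yaml /etc/cdi/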

Also with regards to:

Attempt 4: Pass (missing selinux modules)

What do you mean by missing selinux modules? Any information you could provide here as to how you are able to work around this would be much appreciated.

osiler (Author) commented Feb 5, 2023

I checked out HEAD but have run into trouble getting the build to work correctly with podman and podman-docker as the runner. Currently I do not have a native docker install on my dev machine and have not dug into the issues of running both alongside each other, as specified here. I am assuming you currently run the build script with docker as the runner?

emanuelbuholzer commented:

With the latest version of Podman and NVIDIA Container Toolkit 1.13.1, this now runs just fine on my machine:

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"

@osiler osiler closed this as completed Jun 4, 2023
@osiler osiler reopened this Jun 4, 2023
@osiler osiler closed this as completed Jun 4, 2023
joefiorini commented:

I had this issue as well on Fedora Kinoite (Silverblue). @emanuelbuholzer's command did not work for me immediately; I kept getting:

Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=gpu0

I could not find the actual name of the GPU's device file, but I did figure out that I could use nvidia.com/gpu=all, which turned the error into:

Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)

According to NVIDIA's documentation, this is because CDI and the NVIDIA container runtime hook are incompatible. To disable the hook I added --runtime crun --hooks-dir "" to the podman command. That brought me down to a Python stack trace:

RuntimeError: No CUDA GPUs are available

To fix this I had to disable SELinux label separation (or lower the security settings, something like that) with --security-opt label=disable. Altogether, the final command that worked for me was:

podman --runtime crun --hooks-dir "" run --rm --security-opt label=disable --device nvidia.com/gpu=all --userns keep-id devel-jupyterlab python -c "import torch; print(torch.cuda.get_device_name(0))"
