
pytorch could not detect Nvidia driver on bottlerocket #3916

Open
chulkilee opened this issue Apr 25, 2024 · 6 comments
Labels
area/accelerated-computing (Issues related to GPUs/ASICs) · type/bug (Something isn't working)

Comments

chulkilee commented Apr 25, 2024

Sorry, I don't have all the details, but I'd like to report that I had issues using PyTorch on the Bottlerocket image for EKS.

When I switched to the AL2 GPU AMI, it worked without an issue.

  • EKS 1.29
  • node group with the default launch template (so the latest AMI image)
  • instance type: g4dn.xlarge
  • The EKS cluster doesn't use the NVIDIA device driver / GPU operator

AMI

  • BOTTLEROCKET_X86_64_NVIDIA: ami-0d31d8d1285f91827 - bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.19.4-4f0a078e
  • AL2_x86_64_GPU: ami-093bb52bc444e09ba - amazon-eks-gpu-node-1.29-v20240415

In both AMIs the NVIDIA kernel module seems to be loaded, but with different params.

cat /proc/driver/nvidia/version

BOTTLEROCKET_x86_64_NVIDIA:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.161.07  Sat Feb 17 22:55:48 UTC 2024
GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)

AL2_x86_64_GPU:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.161.08  Tue Mar  5 22:42:15 UTC 2024
GCC version:  gcc version 10.5.0 20230707 (Red Hat 10.5.0-1) (GCC)

cat /proc/driver/nvidia/params

BOTTLEROCKET_x86_64_NVIDIA:

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

AL2_x86_64_GPU:

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 0
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

However, PyTorch failed to detect the driver on Bottlerocket.

Only in BOTTLEROCKET_x86_64_NVIDIA:

python -c "import torch; torch.cuda.current_device()"

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
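
For reference, a few in-container checks that can narrow down whether the problem is the device nodes, the user-space driver libraries, or PyTorch itself (a sketch; it assumes nvidia-smi and ldconfig are available in the image):

# Are the device nodes visible inside the container?
ls -l /dev/nvidia*

# Is the user-space driver stack injected by the container runtime?
nvidia-smi
ldconfig -p | grep -i libcuda

# What does PyTorch itself report?
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"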

python -m torch.utils.collect_env:

BOTTLEROCKET_x86_64_NVIDIA:

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.82-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

AL2_x86_64_GPU:

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.213-201.855.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Python packages used:

[pip3] numpy==1.26.3
[pip3] pytorch-lightning==2.1.3
[pip3] pytorch-metric-learning==2.4.1
[pip3] torch==2.0.1+cu117
[pip3] torch-audiomentations==0.11.0
[pip3] torch-pitch-shift==1.2.4
[pip3] torchaudio==2.0.2
[pip3] torchmetrics==1.3.0.post0

Could it be related to awslabs/amazon-eks-ami#1523?

chulkilee added the status/needs-triage (Pending triage or re-evaluation) and type/bug (Something isn't working) labels on Apr 25, 2024
@chulkilee (Author)

If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues instead, please let me know.

yeazelm (Contributor) commented Apr 25, 2024

Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances but you could follow #3817 (comment) just to confirm that isn't the problem.

The difference in the output between Bottlerocket and Amazon Linux for the module config is:

Bottlerocket: ModifyDeviceFiles: 1
Amazon Linux: ModifyDeviceFiles: 0

Bottlerocket: EnableGpuFirmware: 18
Amazon Linux: EnableGpuFirmware: 0

EnableGpuFirmware is the GSP change and ModifyDeviceFiles will disable dynamic device file management when set to 0.
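
If you want to confirm whether GSP firmware is actually in use on the node, a couple of quick checks (a sketch; it assumes you have a shell on the host, e.g. via the admin container, and that the module exposes its parameters under /sys/module/nvidia):

cat /sys/module/nvidia/parameters/NVreg_EnableGpuFirmware
nvidia-smi -q | grep -i "GSP Firmware"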

What is strange is that PyTorch is reporting that CUDA is not available when it really should be, since the other things you called out are there.

Can you also confirm what your podspec looks like just to make sure all the right settings are being passed from that perspective?
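
If it helps, a quick way to dump the relevant parts of the pod spec (a sketch; replace <pod-name> and <namespace> with your values):

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
kubectl exec <pod-name> -n <namespace> -- env | grep -i nvidia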

yeazelm (Contributor) commented Apr 27, 2024

Hello @chulkilee, I just tried using an image from NVIDIA to confirm that PyTorch can see the devices on a g4dn.xlarge node with the latest Bottlerocket, and I don't get the same issue:

# python -c "import torch; print(torch.cuda.get_device_name(0))"
Tesla T4

Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate with the image I got.

@bryantbiggs

@chulkilee do your container images contain the following environment variables?

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

If not, I would suggest adding them.

chulkilee (Author) commented May 6, 2024

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

Those were set. I'm using the nvidia/cuda:11.8.0-base-ubuntu22.04 image, but it's still failing.

Update

declare -x CUDA_VERSION="11.8.0"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516"
declare -x NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-8"
declare -x NV_CUDA_CUDART_VERSION="11.8.89-1"

Even if I unset NVIDIA_REQUIRE_CUDA, it still fails with the same error.

I also tested the same image with the 1.19.4-4f0a078e and 1.19.5-64049ba8 AMI releases; both failed.

@arnaldo2792 (Contributor)

@chulkilee, are you requesting GPUs in your pod specs? Or do you need to oversubscribe your GPUs, and thus use NVIDIA_VISIBLE_DEVICES=all to get access to all the GPUs in the instance from your pod?
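
For context, requesting a GPU from the device plugin is normally done with the nvidia.com/gpu extended resource on the container; an illustrative pod-spec fragment (not your actual spec) would look like:

resources:
  limits:
    nvidia.com/gpu: 1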

vigh-m added the area/accelerated-computing (Issues related to GPUs/ASICs) label and removed the status/needs-triage (Pending triage or re-evaluation) label on May 14, 2024