
pytorch could not detect Nvidia driver on bottlerocket #3916

Open
chulkilee opened this issue Apr 25, 2024 · 6 comments
Labels
area/accelerated-computing (Issues related to GPUs/ASICs) · type/bug (Something isn't working)

Comments

chulkilee commented Apr 25, 2024

Sorry, I don't have all the details, but I'd like to report that I had issues using PyTorch on the Bottlerocket image for EKS.

When I switched to the AL2 GPU AMI, it worked without an issue.

  • EKS 1.29
  • node group with the default launch template (so the latest AMI image)
  • instance type: g4dn.xlarge
  • The EKS cluster doesn't use the NVIDIA device driver / GPU operator

AMI

  • BOTTLEROCKET_X86_64_NVIDIA: ami-0d31d8d1285f91827 - bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.19.4-4f0a078e
  • AL2_x86_64_GPU: ami-093bb52bc444e09ba - amazon-eks-gpu-node-1.29-v20240415

In both AMIs the NVIDIA kernel module seems to be loaded, but with different params.

cat /proc/driver/nvidia/version

BOTTLEROCKET_x86_64_NVIDIA:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.161.07  Sat Feb 17 22:55:48 UTC 2024
GCC version:  gcc version 11.3.0 (Buildroot 2022.11.1)

AL2_x86_64_GPU:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.161.08  Tue Mar  5 22:42:15 UTC 2024
GCC version:  gcc version 10.5.0 20230707 (Red Hat 10.5.0-1) (GCC)

cat /proc/driver/nvidia/params

BOTTLEROCKET_x86_64_NVIDIA:

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

AL2_x86_64_GPU:

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 0
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

However, PyTorch failed to detect the driver on Bottlerocket.

Only in BOTTLEROCKET_x86_64_NVIDIA:

python -c "import torch; torch.cuda.current_device()"

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
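
For reference, a few in-container checks that can narrow down whether the problem is the device nodes, the user-space driver libraries, or PyTorch itself (a sketch; it assumes nvidia-smi and ldconfig are available in the image):

# Are the device nodes visible inside the container?
ls -l /dev/nvidia*

# Is the user-space driver stack injected by the container runtime?
nvidia-smi
ldconfig -p | grep -i libcuda

# What does PyTorch itself report?
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"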

python -m torch.utils.collect_env:

BOTTLEROCKET_x86_64_NVIDIA:

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.82-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

AL2_x86_64_GPU:

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.213-201.855.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Python packages used:

[pip3] numpy==1.26.3
[pip3] pytorch-lightning==2.1.3
[pip3] pytorch-metric-learning==2.4.1
[pip3] torch==2.0.1+cu117
[pip3] torch-audiomentations==0.11.0
[pip3] torch-pitch-shift==1.2.4
[pip3] torchaudio==2.0.2
[pip3] torchmetrics==1.3.0.post0

Could it be related to awslabs/amazon-eks-ami#1523?

chulkilee added the status/needs-triage (Pending triage or re-evaluation) and type/bug (Something isn't working) labels on Apr 25, 2024
@chulkilee (Author)

If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues instead, please let me know.

yeazelm (Contributor) commented Apr 25, 2024

Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances but you could follow #3817 (comment) just to confirm that isn't the problem.

The difference in the output between Bottlerocket and Amazon Linux for the module config is:

Bottlerocket: ModifyDeviceFiles: 1
Amazon Linux: ModifyDeviceFiles: 0

Bottlerocket: EnableGpuFirmware: 18
Amazon Linux: EnableGpuFirmware: 0

EnableGpuFirmware is the GSP change and ModifyDeviceFiles will disable dynamic device file management when set to 0.
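
If you want to confirm whether GSP firmware is actually in use on the node, a couple of quick checks (a sketch; it assumes you have a shell on the host, e.g. via the admin container, and that the module exposes its parameters under /sys/module/nvidia):

cat /sys/module/nvidia/parameters/NVreg_EnableGpuFirmware
nvidia-smi -q | grep -i "GSP Firmware"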

What is strange is that PyTorch is reporting that CUDA is not available when it really should be, since the other things you called out are there.

Can you also confirm what your podspec looks like just to make sure all the right settings are being passed from that perspective?
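
If it helps, a quick way to dump the relevant parts of the pod spec (a sketch; replace <pod-name> and <namespace> with your values):

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
kubectl exec <pod-name> -n <namespace> -- env | grep -i nvidia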

yeazelm (Contributor) commented Apr 27, 2024

Hello @chulkilee, I just tried using an image from NVIDIA to confirm that PyTorch can see the devices on a g4dn.xlarge node with the latest Bottlerocket, and I don't get the same issue:

# python -c "import torch; print(torch.cuda.get_device_name(0))"
Tesla T4

Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate with the image I got.

@bryantbiggs

@chulkilee do your container images contain the following environment variables?

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

If not, I would suggest adding them.

chulkilee (Author) commented May 6, 2024

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

Those were set. I'm using the nvidia/cuda:11.8.0-base-ubuntu22.04 image, but it's still failing.

Update

declare -x CUDA_VERSION="11.8.0"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516"
declare -x NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-8"
declare -x NV_CUDA_CUDART_VERSION="11.8.89-1"

Even if I unset NVIDIA_REQUIRE_CUDA, it still fails with the same error.

I also tested the same image with the 1.19.4-4f0a078e and 1.19.5-64049ba8 AMI releases; both failed.

@arnaldo2792 (Contributor)

@chulkilee, are you requesting GPUs in your pod specs? Or do you need to oversubscribe your GPUs, and thus use NVIDIA_VISIBLE_DEVICES=all to get access to all the GPUs in the instance from your pod?
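
For context, requesting a GPU from the device plugin is normally done with the nvidia.com/gpu extended resource on the container; an illustrative pod-spec fragment (not your actual spec) would look like:

resources:
  limits:
    nvidia.com/gpu: 1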

vigh-m added the area/accelerated-computing (Issues related to GPUs/ASICs) label and removed the status/needs-triage (Pending triage or re-evaluation) label on May 14, 2024