pytorch could not detect Nvidia driver on bottlerocket #3916
Comments
If this should be reported to https://github.com/awslabs/amazon-eks-ami/issues instead, please let me know.
Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances but you could follow #3817 (comment) just to confirm that isn't the problem. The difference in the output between Bottlerocket and Amazon Linux for the module config is:
EnableGpuFirmware is the GSP change, and setting ModifyDeviceFiles to 0 disables dynamic device file management. What is strange is that pytorch reports that CUDA is not available when it really should be, since the other things you called out are there. Can you also share your podspec, just to make sure all the right settings are being passed from that perspective?
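To make the module-parameter comparison easier to repeat, here is a minimal sketch that diffs two captures of `cat /proc/driver/nvidia/params`. It assumes the file consists of `Name: value` lines; the helper names `parse_params` and `diff_params` are hypothetical and not part of any NVIDIA or Bottlerocket tooling.

```python
def parse_params(text: str) -> dict[str, str]:
    """Parse `Name: value` lines (assumed format of /proc/driver/nvidia/params)."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            params[key.strip()] = value.strip()
    return params


def diff_params(left: dict[str, str], right: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (left_value, right_value)} for every parameter that differs.

    A parameter missing on one side is reported with None on that side.
    """
    names = sorted(left.keys() | right.keys())
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }
```

One way to use it is to save the output from a Bottlerocket node and from an AL2 node into two text files, read each file, pass the contents to `parse_params`, and print the result of `diff_params`. Only the parameters that differ (such as EnableGpuFirmware and ModifyDeviceFiles, per the comment above) would then show up.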
Hello @chulkilee, I just tried using an image from NVIDIA to confirm that pytorch can see the devices on a g4dn.xlarge node with latest bottlerocket and I don't get the same issue:
Can you confirm which base container you are using and which CUDA version is included? I'm not able to replicate the issue with the image I used.
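For a quick check from inside the container, a small sketch like the following could help narrow things down. It lists the `/dev/nvidia*` device nodes visible to the container and, if pytorch is installed, reports what `torch` sees. The function name `cuda_report` is hypothetical; the `torch` calls used are `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.device_count()`.

```python
import glob


def cuda_report() -> dict:
    """Collect basic GPU visibility facts from inside a container."""
    # Device nodes the container can see (empty list if none are mounted).
    report = {"device_nodes": sorted(glob.glob("/dev/nvidia*"))}
    try:
        import torch
    except ImportError:
        # pytorch is not installed in this image; report only device nodes.
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    # CUDA version pytorch was built against (None for CPU-only builds).
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If the device nodes are present but `cuda_available` is false, that points at the driver libraries or the container runtime configuration rather than at the kernel module. For fuller output, `python -m torch.utils.collect_env` remains the better tool.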
@chulkilee do your container images contain the following environment variables?
If not, I would suggest adding them.
Those were set. I'm using Update
Even I unset I also tested the same image with
@chulkilee, are you requesting GPUs in your pod specs? Or, do you need to oversubscribe your GPUs and thus you use
Sorry, I don't have all the details, but I'd like to report that I had issues using pytorch on the Bottlerocket image for EKS.
When I switch to AL2 GPU AMI, it worked without an issue.
AMI
In both AMIs the NVIDIA kernel module appears to be loaded, but with different parameters.
cat /proc/driver/nvidia/version
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
cat /proc/driver/nvidia/params
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
However, pytorch failed to detect the driver on Bottlerocket.
Only in BOTTLEROCKET_x86_64_NVIDIA:
python -m torch.utils.collect_env
BOTTLEROCKET_x86_64_NVIDIA:
AL2_x86_64_GPU:
Used python packages
Could it be related to awslabs/amazon-eks-ami#1523 ?