
Small wheels for 2.1.0 release candidate are not usable on AmazonLinux #109221

Closed
malfet opened this issue Sep 13, 2023 · 7 comments
Labels
high priority
module: binaries (Anything related to official binaries that we release to users)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: regression (It used to work, and now it doesn't)
triage review
Milestone

Comments

@malfet
Contributor

malfet commented Sep 13, 2023

🐛 Describe the bug

Run:

% docker run --rm --gpus all -it amazonlinux:latest  bash -c "yum install -y python3-pip; pip3 install torch --extra-index-url https://download.pytorch.org/whl/test/cu121_pypi_cudnn; python3 -c 'import torch'"
...
Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 174, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcufft.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 234, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 195, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 160, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
ValueError: libnvrtc.so.*[0-9].*[0-9] not found in the system path ['', '/usr/lib64/python39.zip', '/usr/lib64/python3.9', '/usr/lib64/python3.9/lib-dynload', '/usr/local/lib64/python3.9/site-packages', '/usr/local/lib/python3.9/site-packages', '/usr/lib64/python3.9/site-packages', '/usr/lib/python3.9/site-packages']

Please note that the above works as expected with the 2.0 release, i.e. #88869 was reintroduced on trunk and the 2.1.0 branch

Versions

2.1.0

cc @ezyang @gchanan @zou3519 @kadeng @seemethere @ptrblck

@malfet malfet added the high priority, module: binaries, module: cuda and module: regression labels Sep 13, 2023
@malfet malfet added this to the 2.1.0 milestone Sep 13, 2023
@atalman
Contributor

atalman commented Sep 13, 2023

Can you also please include the install command and what version is getting installed?

@atalman
Contributor

atalman commented Sep 13, 2023

Trying to install on Amazon Linux 2 installs 1.13.1 for some reason:

pip3 install torch --extra-index-url --extra-index-url  https://download.pytorch.org/whl/test/cu121_pypi_cudnn
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
WARNING: The index url "--extra-index-url" seems invalid, please provide a scheme.
Looking in indexes: https://pypi.org/simple, --extra-index-url
Collecting https://download.pytorch.org/whl/test/cu121_pypi_cudnn
  Downloading https://download.pytorch.org/whl/test/cu121_pypi_cudnn (1.5 kB)
  ERROR: Cannot unpack file /tmp/pip-unpack-_aiuujx8/cu121_pypi_cudnn.html (downloaded from /tmp/pip-req-build-urwtwjek, content-type: text/html); cannot detect archive format
ERROR: Cannot determine archive format of /tmp/pip-req-build-urwtwjek
WARNING: Url '--extra-index-url/pip/' is ignored. It is either a non-existing path or lacks a specific scheme.
[root@ip-10-0-56-136 tmp]# pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/test/cu121_pypi_cudnn
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/test/cu121_pypi_cudnn
Collecting torch
  Downloading torch-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (887.5 MB)
     |████████████████████████████████| 887.5 MB 5.5 kB/s
Collecting nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux"
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     |████████████████████████████████| 21.0 MB 77.2 MB/s
Collecting nvidia-cudnn-cu11==8.5.0.96; platform_system == "Linux"
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     |████████████████████████████████| 557.1 MB 6.5 kB/s
Collecting typing-extensions
  Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99; platform_system == "Linux"
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     |████████████████████████████████| 849 kB 102.2 MB/s
Collecting nvidia-cublas-cu11==11.10.3.66; platform_system == "Linux"
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     |████████████████████████████████| 317.1 MB 30 kB/s
Collecting wheel
  Downloading wheel-0.41.2-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 7.7 MB/s
Requirement already satisfied: setuptools in /usr/lib/python3.7/site-packages (from nvidia-cuda-runtime-cu11==11.7.99; platform_system == "Linux"->torch) (49.1.3)
Installing collected packages: nvidia-cuda-nvrtc-cu11, wheel, nvidia-cublas-cu11, nvidia-cudnn-cu11, typing-extensions, nvidia-cuda-runtime-cu11, torch
  WARNING: The script wheel is installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts convert-caffe2-to-onnx, convert-onnx-to-caffe2 and torchrun are installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.1 typing-extensions-4.7.1 wheel-0.41.2

Manually downloading the correct wheel:

curl -O https://download.pytorch.org/whl/test/cu121_pypi_cudnn/torch-2.1.0%2Bcu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  640M  100  640M    0     0  26.8M      0  0:00:23  0:00:23 --:--:-- 27.4M
[root@ip-10-0-56-136 tmp]# ls
torch-2.1.0%2Bcu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl
[root@ip-10-0-56-136 tmp]# pip3 install torch-2.1.0%2Bcu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
ERROR: torch-2.1.0+cu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl is not a supported wheel on this platform.
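For context on the "not a supported wheel on this platform" error (my reading, not stated explicitly in the thread): the downloaded wheel is tagged cp311, while this Amazon Linux 2 box runs CPython 3.7 (note the `/usr/lib/python3.7/site-packages` path in the pip output). The interpreter tag pip requires can be checked like this:

```python
import sys

# A wheel's 'cpXY' tag must match the running interpreter for pip to accept it.
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(py_tag)  # 'cp37' on Amazon Linux 2's default python3, which rejects a cp311 wheel
```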

Same command works on our validation environment:
https://github.com/pytorch/builder/actions/runs/6171808682/job/16751980235#step:11:2156

@atalman
Contributor

atalman commented Sep 13, 2023

Here is metadata for this wheel:

Metadata-Version: 2.1
Name: torch
Version: 2.1.0+cu121.with.pypi.cudnn
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Download-URL: https://github.com/pytorch/pytorch/tags
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Keywords: pytorch,machine learning
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0)
Requires-Dist: nvidia-cuda-nvrtc-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu12 (==8.9.2.26) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu12 (==12.1.3.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu12 (==11.0.2.54) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu12 (==10.3.2.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu12 (==11.4.5.107) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu12 (==12.1.0.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu12 (==2.18.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.1.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'

Here is the metadata for 2.0.1:

Metadata-Version: 2.1
Name: torch
Version: 2.0.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Download-URL: https://github.com/pytorch/pytorch/tags
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Keywords: pytorch,machine learning
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'

@atalman
Contributor

atalman commented Sep 13, 2023

Looks like we need to strip `Requires-Dist: pytorch-triton (==2.1.0)` from the metadata and rebuild this wheel

@malfet
Contributor Author

malfet commented Sep 13, 2023

Trying to install on Amazon Linux 2 installs 1.13.1 for some reason:

pip3 install torch --extra-index-url --extra-index-url  https://download.pytorch.org/whl/test/cu121_pypi_cudnn

@atalman remove the 2nd `--extra-index-url` argument and it will install 2.1.0 for you

Same command works on our validation environment

The same command works fine on Ubuntu 22.04, because RedHat-flavored distros are the only ones that have different system and local installation folders
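The purelib/platlib split described above can be inspected with the standard `sysconfig` module; the example paths in the comments are what a RedHat-flavored install typically reports, and are an assumption rather than output captured from this reproduction:

```python
import sysconfig

paths = sysconfig.get_paths()
# Pure-Python packages (like the nvidia-*-cu12 wheels) install into 'purelib';
# packages with compiled extensions (like torch) install into 'platlib'.
print(paths["purelib"])  # e.g. /usr/local/lib/python3.9/site-packages
print(paths["platlib"])  # e.g. /usr/local/lib64/python3.9/site-packages
# On Debian/Ubuntu both keys point at the same directory, so torch finds the
# CUDA libs next to itself; on RedHat-flavored distros they differ, which is
# what triggers the fallback search that then hits the broken glob patterns.
```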

@malfet
Contributor Author

malfet commented Sep 13, 2023

Ok, the regression is due to library name changes between the nvidia-*-cu11 and nvidia-*-cu12 PyPI packages (i.e. the lack of `libcuXYZ.so.${MAJOR}.${MINOR}`)

The following diff fixes the problem:

# diff -u /usr/local/lib64/python3.9/site-packages/torch/__init__.py __init__.py 
--- /usr/local/lib64/python3.9/site-packages/torch/__init__.py	2023-09-13 19:25:52.602695425 +0000
+++ __init__.py	2023-09-13 19:25:27.010624132 +0000
@@ -174,13 +174,13 @@
         ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
     except OSError as err:
         # Can only happen for wheel with cuda libs as PYPI deps
-        # As PyTorch is not purelib, but nvidia-*-cu11 is
+        # As PyTorch is not purelib, but nvidia-*-cu12 is
         cuda_libs: Dict[str, str] = {
             'cublas': 'libcublas.so.*[0-9]',
             'cudnn': 'libcudnn.so.*[0-9]',
-            'cuda_nvrtc': 'libnvrtc.so.*[0-9].*[0-9]',
-            'cuda_runtime': 'libcudart.so.*[0-9].*[0-9]',
-            'cuda_cupti': 'libcupti.so.*[0-9].*[0-9]',
+            'cuda_nvrtc': 'libnvrtc.so.*[0-9]',
+            'cuda_runtime': 'libcudart.so.*[0-9]',
+            'cuda_cupti': 'libcupti.so.*[0-9]',
             'cufft': 'libcufft.so.*[0-9]',
             'curand': 'libcurand.so.*[0-9]',
             'cusolver': 'libcusolver.so.*[0-9]',
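The effect of the pattern change can be checked with the standard `fnmatch` module, which uses the same glob syntax that the fallback search applies to library file names. The file names below are the ones shipped in the respective PyPI wheels:

```python
import fnmatch

cu11_name = "libnvrtc.so.11.2"  # CUDA-11 wheels ship MAJOR.MINOR sonames
cu12_name = "libnvrtc.so.12"    # CUDA-12 wheels ship MAJOR-only sonames

old_pattern = "libnvrtc.so.*[0-9].*[0-9]"  # requires a minor version suffix
new_pattern = "libnvrtc.so.*[0-9]"         # accepts either layout

assert fnmatch.fnmatch(cu11_name, old_pattern)      # matched in the 2.0 wheels
assert not fnmatch.fnmatch(cu12_name, old_pattern)  # the 2.1.0 failure above
assert fnmatch.fnmatch(cu11_name, new_pattern)      # relaxed pattern: both match
assert fnmatch.fnmatch(cu12_name, new_pattern)
```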

malfet added a commit that referenced this issue Sep 13, 2023
Or any other distro that has different purelib and platlib paths.
The regression was introduced when the small wheel's base dependency was migrated
from CUDA-11 to CUDA-12.

Not sure why, but the minor version of the library is no longer shipped with the
following CUDA-12 packages:
 - nvidia_cuda_nvrtc_cu12-12.1.105
 - nvidia-cuda-cupti-cu12-12.1.105

But those were present in the CUDA-11 release

Fixes #109221
@malfet
Contributor Author

malfet commented Sep 13, 2023

Looks like we need to strip `Requires-Dist: pytorch-triton (==2.1.0)` from the metadata and rebuild this wheel

This feels separate; should we file a different issue for it? Also, let's add a test for that (that the wheel has only one triton dependency)
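A minimal sketch of such a test could parse the wheel's METADATA file (an RFC 822-style format readable with the standard `email` parser) and count the triton requirements. `count_triton_requires` is a hypothetical helper for illustration, not an existing builder function:

```python
from email import message_from_string

def count_triton_requires(metadata_text: str) -> int:
    """Count Requires-Dist entries whose package name contains 'triton'."""
    msg = message_from_string(metadata_text)
    requires = msg.get_all("Requires-Dist") or []
    return sum("triton" in req.split(" ")[0] for req in requires)

# Trimmed copy of the 2.1.0+cu121 metadata shown above: triton appears twice,
# so a test asserting exactly one triton dependency would have caught this.
sample = """\
Metadata-Version: 2.1
Name: torch
Requires-Dist: filelock
Requires-Dist: pytorch-triton (==2.1.0)
Requires-Dist: triton (==2.1.0) ; platform_system == "Linux"
"""
assert count_triton_requires(sample) == 2  # should be 1 for a correct wheel
```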

atalman pushed a commit to atalman/pytorch that referenced this issue Sep 14, 2023
Or any other distro that has different purelib and platlib paths. The regression was introduced when the small wheel's base dependency was migrated from CUDA-11 to CUDA-12.

Not sure why, but the minor version of the library is no longer shipped with the following CUDA-12 packages:
 - nvidia_cuda_nvrtc_cu12-12.1.105
 - nvidia-cuda-cupti-cu12-12.1.105

But those were present in the CUDA-11 release, e.g.:
``` shell
bash-5.2# curl -OL https://files.pythonhosted.org/packages/ef/25/922c5996aada6611b79b53985af7999fc629aee1d5d001b6a22431e18fec/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl |grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.11.2   OK
bash-5.2# curl -OL https://files.pythonhosted.org/packages/b6/9f/c64c03f49d6fbc56196664d05dba14e3a561038a81a638eeb47f4d4cfd48/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl|grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.1   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.12   OK
```

Fixes pytorch#109221

Pull Request resolved: pytorch#109244
Approved by: https://github.com/huydhn
malfet pushed a commit that referenced this issue Sep 14, 2023
Or any other distro that has different purelib and platlib paths. The regression was introduced when the small wheel's base dependency was migrated from CUDA-11 to CUDA-12.

Not sure why, but the minor version of the library is no longer shipped with the following CUDA-12 packages:
 - nvidia_cuda_nvrtc_cu12-12.1.105
 - nvidia-cuda-cupti-cu12-12.1.105

But those were present in the CUDA-11 release, e.g.:
``` shell
bash-5.2# curl -OL https://files.pythonhosted.org/packages/ef/25/922c5996aada6611b79b53985af7999fc629aee1d5d001b6a22431e18fec/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl |grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.11.2   OK
bash-5.2# curl -OL https://files.pythonhosted.org/packages/b6/9f/c64c03f49d6fbc56196664d05dba14e3a561038a81a638eeb47f4d4cfd48/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl|grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.1   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.12   OK
```

Fixes #109221

This is a cherry-pick of  #109244 into release/2.1 branch