
Small wheels for 2.1.0 release candidate are not usable on AmazonLinux #109221

Closed
malfet opened this issue Sep 13, 2023 · 7 comments
Labels
high priority
module: binaries (Anything related to official binaries that we release to users)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: regression (It used to work, and now it doesn't)
triage review
Milestone

Comments

@malfet
Contributor

malfet commented Sep 13, 2023

🐛 Describe the bug

Run:

% docker run --rm --gpus all -it amazonlinux:latest  bash -c "yum install -y python3-pip; pip3 install torch --extra-index-url https://download.pytorch.org/whl/test/cu121_pypi_cudnn; python3 -c 'import torch'"
...
Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 174, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcufft.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 234, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 195, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 160, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
ValueError: libnvrtc.so.*[0-9].*[0-9] not found in the system path ['', '/usr/lib64/python39.zip', '/usr/lib64/python3.9', '/usr/lib64/python3.9/lib-dynload', '/usr/local/lib64/python3.9/site-packages', '/usr/local/lib/python3.9/site-packages', '/usr/lib64/python3.9/site-packages', '/usr/lib/python3.9/site-packages']

Please note that the above works as expected with the 2.0 release, i.e. #88869 was reintroduced on trunk and the 2.1.0 branch

Versions

2.1.0

cc @ezyang @gchanan @zou3519 @kadeng @seemethere @ptrblck

@malfet malfet added the high priority, module: binaries, module: cuda and module: regression labels Sep 13, 2023
@malfet malfet added this to the 2.1.0 milestone Sep 13, 2023
@atalman
Contributor

atalman commented Sep 13, 2023

Can you also please include the install command and what version is getting installed?

@atalman
Contributor

atalman commented Sep 13, 2023

Trying to install on Amazon Linux 2 installs 1.13.1 for some reason:

pip3 install torch --extra-index-url --extra-index-url  https://download.pytorch.org/whl/test/cu121_pypi_cudnn
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
WARNING: The index url "--extra-index-url" seems invalid, please provide a scheme.
Looking in indexes: https://pypi.org/simple, --extra-index-url
Collecting https://download.pytorch.org/whl/test/cu121_pypi_cudnn
  Downloading https://download.pytorch.org/whl/test/cu121_pypi_cudnn (1.5 kB)
  ERROR: Cannot unpack file /tmp/pip-unpack-_aiuujx8/cu121_pypi_cudnn.html (downloaded from /tmp/pip-req-build-urwtwjek, content-type: text/html); cannot detect archive format
ERROR: Cannot determine archive format of /tmp/pip-req-build-urwtwjek
WARNING: Url '--extra-index-url/pip/' is ignored. It is either a non-existing path or lacks a specific scheme.
[root@ip-10-0-56-136 tmp]# pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/test/cu121_pypi_cudnn
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/test/cu121_pypi_cudnn
Collecting torch
  Downloading torch-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (887.5 MB)
     |████████████████████████████████| 887.5 MB 5.5 kB/s
Collecting nvidia-cuda-nvrtc-cu11==11.7.99; platform_system == "Linux"
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     |████████████████████████████████| 21.0 MB 77.2 MB/s
Collecting nvidia-cudnn-cu11==8.5.0.96; platform_system == "Linux"
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     |████████████████████████████████| 557.1 MB 6.5 kB/s
Collecting typing-extensions
  Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99; platform_system == "Linux"
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     |████████████████████████████████| 849 kB 102.2 MB/s
Collecting nvidia-cublas-cu11==11.10.3.66; platform_system == "Linux"
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     |████████████████████████████████| 317.1 MB 30 kB/s
Collecting wheel
  Downloading wheel-0.41.2-py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 7.7 MB/s
Requirement already satisfied: setuptools in /usr/lib/python3.7/site-packages (from nvidia-cuda-runtime-cu11==11.7.99; platform_system == "Linux"->torch) (49.1.3)
Installing collected packages: nvidia-cuda-nvrtc-cu11, wheel, nvidia-cublas-cu11, nvidia-cudnn-cu11, typing-extensions, nvidia-cuda-runtime-cu11, torch
  WARNING: The script wheel is installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The scripts convert-caffe2-to-onnx, convert-onnx-to-caffe2 and torchrun are installed in '/usr/local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.1 typing-extensions-4.7.1 wheel-0.41.2

Manually downloading the correct wheel:

curl -O https://download.pytorch.org/whl/test/cu121_pypi_cudnn/torch-2.1.0%2Bcu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  640M  100  640M    0     0  26.8M      0  0:00:23  0:00:23 --:--:-- 27.4M
[root@ip-10-0-56-136 tmp]# ls
torch-2.1.0%2Bcu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl
[root@ip-10-0-56-136 tmp]# pip3 install torch-2.1.0%2Bcu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
ERROR: torch-2.1.0+cu121.with.pypi.cudnn-cp311-cp311-linux_x86_64.whl is not a supported wheel on this platform.
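For context on the "not a supported wheel on this platform" error (my reading, not stated explicitly in the thread): the downloaded wheel is tagged cp311, while this Amazon Linux 2 box runs CPython 3.7 (note the `/usr/lib/python3.7/site-packages` path in the pip output). The interpreter tag pip requires can be checked like this:

```python
import sys

# A wheel's 'cpXY' tag must match the running interpreter for pip to accept it.
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(py_tag)  # 'cp37' on Amazon Linux 2's default python3, which rejects a cp311 wheel
```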

Same command works on our validation environment:
https://github.com/pytorch/builder/actions/runs/6171808682/job/16751980235#step:11:2156

@atalman
Contributor

atalman commented Sep 13, 2023

Here is metadata for this wheel:

Metadata-Version: 2.1
Name: torch
Version: 2.1.0+cu121.with.pypi.cudnn
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Download-URL: https://github.com/pytorch/pytorch/tags
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Keywords: pytorch,machine learning
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: fsspec
Requires-Dist: pytorch-triton (==2.1.0)
Requires-Dist: nvidia-cuda-nvrtc-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu12 (==8.9.2.26) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu12 (==12.1.3.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu12 (==11.0.2.54) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu12 (==10.3.2.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu12 (==11.4.5.107) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu12 (==12.1.0.106) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu12 (==2.18.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu12 (==12.1.105) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.1.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Provides-Extra: dynamo
Requires-Dist: jinja2 ; extra == 'dynamo'
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'

Here is the metadata for 2.0.1:

Metadata-Version: 2.1
Name: torch
Version: 2.0.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Download-URL: https://github.com/pytorch/pytorch/tags
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Keywords: pytorch,machine learning
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: C++
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8.0
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: filelock
Requires-Dist: typing-extensions
Requires-Dist: sympy
Requires-Dist: networkx
Requires-Dist: jinja2
Requires-Dist: nvidia-cuda-nvrtc-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-runtime-cu11 (==11.7.99) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cuda-cupti-cu11 (==11.7.101) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cudnn-cu11 (==8.5.0.96) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cublas-cu11 (==11.10.3.66) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cufft-cu11 (==10.9.0.58) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-curand-cu11 (==10.2.10.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusolver-cu11 (==11.4.0.1) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-cusparse-cu11 (==11.7.4.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nccl-cu11 (==2.14.3) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: nvidia-nvtx-cu11 (==11.7.91) ; platform_system == "Linux" and platform_machine == "x86_64"
Requires-Dist: triton (==2.0.0) ; platform_system == "Linux" and platform_machine == "x86_64"
Provides-Extra: opt-einsum
Requires-Dist: opt-einsum (>=3.3) ; extra == 'opt-einsum'

@atalman
Contributor

atalman commented Sep 13, 2023

Looks like we need to strip `Requires-Dist: pytorch-triton (==2.1.0)` from the metadata and rebuild this wheel

@malfet
Contributor Author

malfet commented Sep 13, 2023

Trying to install on Amazon Linux 2 installs 1.13.1 for some reason:

pip3 install torch --extra-index-url --extra-index-url  https://download.pytorch.org/whl/test/cu121_pypi_cudnn

@atalman remove the 2nd `--extra-index-url` argument and it will install 2.1.0 for you

Same command works on our validation environment

The same command works fine on Ubuntu 22.04, because RedHat-flavored distros are the only ones that have different system and local installation folders
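The purelib/platlib split described above can be inspected with the standard `sysconfig` module; the example paths in the comments are what a RedHat-flavored install typically reports, and are an assumption rather than output captured from this reproduction:

```python
import sysconfig

paths = sysconfig.get_paths()
# Pure-Python packages (like the nvidia-*-cu12 wheels) install into 'purelib';
# packages with compiled extensions (like torch) install into 'platlib'.
print(paths["purelib"])  # e.g. /usr/local/lib/python3.9/site-packages
print(paths["platlib"])  # e.g. /usr/local/lib64/python3.9/site-packages
# On Debian/Ubuntu both keys point at the same directory, so torch finds the
# CUDA libs next to itself; on RedHat-flavored distros they differ, which is
# what triggers the fallback search that then hits the broken glob patterns.
```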

@malfet
Contributor Author

malfet commented Sep 13, 2023

Ok, the regression is due to library name changes between the nvidia-*-cu11 and nvidia-*-cu12 PyPI packages (i.e. the lack of `libcuXYZ.so.${MAJOR}.${MINOR}`)

The following diff fixes the problem:

# diff -u /usr/local/lib64/python3.9/site-packages/torch/__init__.py __init__.py 
--- /usr/local/lib64/python3.9/site-packages/torch/__init__.py	2023-09-13 19:25:52.602695425 +0000
+++ __init__.py	2023-09-13 19:25:27.010624132 +0000
@@ -174,13 +174,13 @@
         ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
     except OSError as err:
         # Can only happen for wheel with cuda libs as PYPI deps
-        # As PyTorch is not purelib, but nvidia-*-cu11 is
+        # As PyTorch is not purelib, but nvidia-*-cu12 is
         cuda_libs: Dict[str, str] = {
             'cublas': 'libcublas.so.*[0-9]',
             'cudnn': 'libcudnn.so.*[0-9]',
-            'cuda_nvrtc': 'libnvrtc.so.*[0-9].*[0-9]',
-            'cuda_runtime': 'libcudart.so.*[0-9].*[0-9]',
-            'cuda_cupti': 'libcupti.so.*[0-9].*[0-9]',
+            'cuda_nvrtc': 'libnvrtc.so.*[0-9]',
+            'cuda_runtime': 'libcudart.so.*[0-9]',
+            'cuda_cupti': 'libcupti.so.*[0-9]',
             'cufft': 'libcufft.so.*[0-9]',
             'curand': 'libcurand.so.*[0-9]',
             'cusolver': 'libcusolver.so.*[0-9]',
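The effect of the pattern change can be checked with the standard `fnmatch` module, which uses the same glob syntax that the fallback search applies to library file names. The file names below are the ones shipped in the respective PyPI wheels:

```python
import fnmatch

cu11_name = "libnvrtc.so.11.2"  # CUDA-11 wheels ship MAJOR.MINOR sonames
cu12_name = "libnvrtc.so.12"    # CUDA-12 wheels ship MAJOR-only sonames

old_pattern = "libnvrtc.so.*[0-9].*[0-9]"  # requires a minor version suffix
new_pattern = "libnvrtc.so.*[0-9]"         # accepts either layout

assert fnmatch.fnmatch(cu11_name, old_pattern)      # matched in the 2.0 wheels
assert not fnmatch.fnmatch(cu12_name, old_pattern)  # the 2.1.0 failure above
assert fnmatch.fnmatch(cu11_name, new_pattern)      # relaxed pattern: both match
assert fnmatch.fnmatch(cu12_name, new_pattern)
```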

malfet added a commit that referenced this issue Sep 13, 2023
Or any other distro that has different purelib and platlib paths.
The regression was introduced when the small wheel's base dependency was migrated
from CUDA-11 to CUDA-12.

Not sure why, but the minor version of the library is no longer shipped with the
following CUDA-12 packages:
 - nvidia_cuda_nvrtc_cu12-12.1.105
 - nvidia-cuda-cupti-cu12-12.1.105

But those were present in the CUDA-11 release

Fixes #109221
@malfet
Contributor Author

malfet commented Sep 13, 2023

Looks like we need to strip `Requires-Dist: pytorch-triton (==2.1.0)` from the metadata and rebuild this wheel

This feels separate; should we file a different issue for it? Also, let's add a test for that (that the wheel has only one triton dependency)
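A minimal sketch of such a test could parse the wheel's METADATA file (an RFC 822-style format readable with the standard `email` parser) and count the triton requirements. `count_triton_requires` is a hypothetical helper for illustration, not an existing builder function:

```python
from email import message_from_string

def count_triton_requires(metadata_text: str) -> int:
    """Count Requires-Dist entries whose package name contains 'triton'."""
    msg = message_from_string(metadata_text)
    requires = msg.get_all("Requires-Dist") or []
    return sum("triton" in req.split(" ")[0] for req in requires)

# Trimmed copy of the 2.1.0+cu121 metadata shown above: triton appears twice,
# so a test asserting exactly one triton dependency would have caught this.
sample = """\
Metadata-Version: 2.1
Name: torch
Requires-Dist: filelock
Requires-Dist: pytorch-triton (==2.1.0)
Requires-Dist: triton (==2.1.0) ; platform_system == "Linux"
"""
assert count_triton_requires(sample) == 2  # should be 1 for a correct wheel
```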

atalman pushed a commit to atalman/pytorch that referenced this issue Sep 14, 2023
Or any other distro that has different purelib and platlib paths. The regression was introduced when the small wheel's base dependency was migrated from CUDA-11 to CUDA-12.

Not sure why, but the minor version of the library is no longer shipped with the following CUDA-12 packages:
 - nvidia_cuda_nvrtc_cu12-12.1.105
 - nvidia-cuda-cupti-cu12-12.1.105

But those were present in the CUDA-11 release, e.g.:
``` shell
bash-5.2# curl -OL https://files.pythonhosted.org/packages/ef/25/922c5996aada6611b79b53985af7999fc629aee1d5d001b6a22431e18fec/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl |grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.11.2   OK
bash-5.2# curl -OL https://files.pythonhosted.org/packages/b6/9f/c64c03f49d6fbc56196664d05dba14e3a561038a81a638eeb47f4d4cfd48/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl|grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.1   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.12   OK
```

Fixes pytorch#109221

Pull Request resolved: pytorch#109244
Approved by: https://github.com/huydhn
malfet pushed a commit that referenced this issue Sep 14, 2023
Or any other distro that has different purelib and platlib paths. The regression was introduced when the small wheel's base dependency was migrated from CUDA-11 to CUDA-12.

Not sure why, but the minor version of the library is no longer shipped with the following CUDA-12 packages:
 - nvidia_cuda_nvrtc_cu12-12.1.105
 - nvidia-cuda-cupti-cu12-12.1.105

But those were present in the CUDA-11 release, e.g.:
``` shell
bash-5.2# curl -OL https://files.pythonhosted.org/packages/ef/25/922c5996aada6611b79b53985af7999fc629aee1d5d001b6a22431e18fec/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl |grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.11.7   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.11.2   OK
bash-5.2# curl -OL https://files.pythonhosted.org/packages/b6/9f/c64c03f49d6fbc56196664d05dba14e3a561038a81a638eeb47f4d4cfd48/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl; unzip -t nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl|grep \.so
    testing: nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.1   OK
    testing: nvidia/cuda_nvrtc/lib/libnvrtc.so.12   OK
```

Fixes #109221

This is a cherry-pick of  #109244 into release/2.1 branch