On CPU-only machine received OSError from importing: libcublas.so.11: cannot open shared object file #88869

Closed
weiliw-amz opened this issue Nov 11, 2022 · 8 comments
Labels: high priority, module: binaries, module: regression, triage review
Milestone: 1.13.1

weiliw-amz commented Nov 11, 2022

🐛 Describe the bug

On a CPU-only machine, in the latest Amazon Linux 2 Docker image, I installed the latest PyTorch (1.13) via pip, imported it, and received the error below.
However, in my view, PyTorch should not require GPU-related components on a CPU-only machine where no CUDA or NVIDIA packages are installed.

Steps to reproduce:

docker pull amazonlinux:2
docker run --rm -it amazonlinux:2 /bin/bash

bash-4.2# yum install -y python3 python3-devel python3-distutils
bash-4.2# python3 -m pip install torch
bash-4.2# python3 -c "import torch"

Then the following error is raised:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 191, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 153, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.7/ctypes/__init__.py", line 359, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory
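A quick diagnostic (not from the original report) is to ask the dynamic loader whether it can resolve libcublas at all. pip installs the NVIDIA wheels under site-packages, which the loader does not search by default, so on the affected machine this typically prints None:

```python
import ctypes.util

# The dynamic loader only searches its standard paths (plus LD_LIBRARY_PATH);
# the pip-installed NVIDIA wheels live under site-packages, which is not on
# that search path, so this usually returns None on the affected box.
print(ctypes.util.find_library("cublas"))
```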

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.2.5

Python version: 3.7.10 (default, Jun 3 2021, 00:02:01) [GCC 7.3.1 20180712 (Red Hat 7.3.1-13)] (64-bit runtime)
Python platform: Linux-5.4.214-134.408.amzn2int.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

Versions of relevant libraries:
[pip3] torch==1.13.0
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @seemethere @malfet

@malfet added the high priority, module: binaries, and module: regression labels Nov 11, 2022

malfet commented Nov 11, 2022

[Edit] Hmm, I can reproduce this even though nvidia-cublas-cu11 is installed:

bash-4.2# python3 -m pip install torch
WARNING: Running pip install with root privileges is generally not a good idea. Try `python3 -m pip install --user` instead.
Collecting torch
  Downloading torch-1.13.0-cp37-cp37m-manylinux1_x86_64.whl (890.2 MB)
     |████████████████████████████████| 890.2 MB 4.2 kB/s 
Collecting typing-extensions
  Downloading typing_extensions-4.4.0-py3-none-any.whl (26 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     |████████████████████████████████| 849 kB 107.1 MB/s 
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     |████████████████████████████████| 317.1 MB 14 kB/s 
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     |████████████████████████████████| 21.0 MB 105.8 MB/s 
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     |████████████████████████████████| 557.1 MB 4.9 kB/s 
Requirement already satisfied: setuptools in /usr/lib/python3.7/site-packages (from nvidia-cuda-runtime-cu11==11.7.99->torch) (49.1.3)
Collecting wheel
  Downloading wheel-0.38.4-py3-none-any.whl (36 kB)
Installing collected packages: typing-extensions, wheel, nvidia-cuda-runtime-cu11, nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cudnn-cu11, torch
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.0 typing-extensions-4.4.0 wheel-0.38.4
bash-4.2# python3 -c "import torch"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 191, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 153, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.7/ctypes/__init__.py", line 359, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

And the reason for that is:

bash-4.2# ls /usr/local/lib/python3.7/site-packages/
__pycache__  nvidia_cublas_cu11-11.10.3.66.dist-info   nvidia_cuda_runtime_cu11-11.7.99.dist-info  typing_extensions-4.4.0.dist-info  wheel
nvidia	     nvidia_cuda_nvrtc_cu11-11.7.99.dist-info  nvidia_cudnn_cu11-8.5.0.96.dist-info	   typing_extensions.py		      wheel-0.38.4.dist-info
bash-4.2# ls /usr/local/lib64/python3.7/site-packages/
functorch  torch  torch-1.13.0.dist-info  torchgen

cc: @syed-ahmed @ptrblck

@malfet added this to the 1.13.1 milestone Nov 11, 2022
@weiliw-amz changed the title from "On CPU only machine received OSError from importing: libcublas.so.11: cannot open shared object file on CPU-only machine" to "On CPU-only machine received OSError from importing: libcublas.so.11: cannot open shared object file" Nov 11, 2022

malfet commented Nov 11, 2022

@weiliw-amz but to unblock yourself, please consider installing the CPU-only version of PyTorch by running `python3 -m pip install torch --extra-index-url https://download.pytorch.org/whl/cpu`

seemethere commented

Interestingly enough, I can reproduce this with the amazonlinux:2 image but can't reproduce it with python:3.7.

I wonder what makes the amazonlinux:2 image so different in this case.


malfet commented Nov 11, 2022

@seemethere I guess amazonlinux:2 is sort of a multiarch distro (i.e. supporting 32-bit and 64-bit applications)

On amazonlinux:2:

python3 -c 'import sysconfig; print([sysconfig.get_paths()[i] for i in ["purelib", "platlib"]])'
['/usr/lib/python3.7/site-packages', '/usr/lib64/python3.7/site-packages']

On python:3.7:

['/usr/local/lib/python3.7/site-packages', '/usr/local/lib/python3.7/site-packages']

And if one looks into the WHEEL metadata, all the NVIDIA packages have Root-Is-Purelib set to true, while PyTorch sets it to false.
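The flag can be read from Python via the installed dist-info metadata (a small sketch, not from the thread; run it against `torch` or `nvidia-cublas-cu11` on the affected box, `pip` is used below only as a stand-in distribution):

```python
from importlib import metadata

def root_is_purelib(dist_name):
    """Read the Root-Is-Purelib flag from a distribution's WHEEL metadata."""
    wheel_text = metadata.distribution(dist_name).read_text("WHEEL") or ""
    for line in wheel_text.splitlines():
        if line.lower().startswith("root-is-purelib"):
            return line.split(":", 1)[1].strip().lower() == "true"
    return None  # no WHEEL file, e.g. the package was not installed from a wheel

# Per the comment above: torch reports False, the nvidia-* wheels report True.
print(root_is_purelib("pip"))
```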


atalman commented Nov 14, 2022

@syed-ahmed @ptrblck Any recommendations on how to deal with this issue?


malfet commented Nov 14, 2022

IMO RPATH would not work here, as it's not a relative path but a different absolute one, and it can vary widely if users choose to use a venv. To solve this problem we should just set LD_LIBRARY_PATH to sysconfig.get_paths()['purelib'] / 'nv-xyz' / 'lib' in torch/__init__.py, similar to how it is done in ./backends/cudnn/__init__.py

Or, maybe we should change the wheels to be non-purelib (that way PyTorch and its dependencies will always be installed in the same folder).


atalman commented Dec 7, 2022

@malfet according to the definition in https://peps.python.org/pep-0427/#what-s-the-deal-with-purelib-vs-platlib

Wheel preserves the “purelib” vs. “platlib” distinction, which is significant on some platforms. For example, Fedora installs pure Python packages to ‘/usr/lib/pythonX.Y/site-packages’ and platform dependent packages to ‘/usr/lib64/pythonX.Y/site-packages’.

From what I see, PyTorch Linux wheels should be supported on any Linux platform, and so are the cudnn wheels: https://pypi.org/project/nvidia-cudnn-cu11/

Should we just align on the same Root-Is-Purelib flag for both packages?
Can we set Root-Is-Purelib to true for the PyTorch wheels?


malfet commented Dec 7, 2022

It could be done, but I think it will take us a while. Something like the following would fix the problem and seems like a much less intrusive change than modifying the package location:

    try:
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    except OSError as e:
        ctypes.CDLL("/usr/local/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11")
        ctypes.CDLL("/usr/local/lib/python3.7/site-packages/nvidia/cudnn/lib/libcudnn.so.8")
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)

malfet added a commit that referenced this issue Dec 7, 2022
atalman pushed a commit to atalman/pytorch that referenced this issue Dec 7, 2022
If PyTorch is packaged into a wheel together with [nvidia-cublas-cu11](https://pypi.org/project/nvidia-cublas-cu11/), which is designated as purelib while the `torch` wheel is not, a torch_globals loading problem can occur.

Fix that by searching for `nvidia/cublas/lib/libcublas.so.11` and `nvidia/cudnn/lib/libcudnn.so.8` across all `sys.path` folders.

Test plan:
```
docker pull amazonlinux:2
docker run --rm -t amazonlinux:2 bash -c 'yum install -y python3 python3-devel python3-distutils patch;python3 -m pip install torch==1.13.0;curl -OL https://patch-diff.githubusercontent.com/raw/pytorch/pytorch/pull/90411.diff; pushd /usr/local/lib64/python3.7/site-packages; patch -p1 </90411.diff; popd; python3 -c "import torch;print(torch.__version__, torch.cuda.is_available())"'
```

Fixes pytorch#88869

Pull Request resolved: pytorch#90411
Approved by: https://github.com/atalman
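The approach from the commit message above can be sketched roughly as follows (a simplified illustration, not the actual patch from #90411; the function name is made up):

```python
import ctypes
import glob
import os
import sys

def preload_cuda_deps(lib_relpaths=("nvidia/cublas/lib/libcublas.so.11",
                                    "nvidia/cudnn/lib/libcudnn.so.8")):
    """Search every sys.path entry for the pip-installed NVIDIA libraries and
    preload them with RTLD_GLOBAL so torch's own libraries can resolve them,
    regardless of whether the wheels landed in purelib or platlib."""
    for relpath in lib_relpaths:
        for folder in sys.path:
            hits = glob.glob(os.path.join(folder, relpath))
            if hits:
                ctypes.CDLL(hits[0], mode=ctypes.RTLD_GLOBAL)
                break  # first match on sys.path wins

preload_cuda_deps()  # a no-op on machines without the NVIDIA wheels installed
```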
atalman added a commit that referenced this issue Dec 8, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this issue Dec 10, 2022