On CPU-only machine received OSError from importing: libcublas.so.11: cannot open shared object file #88869

Closed
weiliw-amz opened this issue Nov 11, 2022 · 8 comments
Labels: high priority, module: binaries, module: regression, triage review
Milestone: 1.13.1

weiliw-amz commented Nov 11, 2022

🐛 Describe the bug

On a CPU-only machine, in the latest Amazon Linux 2 Docker image, I installed the latest PyTorch (1.13) via pip, imported it, and received the error below.
However, in my view, PyTorch should not require GPU-related components on a CPU-only machine where no CUDA or NVIDIA packages are installed.

Steps to reproduce:

docker pull amazonlinux:2
docker run --rm -it amazonlinux:2 /bin/bash

bash-4.2# yum install -y python3 python3-devel python3-distutils
bash-4.2# python3 -m pip install torch
bash-4.2# python3 -c "import torch"

Then the following error is raised:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 191, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 153, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.7/ctypes/__init__.py", line 359, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory
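A quick diagnostic (not from the original report) is to ask the dynamic loader whether it can resolve libcublas at all. pip installs the NVIDIA wheels under site-packages, which the loader does not search by default, so on the affected machine this typically prints None:

```python
import ctypes.util

# The dynamic loader only searches its standard paths (plus LD_LIBRARY_PATH);
# the pip-installed NVIDIA wheels live under site-packages, which is not on
# that search path, so this usually returns None on the affected box.
print(ctypes.util.find_library("cublas"))
```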

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.2.5

Python version: 3.7.10 (default, Jun 3 2021, 00:02:01) [GCC 7.3.1 20180712 (Red Hat 7.3.1-13)] (64-bit runtime)
Python platform: Linux-5.4.214-134.408.amzn2int.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

Versions of relevant libraries:
[pip3] torch==1.13.0
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @seemethere @malfet

@malfet added the high priority, module: binaries, and module: regression labels Nov 11, 2022

malfet commented Nov 11, 2022

[Edit] Hmm, I can reproduce this even though nvidia-cublas-cu11 is installed:

bash-4.2# python3 -m pip install torch
WARNING: Running pip install with root privileges is generally not a good idea. Try `python3 -m pip install --user` instead.
Collecting torch
  Downloading torch-1.13.0-cp37-cp37m-manylinux1_x86_64.whl (890.2 MB)
     |████████████████████████████████| 890.2 MB 4.2 kB/s 
Collecting typing-extensions
  Downloading typing_extensions-4.4.0-py3-none-any.whl (26 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     |████████████████████████████████| 849 kB 107.1 MB/s 
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     |████████████████████████████████| 317.1 MB 14 kB/s 
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     |████████████████████████████████| 21.0 MB 105.8 MB/s 
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     |████████████████████████████████| 557.1 MB 4.9 kB/s 
Requirement already satisfied: setuptools in /usr/lib/python3.7/site-packages (from nvidia-cuda-runtime-cu11==11.7.99->torch) (49.1.3)
Collecting wheel
  Downloading wheel-0.38.4-py3-none-any.whl (36 kB)
Installing collected packages: typing-extensions, wheel, nvidia-cuda-runtime-cu11, nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cudnn-cu11, torch
Successfully installed nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 torch-1.13.0 typing-extensions-4.4.0 wheel-0.38.4
bash-4.2# python3 -c "import torch"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 191, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.7/site-packages/torch/__init__.py", line 153, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.7/ctypes/__init__.py", line 359, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

And the reason for that is:

bash-4.2# ls /usr/local/lib/python3.7/site-packages/
__pycache__  nvidia_cublas_cu11-11.10.3.66.dist-info   nvidia_cuda_runtime_cu11-11.7.99.dist-info  typing_extensions-4.4.0.dist-info  wheel
nvidia	     nvidia_cuda_nvrtc_cu11-11.7.99.dist-info  nvidia_cudnn_cu11-8.5.0.96.dist-info	   typing_extensions.py		      wheel-0.38.4.dist-info
bash-4.2# ls /usr/local/lib64/python3.7/site-packages/
functorch  torch  torch-1.13.0.dist-info  torchgen

cc: @syed-ahmed @ptrblck

@malfet added this to the 1.13.1 milestone Nov 11, 2022
@weiliw-amz changed the title from "On CPU only machine received OSError from importing: libcublas.so.11: cannot open shared object file on CPU-only machine" to "On CPU-only machine received OSError from importing: libcublas.so.11: cannot open shared object file" Nov 11, 2022

malfet commented Nov 11, 2022

@weiliw-amz but to unblock yourself, please consider installing the CPU-only version of PyTorch by running `python3 -m pip install torch --extra-index-url https://download.pytorch.org/whl/cpu`

seemethere commented

Interestingly enough, I can reproduce this with the amazonlinux:2 image but can't reproduce it with python:3.7.

I wonder what makes the amazonlinux:2 image so different in this case.


malfet commented Nov 11, 2022

@seemethere I guess amazonlinux:2 is sort of a multiarch distro (i.e. supporting 32-bit and 64-bit applications)

On amazonlinux:2:

python3 -c 'import sysconfig; print([sysconfig.get_paths()[i] for i in ["purelib", "platlib"]])'
['/usr/lib/python3.7/site-packages', '/usr/lib64/python3.7/site-packages']

On python:3.7:

['/usr/local/lib/python3.7/site-packages', '/usr/local/lib/python3.7/site-packages']

And if one looks into the WHEEL metadata, all the NVIDIA packages have Root-Is-Purelib set to true, while PyTorch sets it to false.
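The flag can be read from Python via the installed dist-info metadata (a small sketch, not from the thread; run it against `torch` or `nvidia-cublas-cu11` on the affected box, `pip` is used below only as a stand-in distribution):

```python
from importlib import metadata

def root_is_purelib(dist_name):
    """Read the Root-Is-Purelib flag from a distribution's WHEEL metadata."""
    wheel_text = metadata.distribution(dist_name).read_text("WHEEL") or ""
    for line in wheel_text.splitlines():
        if line.lower().startswith("root-is-purelib"):
            return line.split(":", 1)[1].strip().lower() == "true"
    return None  # no WHEEL file, e.g. the package was not installed from a wheel

# Per the comment above: torch reports False, the nvidia-* wheels report True.
print(root_is_purelib("pip"))
```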


atalman commented Nov 14, 2022

@syed-ahmed @ptrblck Any recommendations on how to deal with this issue?


malfet commented Nov 14, 2022

IMO RPATH would not work here, as it's not a relative path but a different absolute one, and it can vary widely if users choose to use a venv. To solve this problem we should just set LD_LIBRARY_PATH to sysconfig.get_paths()['purelib'] / 'nv-xyz' / 'lib' in torch/__init__.py, similar to how it is done in ./backends/cudnn/__init__.py

Or, maybe we should change the wheels to be non-purelib (that way PyTorch and its dependencies will always be installed in the same folder).


atalman commented Dec 7, 2022

@malfet according to the definition in https://peps.python.org/pep-0427/#what-s-the-deal-with-purelib-vs-platlib

Wheel preserves the “purelib” vs. “platlib” distinction, which is significant on some platforms. For example, Fedora installs pure Python packages to ‘/usr/lib/pythonX.Y/site-packages’ and platform dependent packages to ‘/usr/lib64/pythonX.Y/site-packages’.

From what I see, PyTorch Linux wheels should be supported on any Linux platform, and so are the cudnn wheels: https://pypi.org/project/nvidia-cudnn-cu11/

Should we just align on the same Root-Is-Purelib flag for both packages?
Can we set Root-Is-Purelib to true for the PyTorch wheels?


malfet commented Dec 7, 2022

It could be done, but I think it will take us a while. Something like the following would fix the problem and seems like a much less intrusive change than modifying the package location:

    try:
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    except OSError as e:
        ctypes.CDLL("/usr/local/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11")
        ctypes.CDLL("/usr/local/lib/python3.7/site-packages/nvidia/cudnn/lib/libcudnn.so.8")
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)

malfet added a commit that referenced this issue Dec 7, 2022
atalman pushed a commit to atalman/pytorch that referenced this issue Dec 7, 2022
If PyTorch is packaged into a wheel together with [nvidia-cublas-cu11](https://pypi.org/project/nvidia-cublas-cu11/), which is designated as purelib while the `torch` wheel is not, a torch_globals loading problem can occur.

Fix that by searching for `nvidia/cublas/lib/libcublas.so.11` and `nvidia/cudnn/lib/libcudnn.so.8` across all `sys.path` folders.

Test plan:
```
docker pull amazonlinux:2
docker run --rm -t amazonlinux:2 bash -c 'yum install -y python3 python3-devel python3-distutils patch;python3 -m pip install torch==1.13.0;curl -OL https://patch-diff.githubusercontent.com/raw/pytorch/pytorch/pull/90411.diff; pushd /usr/local/lib64/python3.7/site-packages; patch -p1 </90411.diff; popd; python3 -c "import torch;print(torch.__version__, torch.cuda.is_available())"'
```

Fixes pytorch#88869

Pull Request resolved: pytorch#90411
Approved by: https://github.com/atalman
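The approach from the commit message above can be sketched roughly as follows (a simplified illustration, not the actual patch from #90411; the function name is made up):

```python
import ctypes
import glob
import os
import sys

def preload_cuda_deps(lib_relpaths=("nvidia/cublas/lib/libcublas.so.11",
                                    "nvidia/cudnn/lib/libcudnn.so.8")):
    """Search every sys.path entry for the pip-installed NVIDIA libraries and
    preload them with RTLD_GLOBAL so torch's own libraries can resolve them,
    regardless of whether the wheels landed in purelib or platlib."""
    for relpath in lib_relpaths:
        for folder in sys.path:
            hits = glob.glob(os.path.join(folder, relpath))
            if hits:
                ctypes.CDLL(hits[0], mode=ctypes.RTLD_GLOBAL)
                break  # first match on sys.path wins

preload_cuda_deps()  # a no-op on machines without the NVIDIA wheels installed
```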
atalman added a commit that referenced this issue Dec 8, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this issue Dec 10, 2022