Cannot install PyTorch 1.13.x with PDM #1732

Closed
yukw777 opened this issue Feb 22, 2023 · 19 comments · Fixed by #2425
Labels
🐛 bug Something isn't working
Comments

@yukw777

yukw777 commented Feb 22, 2023

  • I have searched the issue tracker and believe that this is not a duplicate.


Steps to reproduce

  1. Install PyTorch 1.13.x by running pdm add torch (1.13.1 is the latest version currently).
  2. Try to import PyTorch: python -c 'import torch'.

Expected behavior

PyTorch should be imported without any errors.

Actual behavior

❯ python -c 'import torch'
Traceback (most recent call last):
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: .../.venv/lib/python3.10/site-packages/nvidia/cublas/lib/libcublas.so.11: cannot open shared object file: No such file or directory

Environment Information

PDM version:
  2.4.6
Python Interpreter:
  .../.venv/bin/python (3.10)
Project Root:
  ...
Project Packages:
  None
{
  "implementation_name": "cpython",
  "implementation_version": "3.10.10",
  "os_name": "posix",
  "platform_machine": "x86_64",
  "platform_release": "5.4.0-121-generic",
  "platform_system": "Linux",
  "platform_version": "#137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022",
  "python_full_version": "3.10.10",
  "platform_python_implementation": "CPython",
  "python_version": "3.10",
  "sys_platform": "linux"
}

I think this is related to the fact that PyTorch 1.13.x introduced a new set of CUDA dependencies (pytorch/pytorch#85097). Poetry had issues because of this (pytorch/pytorch#88049), but those have since been resolved; PDM's haven't. My guess is that PDM installs the CUDA dependencies separately from PyTorch, so the PyTorch installation doesn't know about them. It's a bummer, because I wanted to give PDM a spin for a new project; for now I'm going to have to stick with Poetry. :/

@yukw777 yukw777 added the 🐛 bug Something isn't working label Feb 22, 2023
@xiaojinwhu

If you use CUDA, add the following to pyproject.toml:

[[tool.pdm.source]]
url = "https://download.pytorch.org/whl/cu116"
verify_ssl = true
name = "torch"

@yukw777
Author

yukw777 commented Feb 24, 2023

(Screenshot omitted)

@xiaojinwhu If you use CUDA 11.7, you actually don't need to add an extra index, as you can see above. That's the problem: it should work without adding that extra index, and it does with pip and Poetry.

@frostming
Collaborator

frostming commented Mar 1, 2023

I am working on a Mac M1, and torch 1.13.1 installs successfully there, without CUDA, so I am afraid I am not able to reproduce this. You could try investigating yourself, or perhaps someone else can help. For example, try to find out why the .so files are missing here but not with other installers (such as pip), and what the differences in the installed files are.

@frostming frostming added ❓ help wanted Extra attention is needed 🤔 not enough info Requires more information to clarify the issue labels Mar 1, 2023
@michaelze

I'm having a similar (probably even the same) problem, and I suspect the install.cache setting is the culprit here (I assume @yukw777 also has it set to true).

I discovered the following issue with the nvidia libraries (nvidia_cublas_cu11, nvidia_cuda_nvrtc_cu11, etc.):

With install.cache turned off, the directory structure is as follows:

nvidia
├ __init__.py
├ cublas
├ cuda_nvrtc
├ cuda_runtime
└ cudnn
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

As soon as you activate install.cache, the directory structure changes:

nvidia -> /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

The content of /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia is obviously only

__init__.py
cudnn

I hope this issue can be fixed somehow (I don't know how standards-compliant it is for several packages to install into a common package folder), because the nvidia packages are the primary reason I activated install.cache in the first place.

@frostming
Collaborator

frostming commented Mar 3, 2023

@michaelze Thanks for the investigation, but the wheel /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia only contains cudnn itself:
https://pypi-browser.org/package/nvidia-cudnn-cu11/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl

So there might be some other packages that install cuda_* folders into the nvidia package, which can't work with the cache mechanism, where the cache key is the wheel name, as you can see.


Ah, yes, you list the packages below. The problem is that although they share the namespace nvidia, they don't use PEP 420-style implicit namespace packages. When creating symlinks, PDM treats them as different packages and won't create symlinks recursively.


Try setting pdm config install.cache_method pth to see if another method works.

@michaelze

michaelze commented Mar 6, 2023

I tested your suggestion but the problem still persists.

Looking at the PyTorch source code (https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L144) reveals the underlying problem:

  • the code searches for the nvidia folder in every element of sys.path
  • if it is found, the code constructs paths to the required libraries (libcublas and libcudnn) and checks for their existence
  • that existence check is never true for both libraries, only ever for one
  • in the end, the code tries to load the two libraries from the current candidate paths, which obviously fails

So the problem here is, I think, twofold:

  1. Nvidia distributes several packages that install into the same subfolder.
     This makes creating a symlink at the top level impossible; symlinking would have to treat the nvidia packages in a special way (create the parent folder, then create symlinks for the subfolders).
  2. PyTorch uses knowledge of that directory structure directly to load the libraries.
     This also rules out install.cache_method pth. If PyTorch resolved the path to each library per sys.path entry while iterating, it could work...

From looking at the code, PyTorch 2.0.0 might actually work with PDM and install.cache_method pth as the code that loads the cuda libraries iterates all elements of sys.path and looks for the nvidia subfolder and the library in each element individually.

@frostming
Collaborator

  1. Nvidia distributing several packages that install into the same subfolder
    This makes creating a symlink at the top level impossible, symlinking would have to treat the nvidia packages in a special way (create the parent folder, create symlinks for the subfolders)

That special treatment does exist, but for PEP 420 namespace packages (packages without __init__.py), not for one special package, and it isn't going to be added either. The best way to fix this is to remove nvidia/__init__.py from the nvidia distributions.
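The difference can be demonstrated with a self-contained script (demo_ns, alpha, and beta are made-up names for the demo): a regular package with an __init__.py is bound to a single sys.path entry, so sibling installs are shadowed, while a PEP 420 implicit namespace package merges every entry.

```python
import importlib
import os
import sys
import tempfile

def make_tree(root, subpkg, with_init):
    """Create root/demo_ns/<subpkg>/__init__.py; optionally also a blank
    top-level demo_ns/__init__.py (like nvidia/__init__.py)."""
    pkg = os.path.join(root, "demo_ns", subpkg)
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    if with_init:
        open(os.path.join(root, "demo_ns", "__init__.py"), "w").close()

def visible_subpackages(with_init):
    """Install demo_ns.alpha and demo_ns.beta in two separate sys.path
    entries and report which subpackages are importable."""
    a, b = tempfile.mkdtemp(), tempfile.mkdtemp()
    make_tree(a, "alpha", with_init)
    make_tree(b, "beta", with_init)
    sys.path[:0] = [a, b]
    importlib.invalidate_caches()
    found = []
    try:
        for sub in ("alpha", "beta"):
            try:
                importlib.import_module(f"demo_ns.{sub}")
                found.append(sub)
            except ImportError:
                pass
    finally:
        sys.path.remove(a)
        sys.path.remove(b)
        for mod in [m for m in list(sys.modules) if m.split(".")[0] == "demo_ns"]:
            del sys.modules[mod]
    return found
```

With the blank __init__.py, only the subpackage from the first sys.path entry is importable; without it, both merge into one namespace, which is why removing nvidia/__init__.py would fix the layout.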

@JesseFarebro

@michaelze It seems to work for me with pdm config install.cache_method pth but symlink fails as mentioned above.

@ocss884
Copy link

ocss884 commented Aug 4, 2023

If you install torch from PyPI, the full version name is 1.13.x+cu117, and the following CUDA dependencies are shipped together with the torch installation via PyPI (those under the nvidia folder):

  • cublas
  • cuda_nvrtc
  • cuda_runtime
  • cudnn

See this function https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L163.
When importing torch==1.13.x (when >=2.0.0, the loading mechanism is different), the logic is:

  1. The program first loads dependencies via libtorch_global_deps.so; the following CUDA dependencies are checked during this procedure: libcublas.so.11, libcudnn.so.8 and libnvToolsExt.so.1.
  2. If an OSError occurs, it checks whether the error was caused by a missing libcublas.so.11; if so, it searches for libcublas and libcudnn by walking sys.path, otherwise it re-raises the error.

If you have a local CUDA toolkit 11.7 installation (it may also work for 11.8, as long as libcublas.so.11 can be found) and have configured LD_LIBRARY_PATH correctly, all of these CUDA dependencies can in fact be found except cudnn. So regardless of your install.cache_method setting, the first three files can always be found. But, because of the logic above, when the OSError is caused by a missing cudnn while libcublas is present, torch will not search for cudnn.

If you don't have a local toolkit installation, libcublas is also missing in the first place, so the pth config can help the program find all the CUDA dependencies.

It seems that the torch wheel from the PyTorch website always ships the necessary cudnn.so.* files under the torch/lib directory, which don't exist when downloading from PyPI. If you have a local CUDA installation, try downloading from their website, e.g.:

pdm add https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl

The disadvantage is that torch cannot be cached this way; but if you are using torch from PyPI without a local toolkit installation, I'm not sure whether all of torch's functionality can be used anyway.

@Ttayu

Ttayu commented Oct 20, 2023

It doesn't work with the latest PDM or PyTorch.
If the root problem is really on Nvidia's side, PyTorch users would still be happier with some kind of workaround.

For now I'm compelled to run a script like the following, copying libraries directly into the cache:

cp -r /home/user/.cache/pdm/packages/nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64/lib/nvidia/nccl /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64/lib/nvidia/nvtx /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64/lib/nvidia/cufft /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/

...

This works, but users shouldn't have to do it themselves.

For example, would it be possible to add a workaround that copies only the nvidia libraries (explicitly named, e.g. in pdm.toml) directly instead of symlinking them via cache_method?

By the way, in my environment, pdm config install.cache_method pth did not work.

@frostming
Collaborator

Can anyone in this thread check whether the issue still exists on the latest PDM? Much appreciated.

@Ttayu

Ttayu commented Nov 23, 2023

Yes, this occurs even on the latest PDM (2.10.3).

@frostming
Collaborator

frostming commented Nov 23, 2023

Fine, I'll paste the relevant code comment to give more insight into why it happens:

def _create_symlinks_recursively(source: str, destination: str) -> Iterable[str]:
    """Create symlinks recursively from source to destination. In the following ways:

    package <-- link
        __init__.py

    namespace_package <-- mkdir
        foo.py <-- link
        bar.py <-- link
    """

PDM only looks at the children if the parent directory is a namespace package, and PDM detects a namespace package based on these rules:

_namespace_package_lines = frozenset(
    [
        # pkg_resources style
        "__import__('pkg_resources').declare_namespace(__name__)",
        "pkg_resources.declare_namespace(__name__)",
        "declare_namespace(__name__)",
        # pkgutil style
        "__path__ = __import__('pkgutil').extend_path(__path__, __name__)",
        "__path__ = pkgutil.extend_path(__path__, __name__)",
        "__path__ = extend_path(__path__, __name__)",
    ]
)

So if a package breaks that assumption, PDM doesn't know how to create symlinks properly. I don't think it's something PDM can fix; you'd need to disable install.cache for such packages.

@Ttayu

Ttayu commented Nov 23, 2023

Yes, I understand that PDM is NOT the main cause.
However, as a PyTorch user: would it be possible to download or copy torch and nvidia directly into __pypackages__ (cp -r ~/.cache/pdm/packages/somelib/somelib) while placing the other libraries under __pypackages__ via symlinks? Is that difficult to realize? (If it's not worth the effort, feel free to close.)

With a symlinked torch, _C.cpython-3x-x86_64-linux-gnu.so under __pypackages__/3.x/lib/torch cannot find the .so libraries it depends on in __pypackages__/3.x/lib/nvidia; with a symlinked nvidia, there is the folder-structure problem. Both problems are caused by symlinking.

@frostming
Collaborator

frostming commented Nov 23, 2023

The main cause is that nvidia is a normal package with a blank __init__.py, in which case PDM creates a single symlink for the whole directory. Maybe we can implement a different link strategy that forces PDM to create a symlink for each individual file.

@Ttayu

Ttayu commented Nov 23, 2023

The PyTorch-side problem is clearly a separate issue.

It occurs whenever lib/torch is a symlink, regardless of whether lib/nvidia is a real directory or a symlink. It might be better to open a separate issue for it.

The workaround is to copy everything (i.e. not use the cache method), but I would like to keep taking advantage of the wonderful feature of linking from the cache.

@frostming frostming removed the 🤔 not enough info Requires more information to clarify the issue label Nov 23, 2023
@frostming frostming added this to the 2.11.0 milestone Nov 23, 2023
@frostming frostming linked a pull request Nov 23, 2023 that will close this issue
@frostming frostming removed the ❓ help wanted Extra attention is needed label Nov 29, 2023
@ae9is

ae9is commented Jan 11, 2024

For anyone coming here off search engines... I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

pdm config --local install.cache_method symlink_individual

@fancyerii

fancyerii commented Feb 4, 2024

For anyone coming here off search engines... I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

pdm config --local install.cache_method symlink_individual

This still doesn't work with PyTorch 2.2.0 and the latest PDM. I tried symlink_individual, hardlink, and pth (I can't find pth in the documentation; maybe it was removed in a newer version of PDM?) and none of them worked.

@fancyerii

fancyerii commented Feb 4, 2024

The issue still exists with PyTorch 2.2 and PDM 2.12.3; see #2614.
