Cannot install PyTorch 1.13.x with PDM #1732

Closed
yukw777 opened this issue Feb 22, 2023 · 19 comments · Fixed by #2425
Labels
🐛 bug Something isn't working
Comments

@yukw777

yukw777 commented Feb 22, 2023

  • I have searched the issue tracker and believe that this is not a duplicate.


Steps to reproduce

  1. Install PyTorch 1.13.x by running pdm add torch (1.13.1 is the latest version currently).
  2. Try to import PyTorch: python -c 'import torch'.

Expected behavior

PyTorch should be imported without any errors.

Actual behavior

❯ python -c 'import torch'
Traceback (most recent call last):
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 172, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.11: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 217, in <module>
    _load_global_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 178, in _load_global_deps
    _preload_cuda_deps()
  File ".../.venv/lib/python3.10/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
    ctypes.CDLL(cublas_path)
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: .../.venv/lib/python3.10/site-packages/nvidia/cublas/lib/libcublas.so.11: cannot open shared object file: No such file or directory

Environment Information

PDM version:
  2.4.6
Python Interpreter:
  .../.venv/bin/python (3.10)
Project Root:
  ...
Project Packages:
  None
{
  "implementation_name": "cpython",
  "implementation_version": "3.10.10",
  "os_name": "posix",
  "platform_machine": "x86_64",
  "platform_release": "5.4.0-121-generic",
  "platform_system": "Linux",
  "platform_version": "#137-Ubuntu SMP Wed Jun 15 13:33:07 UTC 2022",
  "python_full_version": "3.10.10",
  "platform_python_implementation": "CPython",
  "python_version": "3.10",
  "sys_platform": "linux"
}

I think this is related to the fact that PyTorch 1.13.x introduced a new set of CUDA dependencies (pytorch/pytorch#85097). Poetry had issues because of this (pytorch/pytorch#88049), but those have since been resolved; PDM's haven't. My guess is that PDM installs the CUDA dependencies separately from PyTorch, so the PyTorch installation doesn't know about them. It's a bummer, because I wanted to give PDM a spin for a new project; for now I'm going to have to stick with Poetry. :/

@yukw777 yukw777 added the 🐛 bug Something isn't working label Feb 22, 2023
@xiaojinwhu

If you use CUDA, add the following to pyproject.toml:

[[tool.pdm.source]]
url = "https://download.pytorch.org/whl/cu116"
verify_ssl = true
name = "torch"

@yukw777
Author

yukw777 commented Feb 24, 2023

(Screenshot omitted)

@xiaojinwhu If you use CUDA 11.7, you actually don't need to add an extra index, as you can see above. That's the problem: it should work without adding that extra index, and it does with pip and Poetry.

@frostming
Collaborator

frostming commented Mar 1, 2023

I am working on a Mac M1, and torch 1.13.1 installs successfully there, without CUDA, so I am afraid I am not able to reproduce this. You could try investigating yourself, or perhaps someone else can help. For example, try to find out why the .so files are missing here but not with other installers (such as pip), and what the differences in the installed files are.

@frostming frostming added ❓ help wanted Extra attention is needed 🤔 not enough info Requires more information to clarify the issue labels Mar 1, 2023
@michaelze

I'm having a similar (probably even the same) problem, and I suspect the install.cache setting is the culprit here (I assume @yukw777 also has it set to true).

I discovered the following issue with the nvidia libraries (nvidia_cublas_cu11, nvidia_cuda_nvrtc_cu11, etc.):

With install.cache turned off, the directory structure is as follows:

nvidia
├ __init__.py
├ cublas
├ cuda_nvrtc
├ cuda_runtime
└ cudnn
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

As soon as you activate install.cache, the directory structure changes:

nvidia -> /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia
nvidia_cublas_cu11-11.10.3.66.dist-info
nvidia_cuda_nvrtc_cu11-11.7.99.dist-info
nvidia_cuda_runtime_cu11-11.7.99.dist-info
nvidia_cudnn_cu11-8.5.0.96.dist-info

The content of /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia is obviously only

__init__.py
cudnn

I hope this issue can be fixed somehow (I don't know how standards-compliant it is for several packages to install into a common package folder), because the nvidia packages are the primary reason I activated install.cache in the first place.

@frostming
Collaborator

frostming commented Mar 3, 2023

@michaelze Thanks for the investigation, but the wheel /root/.cache/pdm/packages/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64/lib/nvidia only contains cudnn itself:
https://pypi-browser.org/package/nvidia-cudnn-cu11/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl

So there might be some other packages that install cuda_* folders into the nvidia package, which can't work with the cache mechanism, where the cache key is the wheel name, as you can see.


Ah, yes, you list the packages below. The problem is that although they share the namespace nvidia, they don't use PEP 420-style implicit namespace packages. When creating symlinks, PDM treats them as different packages and won't create symlinks recursively.


Try setting pdm config install.cache_method pth to see if another method works.

@michaelze

michaelze commented Mar 6, 2023

I tested your suggestion but the problem still persists.

Looking at the PyTorch source code (https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L144) reveals the underlying problem:

  • the code searches for the nvidia folder in every element of sys.path
  • if it is found, the code constructs paths to the required libraries (libcublas and libcudnn) and checks for their existence
  • that existence check is never true for both libraries, only ever for one
  • in the end, the code tries to load the two libraries from the current candidate paths, which obviously fails

So the problem here is, I think, twofold:

  1. Nvidia distributes several packages that install into the same subfolder.
     This makes creating a symlink at the top level impossible; symlinking would have to treat the nvidia packages in a special way (create the parent folder, then create symlinks for the subfolders).
  2. PyTorch uses knowledge of that directory structure directly to load the libraries.
     This also rules out install.cache_method pth. If PyTorch resolved the path to each library per sys.path entry while iterating, it could work...

From looking at the code, PyTorch 2.0.0 might actually work with PDM and install.cache_method pth as the code that loads the cuda libraries iterates all elements of sys.path and looks for the nvidia subfolder and the library in each element individually.

@frostming
Collaborator

  1. Nvidia distributing several packages that install into the same subfolder
    This makes creating a symlink at the top level impossible, symlinking would have to treat the nvidia packages in a special way (create the parent folder, create symlinks for the subfolders)

That special treatment does exist, but for PEP 420 namespace packages (packages without __init__.py), not for one special package, and it isn't going to be added either. The best way to fix this is to remove nvidia/__init__.py from the nvidia distributions.
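The difference can be demonstrated with a self-contained script (demo_ns, alpha, and beta are made-up names for the demo): a regular package with an __init__.py is bound to a single sys.path entry, so sibling installs are shadowed, while a PEP 420 implicit namespace package merges every entry.

```python
import importlib
import os
import sys
import tempfile

def make_tree(root, subpkg, with_init):
    """Create root/demo_ns/<subpkg>/__init__.py; optionally also a blank
    top-level demo_ns/__init__.py (like nvidia/__init__.py)."""
    pkg = os.path.join(root, "demo_ns", subpkg)
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    if with_init:
        open(os.path.join(root, "demo_ns", "__init__.py"), "w").close()

def visible_subpackages(with_init):
    """Install demo_ns.alpha and demo_ns.beta in two separate sys.path
    entries and report which subpackages are importable."""
    a, b = tempfile.mkdtemp(), tempfile.mkdtemp()
    make_tree(a, "alpha", with_init)
    make_tree(b, "beta", with_init)
    sys.path[:0] = [a, b]
    importlib.invalidate_caches()
    found = []
    try:
        for sub in ("alpha", "beta"):
            try:
                importlib.import_module(f"demo_ns.{sub}")
                found.append(sub)
            except ImportError:
                pass
    finally:
        sys.path.remove(a)
        sys.path.remove(b)
        for mod in [m for m in list(sys.modules) if m.split(".")[0] == "demo_ns"]:
            del sys.modules[mod]
    return found
```

With the blank __init__.py, only the subpackage from the first sys.path entry is importable; without it, both merge into one namespace, which is why removing nvidia/__init__.py would fix the layout.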

@JesseFarebro

@michaelze It seems to work for me with pdm config install.cache_method pth but symlink fails as mentioned above.

@ocss884
Copy link

ocss884 commented Aug 4, 2023

If you install torch from PyPI, the full version name is 1.13.x+cu117, and the following CUDA dependencies are shipped together with the torch installation via PyPI (those under the nvidia folder):

  • cublas
  • cuda_nvrtc
  • cuda_runtime
  • cudnn

See this function https://github.com/pytorch/pytorch/blob/v1.13.1/torch/__init__.py#L163.
When importing torch==1.13.x (when >=2.0.0, the loading mechanism is different), the logic is:

  1. The program first loads dependencies via libtorch_global_deps.so; the following CUDA dependencies are checked during this procedure: libcublas.so.11, libcudnn.so.8 and libnvToolsExt.so.1.
  2. If an OSError occurs, it checks whether the error was caused by a missing libcublas.so.11; if so, it searches for libcublas and libcudnn by walking sys.path, otherwise it re-raises the error.

If you have a local CUDA toolkit 11.7 installation (it may also work for 11.8, as long as libcublas.so.11 can be found) and have configured LD_LIBRARY_PATH correctly, all of these CUDA dependencies can in fact be found except cudnn. So regardless of your install.cache_method setting, the first three files can always be found. But, because of the logic above, when the OSError is caused by a missing cudnn while libcublas is present, torch will not search for cudnn.

If you don't have a local toolkit installation, libcublas is also missing in the first place, so the pth config can help the program find all the CUDA dependencies.

It seems that the torch wheel from the PyTorch website always ships the necessary cudnn.so.* files under the torch/lib directory, which don't exist when downloading from PyPI. If you have a local CUDA installation, try downloading from their website, e.g.:

pdm add https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl

The disadvantage is that torch cannot be cached this way; but if you are using torch from PyPI without a local toolkit installation, I'm not sure whether all of torch's functionality can be used anyway.

@Ttayu

Ttayu commented Oct 20, 2023

It doesn't work with the latest PDM or PyTorch.
If the root problem is really on Nvidia's side, PyTorch users would still be happier with some kind of workaround.

For now I'm compelled to run a script like the following, copying libraries directly into the cache:

cp -r /home/user/.cache/pdm/packages/nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64/lib/nvidia/nccl /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64/lib/nvidia/nvtx /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/
cp -r /home/user/.cache/pdm/packages/nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64/lib/nvidia/cufft /home/user/.cache/pdm/packages/nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64/lib/nvidia/

...

This works, but users shouldn't have to do it themselves.

For example, would it be possible to add a workaround that copies only the nvidia libraries (explicitly named, e.g. in pdm.toml) directly instead of symlinking them via cache_method?

By the way, in my environment, pdm config install.cache_method pth did not work.

@frostming
Collaborator

Can anyone in this thread check whether the issue still exists on the latest PDM? Much appreciated.

@Ttayu

Ttayu commented Nov 23, 2023

Yes, this occurs even on the latest PDM (2.10.3).

@frostming
Collaborator

frostming commented Nov 23, 2023

Fine, I'll paste the relevant code comment to give more insight into why it happens:

def _create_symlinks_recursively(source: str, destination: str) -> Iterable[str]:
    """Create symlinks recursively from source to destination. In the following ways:

    package <-- link
        __init__.py

    namespace_package <-- mkdir
        foo.py <-- link
        bar.py <-- link
    """

PDM only looks at the children if the parent directory is a namespace package, and PDM detects a namespace package based on these rules:

_namespace_package_lines = frozenset(
    [
        # pkg_resources style
        "__import__('pkg_resources').declare_namespace(__name__)",
        "pkg_resources.declare_namespace(__name__)",
        "declare_namespace(__name__)",
        # pkgutil style
        "__path__ = __import__('pkgutil').extend_path(__path__, __name__)",
        "__path__ = pkgutil.extend_path(__path__, __name__)",
        "__path__ = extend_path(__path__, __name__)",
    ]
)

So if a package breaks that assumption, PDM doesn't know how to create symlinks properly. I don't think it's something PDM can fix; you'd need to disable install.cache for such packages.

@Ttayu

Ttayu commented Nov 23, 2023

Yes, I understand that PDM is NOT the main cause.
However, as a PyTorch user: would it be possible to download or copy torch and nvidia directly into __pypackages__ (cp -r ~/.cache/pdm/packages/somelib/somelib) while placing the other libraries under __pypackages__ via symlinks? Is that difficult to realize? (If it's not worth the effort, feel free to close.)

With a symlinked torch, _C.cpython-3x-x86_64-linux-gnu.so under __pypackages__/3.x/lib/torch cannot find the .so libraries it depends on in __pypackages__/3.x/lib/nvidia; with a symlinked nvidia, there is the folder-structure problem. Both problems are caused by symlinking.

@frostming
Collaborator

frostming commented Nov 23, 2023

The main cause is that nvidia is a normal package with a blank __init__.py, in which case PDM creates a single symlink for the whole directory. Maybe we can implement a different link strategy that forces PDM to create a symlink for each individual file.

@Ttayu

Ttayu commented Nov 23, 2023

The PyTorch-side problem is clearly a separate issue.

It occurs whenever lib/torch is a symlink, regardless of whether lib/nvidia is a real directory or a symlink. It might be better to open a separate issue for it.

The workaround is to copy everything (i.e. not use the cache method), but I would like to keep taking advantage of the wonderful feature of linking from the cache.

@frostming frostming removed the 🤔 not enough info Requires more information to clarify the issue label Nov 23, 2023
@frostming frostming added this to the 2.11.0 milestone Nov 23, 2023
@frostming frostming linked a pull request Nov 23, 2023 that will close this issue
@frostming frostming removed the ❓ help wanted Extra attention is needed label Nov 29, 2023
@ae9is

ae9is commented Jan 11, 2024

For anyone coming here off search engines... I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

pdm config --local install.cache_method symlink_individual

@fancyerii

fancyerii commented Feb 4, 2024

For anyone coming here off search engines... I wiped my lock file and .venv, and the following worked for me (thanks to #2425!):

pdm config --local install.cache_method symlink_individual

This still doesn't work with PyTorch 2.2.0 and the latest PDM. I tried symlink_individual, hardlink, and pth (I can't find pth in the documentation; maybe it was removed in a newer version of PDM?) and none of them worked.

@fancyerii

fancyerii commented Feb 4, 2024

The issue still exists with PyTorch 2.2 and PDM 2.12.3; see #2614.
