
Different versions of CUDA used for NCCL and torch compilation #1713

Open

tangjiangling opened this issue Mar 1, 2024 · 1 comment

@tangjiangling

Problem Description

The precompiled torch wheels manage some of their dependencies on their own (e.g. CUDA, NCCL, cuDNN), so those get pulled in automatically when we install torch via pip. Say I want to install torch 2.2.0:

$> pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121 

pip will also download nvidia-nccl-cu12==2.19.3, as shown in the following log:

Collecting nvidia-nccl-cu12==2.19.3 (from torch==2.1.0.mt20240224+cu121)
  Downloading https://download.pytorch.org/whl/cu121/nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl (166.0 MB)
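
As a side note, you can quickly list which NVIDIA runtime wheels pip pulled in alongside torch (just a convenience check with standard pip commands):

$> pip3 list | grep -i nvidia
$> pip3 show nvidia-nccl-cu12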

So here's the issue: the NCCL downloaded here was compiled with CUDA 12.3, while torch itself is built against CUDA 12.1.

Although the two are compiled against different CUDA versions, everything works in practice (at least I haven't hit any problems so far), so I'd like to ask whether this inconsistency could be hiding problems I'm not aware of.
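
For reference, both versions can also be read from the installed torch itself. This only reports the CUDA toolkit torch was built with and the bundled NCCL release, not the CUDA version NCCL itself was compiled with, so on the cu121 wheel above it should print something like:

$> python3 -c "import torch; print(torch.version.cuda); print(torch.cuda.nccl.version())"
12.1
(2, 19, 3)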

By the way, we can use nccl-tests to verify which CUDA version NCCL was compiled with:

$> export LD_LIBRARY_PATH=/usr/local/conda/lib/python3.9/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH
$> git clone \
    --recursive \
    --branch v2.13.6 \
    --single-branch \
    --depth 1 \
    https://github.com/NVIDIA/nccl-tests.git
$> cd nccl-tests
$> make -j16
$> NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

With NCCL_DEBUG=INFO set, nccl-tests prints the CUDA version NCCL was compiled with when it runs:

...
NCCL INFO cudaDriverVersion 12010
NCCL version 2.19.3+cuda12.3
...

@bryantbiggs
Contributor

now that containers are mainstream, it would be great to move off of python packaging for NVIDIA artifacts and instead install them on the system (i.e. - in the container, not in a conda environment, virtualenv, etc.)
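
For anyone wanting to try that today, a system-level install inside a CUDA container would look roughly like this (package names are the ones published in NVIDIA's apt repository; in practice you would pin the libnccl2 build that matches your CUDA toolkit rather than take the latest):

$> apt-get update
$> apt-get install -y --no-install-recommends libnccl2 libnccl-dev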
