Relax the pin on pynvml again #1130

Draft
wants to merge 4 commits into
base: branch-23.04

wence-
Contributor

@wence- wence- commented Feb 24, 2023

Handling the str vs. bytes discrepancy should have been covered by the changes in #1118.

@wence- wence- requested a review from a team as a code owner February 24, 2023 13:06
@wence- wence- added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Feb 24, 2023
@ajschmidt8
Member

There were two pins in the PR below, but only one unpin in this PR.

Should pyproject.toml also be unpinned?

@wence-
Contributor Author

wence- commented Feb 24, 2023

Oh thanks, I suspect so (pushed that change). Thanks for the sharp eyes!

@jakirkham
Member

One CI job is failing with this error:

Unable to start CUDA Context
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/dask_cuda/initialize.py", line 31, in _create_cuda_context
    distributed.comm.ucx.init_once()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/comm/ucx.py", line 136, in init_once
    pre_existing_cuda_context = has_cuda_context()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 219, in has_cuda_context
    if _running_process_matches(handle):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 179, in _running_process_matches
    running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

@jakirkham
Member

Rerunning CI to see if the Dask 2023.2.1 release helped

@wence-
Contributor Author

wence- commented Feb 25, 2023

> Rerunning CI to see if the Dask 2023.2.1 release helped

I imagine the problem is that pynvml has been updated to require a v3 version of a function in NVML, but that version doesn't exist in CUDA 11.2?

@wence-
Contributor Author

wence- commented Feb 28, 2023

This is WIP until a solution for backwards compatibility is decided on in nvidia-ml-py (and/or pynvml). Until then, we should just keep pynvml at 11.4.1.

@jakirkham
Member

Going to double check this, but my understanding is we want PyNVML 11.5 for CUDA 12 support

@pentschev
Member

Agreed, but it seems we need that fix to land in nvidia-ml-py first as we can't work around that in a reasonable manner.

@wence-
Contributor Author

wence- commented Mar 1, 2023

> Going to double check this, but my understanding is we want PyNVML 11.5 for CUDA 12 support

I don't think that is necessary, unless we need features in nvml that were only introduced in cuda 12.

Specifically, I have CTK 12 on my system, I installed pynvml < 11.5, and all the queries work. The C API preserves backwards compatibility, so old versions of pynvml work fine with new versions of libnvidia-ml.so. The problem is the other way round: new versions of pynvml don't work with old versions of libnvidia-ml.so.
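
To make the direction of the incompatibility concrete, here is a minimal sketch of a versioned-symbol fallback. It is not how pynvml or nvidia-ml-py resolves functions; `resolve_versioned_symbol`, the suffix order, and the `OldDriverStandIn` class are hypothetical names introduced only for illustration. The one thing taken from the traceback above is that a `ctypes` library object raises `AttributeError` for an undefined symbol, so the stand-in class mimics that behaviour without needing a GPU.

```python
def resolve_versioned_symbol(lib, base_name, suffixes=("_v3", "_v2", "")):
    """Return (name, function) for the first variant of base_name that lib exports.

    `lib` is any object that raises AttributeError for a missing attribute,
    which is how ctypes.CDLL reports an undefined symbol.
    The suffix order (newest first) is an assumption made for this sketch.
    """
    for suffix in suffixes:
        name = base_name + suffix
        try:
            return name, getattr(lib, name)
        except AttributeError:
            continue  # this variant is not exported; try an older one
    raise AttributeError(f"no variant of {base_name} found (tried {suffixes})")


class OldDriverStandIn:
    """Hypothetical stand-in for an older libnvidia-ml.so that lacks the _v3 symbol."""

    def nvmlDeviceGetComputeRunningProcesses_v2(self):
        return "v2 called"

    def nvmlDeviceGetComputeRunningProcesses(self):
        return "unversioned called"


name, fn = resolve_versioned_symbol(
    OldDriverStandIn(), "nvmlDeviceGetComputeRunningProcesses"
)
print(name)  # nvmlDeviceGetComputeRunningProcesses_v2
```

With a real library, `lib` would be something like `ctypes.CDLL("libnvidia-ml.so.1")`. Picking the symbol is only part of the job: the argument struct may differ between versions of the same function, so a real fix would also have to pair each symbol with its matching struct layout.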

@wence- wence- marked this pull request as draft March 6, 2023 16:42
@pentschev
Member

This is pending resolution of NVBug 4008080.

@pentschev pentschev added the 0 - Blocked (Cannot progress due to external reasons) label Jul 28, 2023