Relax the pin on pynvml again #1130

wence- · 2023-02-24T13:06:57Z

Handling the str vs. bytes discrepancy should have been covered by the changes in #1118.

ajschmidt8 · 2023-02-24T15:29:29Z

There were two pins in the PR below, but only one unpin in this PR.

https://github.com/rapidsai/dask-cuda/pull/1128/files

Should pyproject.toml also be unpinned?

wence- · 2023-02-24T17:35:22Z

Oh thanks, I suspect so (pushed that change). Thanks for the sharp eyes!

jakirkham · 2023-02-24T21:33:06Z

One CI job is failing with this error:

Unable to start CUDA Context
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
    func = self.__getitem__(name)
  File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/dask_cuda/initialize.py", line 31, in _create_cuda_context
    distributed.comm.ucx.init_once()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/comm/ucx.py", line 136, in init_once
    pre_existing_cuda_context = has_cuda_context()
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 219, in has_cuda_context
    if _running_process_matches(handle):
  File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 179, in _running_process_matches
    running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v3(handle);
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
  File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

jakirkham · 2023-02-25T04:40:34Z

Rerunning CI to see if the Dask 2023.2.1 release helped

wence- · 2023-02-25T09:22:51Z

> Rerunning CI to see if the Dask 2023.2.1 release helped

I imagine the problem is that pynvml has been updated to require a v3 version of a function in nvml, but that doesn't exist in cuda 11.2?
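One way to check that hypothesis on a given machine is to ask the driver library directly whether it exports the symbol. The snippet below is a minimal sketch, not code from this PR: the helper name `nvml_has_symbol` is hypothetical, and the default library name is taken from the traceback above. It relies on the same `ctypes` attribute lookup that raised the `AttributeError` in that traceback.

```python
import ctypes


def nvml_has_symbol(name, libname="libnvidia-ml.so.1"):
    """Report whether the NVML shared library exports `name`.

    Returns True if the symbol is present, False if it is missing,
    and None if the library itself cannot be loaded (e.g. no driver).
    `nvml_has_symbol` is a hypothetical helper, not part of pynvml.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        # Library not installed or not loadable on this machine.
        return None
    # ctypes raises AttributeError for an undefined symbol, so hasattr
    # turns the lookup into a plain boolean.
    return hasattr(lib, name)


if __name__ == "__main__":
    print(nvml_has_symbol("nvmlDeviceGetComputeRunningProcesses_v3"))
```

If this prints `False` on the failing CI image, the installed driver library predates the `_v3` entry point, which would match the `Function Not Found` error.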

wence- · 2023-02-28T16:36:57Z

This is WIP until a solution for backwards compatibility is decided on in nvidia-ml-py (and/or pynvml). Until then we should just keep pynvml at 11.4.1.

jakirkham · 2023-02-28T23:07:27Z

Going to double check this, but my understanding is we want PyNVML 11.5 for CUDA 12 support

pentschev · 2023-03-01T08:32:26Z

Agreed, but it seems we need that fix to land in nvidia-ml-py first as we can't work around that in a reasonable manner.

wence- · 2023-03-01T10:45:52Z

> Going to double check this, but my understanding is we want PyNVML 11.5 for CUDA 12 support

I don't think that is necessary, unless we need features in nvml that were only introduced in cuda 12.

Specifically, I have CTK 12 on my system, I install pynvml < 11.5, and all the queries work. The C API preserves backwards compatibility, so old versions of pynvml work fine with new versions of libnvidia-ml.so. The problem is the other way round: new versions of pynvml don't work with old versions of libnvidia-ml.so.
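The asymmetry can be illustrated with a toy model. This is only a sketch under stated assumptions: the `_v3` symbol name comes from the traceback above, the `_v2` name is assumed to be the older entry point, and the export sets, `FunctionNotFound`, `lookup`, and `binding_works` are hypothetical stand-ins rather than anything from pynvml or NVML.

```python
# Toy model of symbol lookup between a Python binding and a driver library.
# A "driver" is modelled as the set of symbol names it exports; a "binding"
# is modelled as the single symbol name it insists on resolving.

V2 = "nvmlDeviceGetComputeRunningProcesses_v2"  # assumed older entry point
V3 = "nvmlDeviceGetComputeRunningProcesses_v3"  # name seen in the traceback

# A newer driver keeps the old symbol and adds the new one
# (backwards compatibility of the C API).
OLD_DRIVER = {V2}
NEW_DRIVER = OLD_DRIVER | {V3}


class FunctionNotFound(Exception):
    """Stand-in for the 'Function Not Found' error in the traceback."""


def lookup(exports, symbol):
    """Resolve `symbol` against a driver's exports or raise."""
    if symbol not in exports:
        raise FunctionNotFound(symbol)
    return symbol


def binding_works(exports, required_symbol):
    """Return True if a binding needing `required_symbol` can run."""
    try:
        lookup(exports, required_symbol)
    except FunctionNotFound:
        return False
    return True


if __name__ == "__main__":
    print("old binding, new driver:", binding_works(NEW_DRIVER, V2))
    print("new binding, old driver:", binding_works(OLD_DRIVER, V3))
```

In this model the old binding resolves its symbol against the new driver, while the new binding fails against the old driver, which is the failure mode seen in CI.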

pentschev · 2023-07-28T19:45:34Z

This is pending resolution of NVBug 4008080.

Relax the pin on pynvml again

703497a

Handling the str vs. bytes discrepancy should have been covered by the changes in rapidsai#1118.

wence- requested a review from a team as a code owner February 24, 2023 13:06

wence- mentioned this pull request Feb 24, 2023

Merge branch-23.02 into branch-23.04 #1128

Merged

wence- added the improvement and non-breaking labels Feb 24, 2023

And in pyproject

0d52fd9

ajschmidt8 approved these changes Feb 24, 2023

pentschev added 2 commits February 28, 2023 10:10

Merge branch 'branch-23.04' into wence/relax-pynvml-pin

718d8df

Merge branch 'branch-23.04' into wence/relax-pynvml-pin

1deff4c

jakirkham mentioned this pull request Feb 28, 2023

Dask-CUDA: CUDA 12 Conda Packages #1115

Closed

wence- marked this pull request as draft March 6, 2023 16:42

pentschev added the 0 - Blocked label Jul 28, 2023
