Vendor `nvidia-ml-py-11.515.48` #4109

dmitryduev · 2022-08-16T08:08:19Z

Fixes WB-NNNN
Fixes #NNNN

Description

Context:

We have been using an outdated version (from 2015!) of NVIDIA's Python bindings for the NVML library.
Some users report problems with recent NVIDIA drivers on Windows resulting in a BSoD, which might be related to the outdated bindings. See, e.g.:
- #3597: "_disable_stats doesn't work. wandb.init(settings=wandb.Settings(_disable_stats=True)) It still sends stats to WANDB, which in turn leads to BSOD due to incompatibility with the old PYNVML dependency in the vendor folder."
- #3819: "[App]: PAGE_FAULT_IN_NONPAGED_AREA Win10 blue screen of death error with pytorch lightning"
- I tested a range of Windows + NVIDIA driver version combinations but didn't manage to reproduce the reported issues :`(

Proposed solution:

Vendor the Python bindings available at https://pypi.org/project/nvidia-ml-py/
- Modifications:
  - Added NVML_DLL_PATH env var check to the library initialization path
  - Applied black formatting to improve readability
  - Try different versions ("_v3", "_v2", "") of the functions nvmlDeviceGetComputeRunningProcesses, nvmlDeviceGetGraphicsRunningProcesses, and nvmlDeviceGetMPSComputeRunningProcesses. Reason: older driver versions don't understand the _v3 variants, which are now the only option in the lib, so I had to add this enumeration/trial loop.
Add nightly testing on a Win+GPU executor in Circle
Add nightly checks for new releases of vendored packages (for now, only this one; possibly also anything from our requirements?), posting "interesting findings" to Slack.
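
For illustration, the enumeration/trial loop over the "_v3", "_v2", and "" variants could look roughly like the sketch below. This is not the vendored code: `resolve_versioned_symbol` is a hypothetical helper name, and the stand-in object only mimics how a ctypes library handle raises AttributeError for a symbol the loaded driver library does not export.

```python
from types import SimpleNamespace


def resolve_versioned_symbol(lib, base_name, suffixes=("_v3", "_v2", "")):
    """Return (name, symbol) for the first variant of base_name that lib exports.

    Hypothetical helper: tries the newest suffix first and falls back to
    older ones, which is the idea behind the trial loop described above.
    """
    for suffix in suffixes:
        name = base_name + suffix
        try:
            # A ctypes library handle raises AttributeError when the shared
            # library does not export the requested symbol.
            return name, getattr(lib, name)
        except AttributeError:
            continue
    raise AttributeError(f"no variant of {base_name} found (tried {suffixes})")


# Stand-in for an older driver that only exports the _v2 entry point.
old_driver = SimpleNamespace(
    nvmlDeviceGetComputeRunningProcesses_v2=lambda: "v2"
)
name, func = resolve_versioned_symbol(
    old_driver, "nvmlDeviceGetComputeRunningProcesses"
)
print(name)
```

Trying the newest variant first means up-to-date drivers keep using the current entry point, while older drivers fall back without raising at import time.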

Reasoning:

https://pypi.org/project/nvidia-ml-py/ seems to be the most up-to-date version of the bindings on PyPI.
There is one change that we have in our current vendored version (NVML_DLL_PATH env var check) that is absent in the lib. It might not be necessary, but I did find a few references to it, so it seems safer to keep it there.
https://pypi.org/project/nvidia-ml-py seems to lack a GH repo, so I can't submit a PR.
There is another project, https://pypi.org/project/pynvml/ that does have a GH repo that is alive, but it is less frequently updated. The main contributor seems to be from NVIDIA, so I'm not sure what its relationship with the other library is.
I think the safest option is to vendor the "most official" bindings and add testing around them. If we find the NVML_DLL_PATH stuff unnecessary (check with NVIDIA!), we might as well remove the vendored version and add a requirement instead.
I thought about monkey-patching the relevant code path, but that seems risky: what needs patching is the main global function that loads the module, and it gets too funky to foresee all the possible scenarios.
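
As a rough sketch of what an NVML_DLL_PATH check in the library initialization path can amount to, the snippet below builds an ordered list of library locations to try. The function name `nvml_library_candidates`, the precedence of the override, and the default locations are assumptions made for illustration; they are not taken from this PR's diff.

```python
import os
import sys


def nvml_library_candidates(environ=None, platform=None):
    """Return NVML library locations to try, in order (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    candidates = []
    # Assumption: a user-supplied NVML_DLL_PATH takes precedence over defaults.
    override = environ.get("NVML_DLL_PATH")
    if override:
        candidates.append(override)

    if platform.startswith("win"):
        # Assumed default install locations on Windows.
        windir = environ.get("WINDIR", "C:/Windows")
        program_files = environ.get("ProgramFiles", "C:/Program Files")
        candidates.append(windir + "/System32/nvml.dll")
        candidates.append(program_files + "/NVIDIA Corporation/NVSMI/nvml.dll")
    else:
        # Assumed shared-object name on Linux, resolved by the dynamic loader.
        candidates.append("libnvidia-ml.so.1")
    return candidates


print(nvml_library_candidates({"NVML_DLL_PATH": "D:/custom/nvml.dll"}, "win32"))
```

Keeping the path resolution separate from the actual load call makes it possible to test the override logic on a machine without a GPU or driver installed.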

Random note:
Get the latest available version of pytorch with the CUDA support that you need:

pytorch_v=`pip install --extra-index-url https://download.pytorch.org/whl/cu101 torch== 2>&1 | grep -oE '(\(.*\))' | awk -F:\  '{print$NF}' | sed -E 's/( |\))//g' | tr ',' '\n' | grep cu101 | tail -1`
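
The same idea can be sketched in Python: take the error text that pip prints for an unsatisfiable requirement, pull out the "(from versions: ...)" list, and keep the last entry containing the wanted CUDA tag. The helper name `latest_matching_version` and the sample text are hypothetical, and whether pip prints such a list for `torch==` depends on the pip version, so treat this as an illustration of the parsing step only.

```python
import re


def latest_matching_version(pip_output, tag):
    """Return the last listed version containing tag, or None if there is none."""
    match = re.search(r"\(from versions: ([^)]*)\)", pip_output)
    if match is None:
        return None
    versions = [v.strip() for v in match.group(1).split(",")]
    matching = [v for v in versions if tag in v]
    # Like `tail -1` in the shell one-liner, this relies on pip listing
    # versions in ascending order.
    return matching[-1] if matching else None


# Made-up sample text in the shape of pip's error message.
sample = (
    "ERROR: Could not find a version that satisfies the requirement torch== "
    "(from versions: 1.4.0, 1.7.1+cu101, 1.8.0+cu101, 1.8.1+cu101, 1.8.1+cu111)"
)
print(latest_matching_version(sample, "cu101"))
```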

Testing

Manual nightly: ensured that polling GPU metrics on both lin/gpu and win/gpu works with the new bindings:
https://app.circleci.com/pipelines/github/wandb/wandb/14989/workflows/761624a2-c9fc-45fe-98c9-ca5b0a0c31f4

Checklist

Include reference to internal ticket "Fixes WB-NNNN" (and github issue "Fixes #NNNN" if applicable)

codecov · 2022-08-16T08:27:09Z

Codecov Report

Merging #4109 (6fdd136) into master (9c77726) will decrease coverage by 0.02%.
The diff coverage is n/a.

❗ Current head 6fdd136 differs from pull request most recent head b1cd149. Consider uploading reports for the commit b1cd149 to get more accurate results

@@            Coverage Diff             @@
##           master    #4109      +/-   ##
==========================================
- Coverage   82.71%   82.68%   -0.03%     
==========================================
  Files         256      256              
  Lines       32534    32512      -22     
==========================================
- Hits        26909    26884      -25     
- Misses       5625     5628       +3

| Flag | Coverage Δ | |
|---|---|---|
| functest | `55.00% <ø> (+<0.01%)` | ⬆️ |
| unittest | `73.43% <ø> (-0.02%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| wandb/sdk/internal/meta.py | `90.79% <ø> (+3.06%)` | ⬆️ |
| wandb/sdk/wandb_watch.py | `74.54% <0.00%> (-6.94%)` | ⬇️ |
| wandb/sdk/lib/sock_client.py | `89.92% <0.00%> (-2.67%)` | ⬇️ |
| wandb/wandb_torch.py | `59.93% <0.00%> (-1.36%)` | ⬇️ |
| wandb/filesync/step_prepare.py | `93.67% <0.00%> (-1.27%)` | ⬇️ |
| wandb/sdk/internal/file_stream.py | `88.85% <0.00%> (-1.02%)` | ⬇️ |
| wandb/sdk/wandb_run.py | `90.94% <0.00%> (-0.23%)` | ⬇️ |
| wandb/beta/workflows.py | `93.54% <0.00%> (-0.11%)` | ⬇️ |
| wandb/apis/reports/reports.py | `88.92% <0.00%> (-0.02%)` | ⬇️ |
| wandb/env.py | `74.89% <0.00%> (ø)` | |

... and 114 more

.circleci/config.yml

tox.ini

wandb/vendor/pynvml/pynvml.py

kptkin

🥇

raubitsj

Looks great, thanks for going the extra mile on the testing and for picking a good path (vendor vs ?)

dmitryduev added 3 commits August 16, 2022 00:48

vendor nvidia-ml-py-11.515.48

875e66d

vendor nvidia-ml-py-11.515.48

5b9ffd7

Merge branch 'master' of https://github.com/wandb/wandb into bsod

fa7443f

dmitryduev added 6 commits August 16, 2022 01:31

add win/gpu testing

d3c2ebd

add win/gpu testing

459af75

add win/gpu testing

1ae0635

add win/gpu testing

cd0e436

add win/gpu testing

388b1e0

add win/gpu testing

66475b5

dmitryduev requested a review from a team August 16, 2022 08:53

dmitryduev and others added 13 commits August 16, 2022 02:00

add win/gpu testing

9e0fe73

unskip relevant tests on win; add note to pynvml

96e61ae

move tests to nightly

b53f980

more fixes to the vendored pynvml

9d677cf

fix wincovercircle tox testenv

f262453

fix win job

66ff482

fix win job

80c29d7

stop wasting time on a useless test case

ee11cbb

Merge branch 'master' of https://github.com/wandb/wandb into bsod

7426112

Merge branch 'master' into bsod

909470d

Merge branch 'bsod' of https://github.com/wandb/wandb into bsod

5831864

fix wincovercircle

82ed159

fix wincovercircle

d7af173

dmitryduev commented Aug 17, 2022

.circleci/config.yml

dmitryduev commented Aug 17, 2022

tox.ini

dmitryduev commented Aug 17, 2022

wandb/vendor/pynvml/pynvml.py

dmitryduev added 3 commits August 17, 2022 13:47

crank up timeout for win tests

4f2893a

crank up timeout for win tests

0ebac7f

update config.yml

c5e2d1b

Merge branch 'master' of https://github.com/wandb/wandb into bsod

6fdd136

kptkin approved these changes Aug 18, 2022

raubitsj self-requested a review August 18, 2022 18:45

raubitsj approved these changes Aug 18, 2022

Merge branch 'master' into bsod

b1cd149

dmitryduev enabled auto-merge (squash) August 19, 2022 08:47

dmitryduev disabled auto-merge August 19, 2022 08:47

dmitryduev merged commit ead6a1e into master Aug 19, 2022

dmitryduev deleted the bsod branch August 19, 2022 08:50
