Vendor nvidia-ml-py-11.515.48
#4109
Codecov Report
@@ Coverage Diff @@
## master #4109 +/- ##
==========================================
- Coverage 82.71% 82.68% -0.03%
==========================================
Files 256 256
Lines 32534 32512 -22
==========================================
- Hits 26909 26884 -25
- Misses 5625 5628 +3
🥇
Looks great, thanks for going the extra mile on the testing and picking a good path vendor vs ?
Fixes WB-NNNN
Fixes #NNNN
Description
Context:
- `_disable_stats` doesn't work: `wandb.init(settings=wandb.Settings(_disable_stats=True))` still sends stats to WANDB, which in turn leads to a BSOD due to an incompatibility with the old pynvml dependency in the vendor folder.
- See #3597 and #3819 ("[App]: PAGE_FAULT_IN_NONPAGED_AREA Win10 blue screen of death error with pytorch lightning"), which might be related to that fact.

Proposed solution:
- Vendor the Python bindings available at https://pypi.org/project/nvidia-ml-py/
- Add the `NVML_DLL_PATH` env var check to the library initialization path
- Add an enumeration/trial loop over the function versions for `nvmlDeviceGetComputeRunningProcesses`, `nvmlDeviceGetGraphicsRunningProcesses`, and `nvmlDeviceGetMPSComputeRunningProcesses`. Reason: older driver versions don't understand the `_v3` variants, which are now the only option in the lib.
- Add nightly testing on a Win+GPU executor in Circle
- Add nightly checks for new releases of vendored packages (for now, only this one) (+ anything from our requirements?), posting "interesting findings" to Slack.
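For reviewers, here is a minimal sketch of the two ideas in the list above: honoring `NVML_DLL_PATH` when locating the library, and trying function versions from newest to oldest. It is not the vendored code. The helper names (`candidate_nvml_paths`, `resolve_versioned`), the default `nvml.dll` path, and the suffix order are assumptions made for illustration, and a stand-in object replaces the real NVML library.

```python
import os


def candidate_nvml_paths(environ=None):
    """Return library paths to try, with NVML_DLL_PATH first if it is set."""
    environ = os.environ if environ is None else environ
    candidates = []
    override = environ.get("NVML_DLL_PATH")
    if override:
        candidates.append(override)
    # Hypothetical default; real bindings compute platform-specific locations.
    candidates.append("nvml.dll")
    return candidates


def resolve_versioned(lib, base_name, suffixes=("_v3", "_v2", "")):
    """Return (name, function) for the first variant of base_name that lib exposes.

    The suffix order (newest first) is an assumption of this sketch.
    """
    for suffix in suffixes:
        name = base_name + suffix
        func = getattr(lib, name, None)  # missing attributes fall through
        if func is not None:
            return name, func
    raise AttributeError(f"no variant of {base_name} found")


class OldDriver:
    """Stand-in for a library loaded from an older driver: no _v3 symbol."""

    def nvmlDeviceGetComputeRunningProcesses_v2(self, handle):
        return []


name, func = resolve_versioned(OldDriver(), "nvmlDeviceGetComputeRunningProcesses")
print(name)  # nvmlDeviceGetComputeRunningProcesses_v2
```

The lookup relies on `getattr` with a default, so it works with any object that raises `AttributeError` for a missing symbol.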
Reasoning:
- Vendoring keeps a change (the `NVML_DLL_PATH` env var check) that is absent in the lib. It might not be necessary, but I did find a few references to it, so it seems safer to keep it there.
- If the `NVML_DLL_PATH` stuff turns out to be unnecessary (check with NVIDIA!), might as well rm the vendored version and add a requirement instead.

Random note:
Get the latest available version of pytorch with the CUDA support that you need:
pytorch_v=`pip install --extra-index-url https://download.pytorch.org/whl/cu101 torch== 2>&1 | grep -oE '(\(.*\))' | awk -F:\ '{print$NF}' | sed -E 's/( |\))//g' | tr ',' '\n' | grep cu101 | tail -1`
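The text-processing half of that one-liner can also be sketched in Python. This assumes pip's error output contains a `(from versions: ...)` list, which depends on the pip version in use. The function name `latest_matching_version` is hypothetical and the sample string is illustrative, not captured output.

```python
import re


def latest_matching_version(pip_output, tag):
    """Return the last listed version containing `tag`, or None if none match."""
    match = re.search(r"\(from versions: ([^)]*)\)", pip_output)
    if match is None:
        return None
    versions = [v.strip() for v in match.group(1).split(",")]
    matching = [v for v in versions if tag in v]
    return matching[-1] if matching else None


# Illustrative sample only; real pip output may differ between pip versions.
sample = (
    "ERROR: Could not find a version that satisfies the requirement torch== "
    "(from versions: 1.6.0+cu101, 1.7.0+cu101, 1.7.0+cu110)"
)
print(latest_matching_version(sample, "cu101"))  # 1.7.0+cu101
```

Like the shell pipeline, this takes the last matching entry as the latest, so it relies on the order in which pip lists versions.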
Testing
Manual nightly: ensured that polling GPU metrics on both lin/gpu and win/gpu works with the new bindings:
https://app.circleci.com/pipelines/github/wandb/wandb/14989/workflows/761624a2-c9fc-45fe-98c9-ca5b0a0c31f4