Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vendor nvidia-ml-py-11.515.48 #4109

Merged
merged 27 commits into from Aug 19, 2022
Merged

Vendor nvidia-ml-py-11.515.48 #4109

merged 27 commits into from Aug 19, 2022

Conversation

dmitryduev
Copy link
Member

@dmitryduev dmitryduev commented Aug 16, 2022

Fixes WB-NNNN
Fixes #NNNN

Description

Context:

Proposed solution:

  • Vendor the Python bindings available at https://pypi.org/project/nvidia-ml-py/

    • Modifications:
      • Added NVML_DLL_PATH env var check to the library initialization path
      • Applied black formatting to improve readability
      • Try different versions ("_v3", "_v2", "") of the functions: nvmlDeviceGetComputeRunningProcesses, nvmlDeviceGetGraphicsRunningProcesses, and nvmlDeviceGetMPSComputeRunningProcesses. Reason: older driver versions don't understand _v3's that is now the only option in the lib, so I had to add this enumeration/trial loop.
  • Add nightly testing on a Win+GPU executor in Circle

  • Add nightly checks for new releases of vendored packages (for now, only this one) (+ anything from our requirements?), posting "interesting findings" to Slack.

Reasoning:

  • https://pypi.org/project/nvidia-ml-py/ seems to be the most up-to-date version of the bindings on PyPI.
  • There is one change that we have in our current vendored version (NVML_DLL_PATH env var check) that is absent in the lib. It might be not necessary, but I did find a few references to it, so it seems safer to keep it there.
  • https://pypi.org/project/nvidia-ml-py seems to lack a GH repo, so I can't submit a PR.
  • There is another project, https://pypi.org/project/pynvml/ that does have a GH repo that is alive, but it is less frequently updated. The main contributor seems to be from NVIDIA, so I'm not sure what its relationship with the other library is.
  • I think the safest option is to vendor the "most official" bindings + add testing around it. If we find the NVML_DLL_PATH stuff unnecessary (check with NVIDIA!), might as well rm the vendored version and add a requirement instead.
  • Thought about monkey-patching the relevant code path, but that seems risky as it's the main global function that loads the module that needs patching, it gets too funky to foresee all the possible scenarios.

Random note:
Get the latest available version of pytorch with the CUDA support that you need:

pytorch_v=`pip install --extra-index-url https://download.pytorch.org/whl/cu101 torch== 2>&1 | grep -oE '(\(.*\))' | awk -F:\  '{print$NF}' | sed -E 's/( |\))//g' | tr ',' '\n' | grep cu101 | tail -1`

Testing

Manual nightly: ensured that polling GPU metrics on both lin/gpu and win/gpu works with the new bindings:
https://app.circleci.com/pipelines/github/wandb/wandb/14989/workflows/761624a2-c9fc-45fe-98c9-ca5b0a0c31f4

Checklist

  • Include reference to internal ticket "Fixes WB-NNNN" (and github issue "Fixes #NNNN" if applicable)

@codecov
Copy link

codecov bot commented Aug 16, 2022

Codecov Report

Merging #4109 (6fdd136) into master (9c77726) will decrease coverage by 0.02%.
The diff coverage is n/a.

❗ Current head 6fdd136 differs from pull request most recent head b1cd149. Consider uploading reports for the commit b1cd149 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4109      +/-   ##
==========================================
- Coverage   82.71%   82.68%   -0.03%     
==========================================
  Files         256      256              
  Lines       32534    32512      -22     
==========================================
- Hits        26909    26884      -25     
- Misses       5625     5628       +3     
Flag Coverage Δ
functest 55.00% <ø> (+<0.01%) ⬆️
unittest 73.43% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
wandb/sdk/internal/meta.py 90.79% <ø> (+3.06%) ⬆️
wandb/sdk/wandb_watch.py 74.54% <0.00%> (-6.94%) ⬇️
wandb/sdk/lib/sock_client.py 89.92% <0.00%> (-2.67%) ⬇️
wandb/wandb_torch.py 59.93% <0.00%> (-1.36%) ⬇️
wandb/filesync/step_prepare.py 93.67% <0.00%> (-1.27%) ⬇️
wandb/sdk/internal/file_stream.py 88.85% <0.00%> (-1.02%) ⬇️
wandb/sdk/wandb_run.py 90.94% <0.00%> (-0.23%) ⬇️
wandb/beta/workflows.py 93.54% <0.00%> (-0.11%) ⬇️
wandb/apis/reports/reports.py 88.92% <0.00%> (-0.02%) ⬇️
wandb/env.py 74.89% <0.00%> (ø)
... and 114 more

@dmitryduev dmitryduev requested a review from a team August 16, 2022 08:53
tox.ini Show resolved Hide resolved
Copy link
Contributor

@kptkin kptkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

@raubitsj raubitsj self-requested a review August 18, 2022 18:45
Copy link
Member

@raubitsj raubitsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for going the extra mile on the testing and picking a good path vendor vs ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants