_disable_stats doesn't work. wandb.init(settings=wandb.Settings(_disable_stats=True)) It still sends stats to WANDB, which in turn leads to BSOD due to incompatibility with the old PYNVML dependency in the vendor folder. #3597

Open
anatolii-kotov opened this issue Apr 29, 2022 · 31 comments
Labels
a:cli Area: Client c:system-metrics s:nexus-fix Stage: will be fixed with the new sdk backend

Comments

@anatolii-kotov

anatolii-kotov commented Apr 29, 2022

_disable_stats doesn't work. wandb.init(settings=wandb.Settings(_disable_stats=True)) It still sends stats to WANDB, which in turn leads to BSOD due to incompatibility with the old PYNVML dependency in the vendor folder.

Originally posted by @CosmicHazel in #473 (comment)

Can confirm that this is causing a BSOD on Windows with an Nvidia GPU and the latest drivers. And since there's no way to disable it, there's practically no way to use wandb on Windows.

@exalate-issue-sync

Leslie commented:
Hi Anatolii, thank you for bringing this up with us. Can you tell me the use case for using this hidden method?

@exalate-issue-sync

Leslie commented:
Hi,

We wanted to follow up with you regarding your support request as we have not heard back from you. Please let us know if we can be of further assistance or if your issue has been resolved.

@dmitryduev
Member

dmitryduev commented May 5, 2022

Hi @anatolii-kotov, thanks for reporting this! This issue was solved in #3510 and shipped with the latest release 2 days ago (https://github.com/wandb/client/releases/tag/v0.12.16). Could you please give it a try and let us know whether it worked for you?

@anatolii-kotov
Author

Hi @dmitryduev, I tried the latest version but am still getting a BSOD. This time it appeared after everything had been sent via the wandb API. System: Win11 22000.613 & Nvidia 512.59

@anatolii-kotov
Author

anatolii-kotov commented May 8, 2022

> Could you please give it a try and let us know whether it worked for you?

yes, I tried it with the latest version, still getting BSOD

@dmitryduev
Member

thanks for the update @anatolii-kotov, we'll look into this.

@benjamincburns

benjamincburns commented Jun 6, 2022

@dmitryduev has there been any progress on this issue?

If it helps any, it seems to be an issue related to pynvml and NVIDIA drivers later than version 472.12. That is, any package that calls pynvml.nvmlInit seems to cause a BSOD for me. This includes tools like nvitop, etc.

I haven't tested drivers prior to 472.12, however - so it might require an exact match, or it might just be a regression and that's the last good version.
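
To check whether the crash is independent of wandb, the NVML initialisation can be isolated in a few lines. This is a minimal sketch, assuming the `nvidia-ml-py` package (imported as `pynvml`) is installed; `probe_nvml` is a hypothetical helper name, not part of wandb or pynvml. Note that on an affected machine, running it against the real `pynvml` module is expected to trigger the same BSOD, so save your work first.

```python
def probe_nvml(nvml):
    """Initialise NVML, read the driver version, and shut down again.

    `nvml` is any module-like object exposing nvmlInit,
    nvmlSystemGetDriverVersion and nvmlShutdown (for example `pynvml`).
    """
    nvml.nvmlInit()  # the call reported in this thread to trigger the crash
    try:
        version = nvml.nvmlSystemGetDriverVersion()
        # Older pynvml releases return bytes, newer ones return str.
        if isinstance(version, bytes):
            version = version.decode("utf-8")
        return version
    finally:
        # Always release NVML, even if the version query fails.
        nvml.nvmlShutdown()


# Usage on a machine with an NVIDIA GPU (may crash an affected system):
#   import pynvml
#   print(probe_nvml(pynvml))
```

Passing the module in as an argument keeps the helper importable on machines without a GPU and makes it easy to swap in the vendored copy for comparison.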

@benjamincburns

@dmitryduev on further investigation, I think the issue is that the wandb sdk bundles an old version of pynvml in wandb.vendor.pynvml. Rather than bundling the old version, wandb should instead depend on nvidia-ml-py which is maintained by nvidia and updated regularly (last update was 19 May 2022).

@lesliewandb

Thank you for the extra information @benjamincburns! @dmitryduev is going to do more investigation on this early next week

@PeterKeffer

Anything new here? Can't use wandb, because I'm getting BSOD every single time. I got crazy in the beginning, because I didn't know wandb was the issue...

@lesliewandb

I'm so sorry for the wait! I talked to the engineer in charge of this and they mentioned that they would work on it this week

@dmitryduev
Member

dmitryduev commented Aug 19, 2022

Hey all, many thanks for bringing this to our attention and please accept my apologies for it taking us so long to properly look into. We have updated the vendored version of nvidia-ml-py here and that PR has been merged into master.
Could you please try installing wandb from master and let us know if it works now? Would really appreciate that!

@PeterKeffer

Thank you so much! I just tried it remotely, but it looks like it crashed again. I will try it again tomorrow, when I'm at home. I got an RTX 2080 Ti and Intel CPU. My wandb version says: 0.13.2.dev1
python -m pip install git+https://github.com/wandb/wandb correct?

@benjamincburns

benjamincburns commented Aug 22, 2022

I participate in a reinforcement learning community via discord, and the common (horribly ugly) workaround in that group is to edit the wandb code in site-packages and comment out any calls to pynvml.nvmlInit. Given @PeterKeffer's results, perhaps a code change that has the same effect is also warranted? That would at least give a workaround to people who experience this issue who don't want to stay stuck on a nearly year-old driver version.

Also I see from #4109 that you weren't able to repro the original problem @dmitryduev? Do you happen to have access to a computer with a 2080 Ti? It seems to repro reliably with that card on either Windows 10 or Windows 11. Also for the purpose of reproducing the problem, I would avoid driver versions 472.12, and the current 516.94 driver (I've heard from one person that the crash went away on that driver version).

@dmitryduev
Member

Many thanks for the updates, @benjamincburns and @PeterKeffer!

@PeterKeffer: would you mind trying to update the driver to 516.94 and see if it still crashes?

@benjamincburns: I tried repro'ing on a bunch of different Tesla cards on Win 10, 11, and Server 2019, with a number of driver versions within (and outside!) the range you mentioned. Also tried a plain 2080 and it also works. Closing in on a machine with a 2080 Ti, might have an update soon.

In the meantime, to turn off system metrics logging completely (instead of commenting out pynvml calls), could you try

wandb.init(settings=wandb.Settings(_disable_stats=True, _disable_meta=True))

@benjamincburns

benjamincburns commented Aug 23, 2022

> Many thanks for the updates, @benjamincburns and @PeterKeffer!
>
> @PeterKeffer: would you mind trying to update the driver to 516.94 and see if it still crashes?
>
> @benjamincburns: I tried repro'ing on a bunch of different Tesla cards on Win 10, 11, and Server 2019, with a number of driver versions within (and outside!) the range you mentioned. Also tried a plain 2080 and it also works. Closing in on a machine with a 2080 Ti, might have an update soon.

Ah interesting. I'm really curious to know why it doesn't repro for you on all of those boxes. I know Tesla GPUs are using a different driver series, but I wouldn't expect much of any difference between the 2080 and the 2080 Ti. Thanks for going on such a scavenger hunt!

> In the meantime, to turn off sys metrics logging completely (instead of commenting out pynvml calls), could you try
>
> wandb.init(settings=wandb.Settings(_disable_stats=True, _disable_meta=True))

Unfortunately unless there has been a change, per the title of this issue, running with _disable_stats=True wasn't enough (at the time of writing, anyway) to avoid the BSOD. I'll give it another try sometime in the next week and report back, however.

Edit: oh, I see - we need the extra _disable_meta arg. Thanks, I'll make sure to include that when I test next time.

@dmitryduev
Member

@benjamincburns, yea, _disable_stats disables stats collection during a run while _disable_meta turns off probing system hardware at the start of a run (the stuff that is then displayed on the Run Overview page such as the number of CPU cores / number of GPUs etc.), which also tries to init pynvml.
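
Put together, the workaround suggested in this comment looks roughly like the sketch below. `_disable_stats` and `_disable_meta` are the underscore-prefixed (private) settings named in this thread, so they may change or disappear between wandb releases, and other reports in this thread say they did not prevent the crash for everyone. `SAFE_SETTINGS` and `init_without_system_probes` are hypothetical names used only for illustration.

```python
# Private wandb settings named in this thread; not a stable public API.
SAFE_SETTINGS = {
    "_disable_stats": True,  # no system-metrics collection during the run
    "_disable_meta": True,   # no hardware probing at the start of the run
}


def init_without_system_probes(**init_kwargs):
    """Call wandb.init with both system-probing settings turned off."""
    import wandb  # imported lazily so this file loads even without wandb

    settings = wandb.Settings(**SAFE_SETTINGS)
    return wandb.init(settings=settings, **init_kwargs)
```

A call such as `init_without_system_probes(project="my-project")` would then stand in for a plain `wandb.init(project="my-project")`.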

@PeterKeffer

I have great news:
I installed Nvidia driver version 516.94 (Before I had 516.59) and now it doesn't crash anymore! Now I can continue advertising wandb to all my colleagues and friends! I even plan to do a presentation about wandb in one of my courses because nobody knows it, even though they are Deep Learning enthusiasts and wandb is awesome!

@dmitryduev Thank you so much for your efforts! @benjamincburns Also thank you for your valuable inputs! :)
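
The driver versions reported so far in this thread can be summarised as a small check. This is only a sketch of anecdotal reports (472.12 and 516.94 reported as working, versions in between such as 512.59 and 516.59 reported as crashing); it is not an official compatibility range, and `parse_driver_version` and `reported_as_crashing` are hypothetical helpers.

```python
LAST_GOOD = (472, 12)    # last driver reported as working before the crashes
FIRST_FIXED = (516, 94)  # driver reported as no longer crashing


def parse_driver_version(text):
    """Turn a version string such as '516.59' into the tuple (516, 59)."""
    return tuple(int(part) for part in text.strip().split("."))


def reported_as_crashing(text):
    """True if the version lies strictly between the two reported-good drivers."""
    return LAST_GOOD < parse_driver_version(text) < FIRST_FIXED
```

Comparing tuples of integers avoids the pitfalls of comparing version strings or floats directly.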

@benjamincburns

benjamincburns commented Aug 31, 2022

@lesliewandb @dmitryduev why was this issue closed? Running the wandb python client with most nvidia driver versions in use today still causes BSODs.

If the issue is going to be closed as completed you should at least capture notes about the workaround on the troubleshooting FAQ page. Given that I don't see that here, I strongly suspect that many users will continue encountering this problem for quite some time. https://docs.wandb.ai/guides/technical-faq/troubleshooting

@benjamincburns

benjamincburns commented Aug 31, 2022

Right now this is the only thing in the FAQ that addresses crashes caused by WandB's client. I think the lack of clear resolution here really doesn't align with the values being conveyed in this FAQ entry, as a BSOD clearly affects my training run.

[Image: screenshot of the FAQ entry referred to above]

@lesliewandb

Sorry for the confusion about closing the issue. Since I saw that there was no longer a BSOD for @PeterKeffer, I assumed that this was solved. I'll make an internal ticket to get this fixed in the docs.

@benjamincburns

> Sorry for the confusion about closing the issue. Since I saw that there was no longer a BSOD for @PeterKeffer, I assumed that this was solved. I'll make an internal ticket to get this fixed in the docs.

@lesliewandb there are still users, including myself who are reporting BSODs. I don't think this issue should be closed.

@Hzzone

Hzzone commented Sep 6, 2022

The arg _disable_meta still does not work with the latest wandb python client.

@lesliewandb

I understand; however, our engineers have done what could be done on our end. Beyond that, this is an Nvidia drivers + Windows issue, which is why this issue is closed now.

@picpic117

I'd like to claim that this problem still exists now...

@benjamincburns

benjamincburns commented Feb 12, 2024

@lesliewandb there is a way for wandb to prevent the BSOD from occurring on windows - make sure that no NVML calls are made either when the _disable_stats arg is True, or as a result of setting some other TBD arg that is part of the stable public API.

Otherwise my only way of working around this issue today is by hand-editing the wandb client in site-packages to comment out calls to pynvml.
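
The behaviour requested here can be sketched as a guard around NVML initialisation. This is not wandb's actual code: `init_gpu_monitoring` and its `disable_stats` parameter are hypothetical, and `nvml` stands for any module exposing `nvmlInit`, such as `pynvml`.

```python
def init_gpu_monitoring(nvml, disable_stats):
    """Initialise NVML only when stats collection is enabled.

    Returns True if NVML was initialised, False if it was skipped or failed.
    """
    if disable_stats:
        # Never touch NVML when the user has opted out of system stats.
        return False
    try:
        nvml.nvmlInit()
    except Exception:
        # Treat any NVML failure as "no GPU monitoring available".
        return False
    return True
```

The point of the guard is that the opt-out is checked before any NVML call is made, so a user on an affected driver never reaches `nvmlInit` at all.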

@kptkin kptkin reopened this Mar 6, 2024
@kptkin kptkin added a:cli Area: Client s:nexus-fix Stage: will be fixed with the new sdk backend labels Apr 18, 2024
8 participants