
GPU Server Loses GPU #1952

Open
StanHatko opened this issue Feb 29, 2024 · 8 comments
Labels: kind/bug (Something isn't working), triage/support

Comments

@StanHatko
Contributor

In the past couple of days I've encountered GPU servers suddenly losing the GPU. This has very rarely occurred in the past, but yesterday and today it has been occurring very frequently and is making the GPU servers close to unusable.

It occurs in the following situation: if a process using the GPU exits (either normally at the end of the program or via ctrl-c) and a new task that uses the GPU starts, there's a good chance the GPU will no longer be available to the new task. An existing nvidia-smi -l 1 process will continue to run and report 0 GPU usage, but if it is terminated and restarted, nvidia-smi no longer works and generates the error shown in the screenshot.

[Screenshot: nvidia-smi error output after the GPU detaches]
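
Not part of the original report, but the same check can be done from Python instead of nvidia-smi. A minimal diagnostic sketch, run in a fresh process after the previous GPU job has exited (the single-GPU setup and device index 0 are assumptions):

# Diagnostic sketch (assumption: single-GPU server, device index 0).
# Run in a fresh Python process after the previous GPU job has exited.
import torch

if torch.cuda.is_available():
    print("CUDA devices visible:", torch.cuda.device_count())
    print("Device 0:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible -- the GPU appears to have detached.")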

StanHatko added the kind/bug and triage/support labels on Feb 29, 2024
@StanHatko
Contributor Author

Possible workaround I am testing today: open an ipython session, run the following, and leave it open in a separate terminal. The idea is to keep the GPU device in use (with a small tensor on the GPU) and prevent the GPU from detaching.

Code to run in ipython:

import torch

# Allocate a small tensor on the GPU and keep it resident, so the CUDA
# context stays open in this ipython process.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
x.device  # confirm the tensor is on cuda:0
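
A note on why this presumably helps (an assumption, not stated in the thread): the resident tensor keeps the CUDA context of the ipython process open, so the driver never fully releases the device between jobs; the looping variant later in this thread builds on the same idea.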

@StanHatko
Contributor Author

That workaround seems to have been working for me so far today.

@chuckbelisle
Contributor

Thanks for the update @StanHatko! I've added it to the AAW issue backlog and will assess it at a later date.

@StanHatko
Contributor Author

This workaround usually works (it always had for me until today), but on one server today the workaround failed and the GPU still detached. Hopefully failures with the workaround remain rare, but they can occur.

@StanHatko
Contributor Author

This workaround failed on another GPU server. It seems the workaround basically no longer works, at least today.

@StanHatko
Contributor Author

But after restarting those servers and not using the workaround, the GPU worked. So today it was inverted: problems occurred with the workaround but not without it (just using the server normally).

@StanHatko
Contributor Author

It occurred for me just now without the workaround being active (so it can occur in both cases), though today it seems less frequent when the workaround is not active.

@StanHatko
Contributor Author

StanHatko commented Mar 19, 2024

I'm currently trying the following modification to the workaround to keep the GPU device active and stop it from detaching. So far it seems to be working, but that could be a coincidence.

import time
import torch

# Keep a small tensor on the GPU and touch it every half second so the
# device stays in use instead of sitting idle.
d = torch.device('cuda:0')
x = torch.randn([4, 4]).to(d)
print(x.device)

with torch.no_grad():
    while True:
        x = x + 0.01
        time.sleep(0.5)
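
Not from the original comment, but the same loop could presumably be packaged as a standalone script rather than an interactive session, so it keeps running even if the terminal is closed (for example under nohup or inside tmux). A minimal sketch under that assumption; the file name gpu_keepalive.py and the 0.5 s interval are hypothetical:

# gpu_keepalive.py -- hypothetical standalone version of the keep-alive loop.
# Touches a small tensor on cuda:0 roughly twice a second so the device
# stays in use between real GPU jobs.
import time
import torch

def main():
    x = torch.randn([4, 4], device=torch.device('cuda:0'))
    print("keep-alive tensor on", x.device, flush=True)
    with torch.no_grad():
        while True:
            x = x + 0.01      # tiny GPU op to keep the CUDA context active
            time.sleep(0.5)   # negligible GPU/CPU load

if __name__ == "__main__":
    main()

It could then be left running in the background with, for example, nohup python gpu_keepalive.py &.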
