test_cuda_visible_devices_and_memory_limit_and_nthreads spews benign (?) warnings on systems with fewer than eight GPUs #1127

Open
wence- opened this issue Feb 23, 2023 · 3 comments

@wence- (Contributor) commented Feb 23, 2023

> it doesn't look like it failed any tests though. Is this a problem?

This looks bad to me. I wonder whether it is happening on other branches as well; I will investigate.

On closer examination, this error in the test logs comes from this test:

```python
@patch.dict(os.environ, {"CUDA_VISIBLE_DEVICES": "0,3,7,8"})
def test_cuda_visible_devices_and_memory_limit_and_nthreads(loop):  # noqa: F811
    nthreads = 4
    with popen(["dask", "scheduler", "--port", "9359", "--no-dashboard"]):
        with popen(
            [
                "dask",
                "cuda",
                "worker",
                "127.0.0.1:9359",
                "--host",
                "127.0.0.1",
                "--device-memory-limit",
                "1 MB",
                "--nthreads",
                str(nthreads),
                "--no-dashboard",
                "--worker-class",
                "dask_cuda.utils.MockWorker",
            ]
        ):
            with Client("127.0.0.1:9359", loop=loop) as client:
                assert wait_workers(client, n_gpus=4)

                def get_visible_devices():
                    return os.environ["CUDA_VISIBLE_DEVICES"]

                # verify 4 workers with the 4 expected CUDA_VISIBLE_DEVICES
                result = client.run(get_visible_devices)
                expected = {"0,3,7,8": 1, "3,7,8,0": 1, "7,8,0,3": 1, "8,0,3,7": 1}
                for v in result.values():
                    del expected[v]

                workers = client.scheduler_info()["workers"]
                for w in workers.values():
                    assert w["memory_limit"] == MEMORY_LIMIT // len(workers)
```

The test is written assuming eight GPUs are available on the system running it. So I think these problems in the logs are benign, but I will open a separate PR to fix this for the 23.04 branch.
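As an aside (my illustration, not part of the original test): the four strings in `expected` appear to be the patched device list rotated once per worker, so that each worker's own device comes first. A quick standalone check:

```python
# Illustration only: rotate the patched CUDA_VISIBLE_DEVICES list once per
# worker; the resulting set matches the `expected` dict in the test above.
devices = ["0", "3", "7", "8"]
rotations = {",".join(devices[i:] + devices[:i]) for i in range(len(devices))}
print(sorted(rotations))  # ['0,3,7,8', '3,7,8,0', '7,8,0,3', '8,0,3,7']
```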

Originally posted by @wence- in #1123 (comment)

@wence- self-assigned this Feb 23, 2023
@pentschev (Member) commented

This is intentional: we're testing the ability to set CUDA_VISIBLE_DEVICES appropriately for each worker. As noted, the downside is those errors, which are harmless when fewer GPUs are available; the test logic still succeeds nevertheless. We could eventually attempt to capture and suppress the warnings instead, but other than that we don't want to remove the test or dumb it down just to prevent the stdout/stderr output.

@wence- (Contributor, Author) commented Feb 23, 2023

My proposal would be to try and capture the warnings if the number of devices is fewer than required (rather than dumbing down the test).
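A minimal sketch of that proposal, assuming pytest's `capfd` fixture (which captures at the file-descriptor level, so it also sees output from the scheduler/worker subprocesses) and `pynvml` for the physical device count; whether this fully silences the log depends on how the suite's `popen` helper forwards subprocess output, so treat it as a starting point rather than the actual fix:

```python
import pynvml


def physical_gpu_count() -> int:
    # NVML does not honour CUDA_VISIBLE_DEVICES, so this reports how many GPUs
    # the machine really has, not the patched "0,3,7,8" list.
    pynvml.nvmlInit()
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()


def test_cuda_visible_devices_and_memory_limit_and_nthreads(loop, capfd):  # noqa: F811
    ...  # existing scheduler / worker / Client body stays exactly as above

    if physical_gpu_count() < 8:
        # Fewer GPUs than CUDA_VISIBLE_DEVICES pretends: the "nonexistent
        # device" warnings are expected, so drain the captured stderr instead
        # of letting it spew into the log. Genuine assertion failures above
        # are unaffected.
        capfd.readouterr()
```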

@pentschev (Member) commented

Yes, that would be ideal in this situation.
