
Raise error when setting communication protocols if only 1 GPU is used or if hardware is incapable #1066

Open
Joachimoe opened this issue Dec 14, 2022 · 1 comment

Comments

@Joachimoe

Hi,

I have been testing various set-ups using dask-cuda and ran into two different things; I am not sure whether they should be two separate issues. When reading through the documentation on spilling, https://docs.rapids.ai/api/dask-cuda/stable/spilling.html, it is unclear how CPU-GPU communication is done if UCX is explicitly chosen as the communication protocol. Will workers write to disk via UCX if UCX is chosen? I suppose so.
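
For reference, a minimal sketch of the kind of configuration I have in mind, assuming the documented LocalCUDACluster keyword arguments protocol and device_memory_limit (this is an illustration, not my exact script):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# UCX as the communication protocol, with spilling from device to host
# memory once a worker holds roughly 2 GB on the GPU.
cluster = LocalCUDACluster(protocol="ucx",
                           device_memory_limit="2GB")
client = Client(cluster)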

Secondly, I am working on a local machine with an Intel i7-6700K CPU and a GeForce RTX 2070. The following piece of code executes with no problem:

import sys

import rmm
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cupyx.profiler import benchmark  # assumed source of benchmark()

# read(), bench() and parse_cupy() are helpers defined elsewhere in my script.

if __name__ == '__main__':
    file = sys.argv[1]
    cluster = LocalCUDACluster('0',            # CUDA_VISIBLE_DEVICES
                               n_workers=1,
                               enable_nvlink=True,
                               rmm_pool_size="2GB")
    client = Client(cluster)                   # we create a local cluster here
    rmm.reinitialize(managed_memory=True)

    f = read(file)
    y = benchmark(bench, (f,), n_repeat=3, n_warmup=1)
    print(parse_cupy(y))
    client.restart()

Although the CPU and GPU only have PCIe links (no NVLink), no errors are raised when setting enable_nvlink to True, even though only one worker is specifically set. Therefore, I was at least a bit misled. I understand that this problem is maybe on my part, but an error message would have been nice :-)
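
To make the request concrete, here is a rough sketch of the kind of pre-flight check that could raise such an error. It only counts visible GPUs via pynvml (which, as far as I know, dask-cuda already depends on); the function name validate_nvlink_request is hypothetical and not part of any existing API:

import pynvml

def validate_nvlink_request(enable_nvlink):
    # Hypothetical fail-fast check: NVLink requires at least two GPUs,
    # so refuse the request when fewer are visible on this machine.
    if not enable_nvlink:
        return
    pynvml.nvmlInit()
    try:
        gpu_count = pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()
    if gpu_count < 2:
        raise ValueError(
            f"enable_nvlink=True requires at least two GPUs; found {gpu_count}"
        )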

@pentschev
Member

When reading through the documentation on spilling, https://docs.rapids.ai/api/dask-cuda/stable/spilling.html, it is unclear how CPU-GPU communication is done if UCX is explicitly chosen as the communication protocol. Will workers write to disk via UCX if UCX is chosen? I suppose so.

Communication between the CPU and GPU(s) on the same host always goes through PCIe, independently of whether UCX is used or not, since that is the "closest" path.

Although the CPU and GPU only have PCIe links (no NVLink), no errors are raised when setting enable_nvlink to True, even though only one worker is specifically set. Therefore, I was at least a bit misled. I understand that this problem is maybe on my part, but an error message would have been nice :-)

NVLink only exists between multiple GPUs, and only if there is an NVLink bridge connecting them, so it does not apply anywhere else. No error is raised because enable_nvlink=True only tells UCX to enable NVLink and use it when available. It is also now recommended to use automatic UCX configuration.
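
As a rough illustration of that last point, a minimal sketch under the assumption that passing protocol="ucx" is enough and the individual transport flags are best left for UCX to negotiate automatically:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Let dask-cuda/UCX detect and enable the transports (NVLink, InfiniBand, ...)
# that the hardware actually supports, instead of forcing them by hand.
cluster = LocalCUDACluster(protocol="ucx", n_workers=1)
client = Client(cluster)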
