
Raise error when setting communication protocols if only 1 GPU is used or if hardware is incapable #1066

Open
Joachimoe opened this issue Dec 14, 2022 · 1 comment

Comments

@Joachimoe

Hi,

I have been testing various set-ups using dask-cuda and ran into two different things; I am not sure whether they should be two separate issues. When reading through the documentation on spilling, https://docs.rapids.ai/api/dask-cuda/stable/spilling.html, it is unclear how CPU-GPU communication is done if UCX is explicitly chosen as the communication protocol. Will workers write to disk via UCX if UCX is chosen? I suppose so.
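
For reference, a minimal sketch of the kind of configuration I have in mind, assuming the documented LocalCUDACluster keyword arguments protocol and device_memory_limit (this is an illustration, not my exact script):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# UCX as the communication protocol, with spilling from device to host
# memory once a worker holds roughly 2 GB on the GPU.
cluster = LocalCUDACluster(protocol="ucx",
                           device_memory_limit="2GB")
client = Client(cluster)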

Secondly, I am working on a local machine with an Intel i7-6700K CPU and a GeForce RTX 2070. The following piece of code executes with no problem:

import sys

import rmm
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cupyx.profiler import benchmark  # assumed source of benchmark()

# read(), bench() and parse_cupy() are helpers defined elsewhere in my script.

if __name__ == '__main__':
    file = sys.argv[1]
    cluster = LocalCUDACluster('0',            # CUDA_VISIBLE_DEVICES
                               n_workers=1,
                               enable_nvlink=True,
                               rmm_pool_size="2GB")
    client = Client(cluster)                   # we create a local cluster here
    rmm.reinitialize(managed_memory=True)

    f = read(file)
    y = benchmark(bench, (f,), n_repeat=3, n_warmup=1)
    print(parse_cupy(y))
    client.restart()

Although the CPU and GPU only have PCIe links (no NVLink), no errors are raised when setting enable_nvlink to True, even though only one worker is specifically set. Therefore, I was at least a bit misled. I understand that this problem is maybe on my part, but an error message would have been nice :-)
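
To make the request concrete, here is a rough sketch of the kind of pre-flight check that could raise such an error. It only counts visible GPUs via pynvml (which, as far as I know, dask-cuda already depends on); the function name validate_nvlink_request is hypothetical and not part of any existing API:

import pynvml

def validate_nvlink_request(enable_nvlink):
    # Hypothetical fail-fast check: NVLink requires at least two GPUs,
    # so refuse the request when fewer are visible on this machine.
    if not enable_nvlink:
        return
    pynvml.nvmlInit()
    try:
        gpu_count = pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()
    if gpu_count < 2:
        raise ValueError(
            f"enable_nvlink=True requires at least two GPUs; found {gpu_count}"
        )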

@pentschev
Member

When reading through the documentation on spilling, https://docs.rapids.ai/api/dask-cuda/stable/spilling.html, it is unclear how CPU-GPU communication is done if UCX is explicitly chosen as the communication protocol. Will workers write to disk via UCX if UCX is chosen? I suppose so.

Communication between the CPU and GPU(s) on the same host always goes through PCIe, independently of whether UCX is used or not, since that is the "closest" path.

Although the CPU and GPU only have PCIe links (no NVLink), no errors are raised when setting enable_nvlink to True, even though only one worker is specifically set. Therefore, I was at least a bit misled. I understand that this problem is maybe on my part, but an error message would have been nice :-)

NVLink only exists between multiple GPUs, and only if there is an NVLink bridge connecting them, so it does not apply anywhere else. No error is raised because enable_nvlink=True only tells UCX to enable NVLink and use it when available. It is also now recommended to use automatic UCX configuration.
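
As a rough illustration of that last point, a minimal sketch under the assumption that passing protocol="ucx" is enough and the individual transport flags are best left for UCX to negotiate automatically:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Let dask-cuda/UCX detect and enable the transports (NVLink, InfiniBand, ...)
# that the hardware actually supports, instead of forcing them by hand.
cluster = LocalCUDACluster(protocol="ucx", n_workers=1)
client = Client(cluster)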
