Client() control scope of resources #6805

Open
pseudotensor opened this issue Nov 4, 2020 · 4 comments
Comments

pseudotensor commented Nov 4, 2020

Currently I cannot see a way to manage both GPUs and CPUs using dask "resources".

Suppose I have 2 machines, each with 1 GPU and 8 cores. I can set resource labels for GPU and CPU counts, but the dask_cudf/RAPIDS side requires 1 worker per GPU. For any CPU tasks I am then forced into 1 worker process for the entire node. I can set nthreads to the number of cores divided by the number of GPUs. For numpy/pandas/scipy operations that release the GIL that might be roughly OK, but as the docs say, this is not optimal when something does not release the GIL, since the extra threads then only help with I/O.

If I instead add extra workers just for CPU resources and keep other workers just for GPU resources, that does not work either. Packages like xgboost automatically consume all workers, because the only way to restrict resources currently is through .compute() or client.submit(), and the client itself apparently cannot be restricted. One hits dmlc/xgboost#6344 because of this. If I could call Client() with a resources= argument so that any use of that client would be limited to those resources, as .compute() and .submit() already allow per call, that would work. But AFAIK no such option exists, even though it would be more general than the current scheme of only supporting resources on .compute(), .submit(), etc.
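For concreteness, the per-call mechanism described above can be sketched as follows; a minimal example on an in-process LocalCluster, where the train function and the resource labels are illustrative rather than taken from the issue:

```python
# Sketch of the current per-call resource mechanism: resources can be
# attached to individual .submit()/.compute() calls, but not to the
# Client itself.
from dask.distributed import Client, LocalCluster

# One in-process worker advertising abstract "GPU" and "CPU" resources.
cluster = LocalCluster(n_workers=1, processes=False,
                       resources={"GPU": 1, "CPU": 8})
client = Client(cluster)

def train(x):
    return x * 2

# Resource restriction works here, one call at a time...
fut = client.submit(train, 21, resources={"GPU": 1})
result = fut.result()
print(result)  # 42

# ...but a library that calls client.submit internally (e.g. xgboost's
# dask integration) offers no hook to pass resources, which is the gap
# this issue describes.
client.close()
cluster.close()
```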

I could run 2 schedulers, one per resource type, but that defeats the purpose of scheduling and resource management. E.g. I know that xgboost on GPU uses the GPU efficiently and little CPU, so I could run CPU and GPU work roughly at the same time. But other packages like lightgbm use a lot of CPU even when running on GPU, so such a split of scheduling would drag the system to a crawl alongside other dask tasks.

Basically, the request is for Client() itself to accept a resources request that limits the scope of all dask tasks, not just the ones that take explicit resources like .compute() and client.submit(), since we often do not use dask in such a fine-grained way; with the xgboost package, for example, we have no such control.
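The requested behaviour can be approximated today with a thin wrapper. ResourceScopedClient below is a hypothetical name, not a real dask class, and it only helps for code paths that go through the wrapper; a library holding the underlying client (xgboost, say) still bypasses it, which is exactly why the issue asks for support in Client() itself.

```python
from dask.distributed import Client, LocalCluster

class ResourceScopedClient:
    """Hypothetical sketch: delegate to a dask Client, defaulting
    `resources=` on every submit()/compute() call."""
    def __init__(self, client, resources):
        self._client = client
        self._resources = resources

    def submit(self, func, *args, **kwargs):
        kwargs.setdefault("resources", self._resources)
        return self._client.submit(func, *args, **kwargs)

    def compute(self, *args, **kwargs):
        kwargs.setdefault("resources", self._resources)
        return self._client.compute(*args, **kwargs)

    def __getattr__(self, name):
        # Everything else falls through to the real client,
        # which is precisely where the restriction is lost.
        return getattr(self._client, name)

cluster = LocalCluster(n_workers=1, processes=False, resources={"CPU": 8})
scoped = ResourceScopedClient(Client(cluster), {"CPU": 4})
result = scoped.submit(lambda x: x + 1, 41).result()
print(result)  # 42
scoped.close()
cluster.close()
```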

pseudotensor (Author) commented

I'm curious whether one work-around is to use client.submit() with resource control and then do the dask operations inside that call. It seems a bit unwieldy, but it might work.
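This work-around can be sketched with distributed's worker_client(), which lets a task secede from its worker's thread pool and submit further tasks from inside the cluster. A minimal sketch on an in-process cluster; note that the inner tasks do not inherit the outer task's resource restriction, which is the unwieldy part.

```python
from dask.distributed import Client, LocalCluster, worker_client

def outer():
    # Runs on the worker holding the GPU resource; worker_client()
    # secedes from the thread pool while inner tasks are scheduled,
    # then rejoins.
    with worker_client() as inner:
        futs = inner.map(lambda x: x * x, range(4))
        return sum(inner.gather(futs))

cluster = LocalCluster(n_workers=1, processes=False, resources={"GPU": 1})
client = Client(cluster)
# Only the outer task is resource-restricted; the inner tasks it
# launches are not.
total = client.submit(outer, resources={"GPU": 1}).result()
print(total)  # 0 + 1 + 4 + 9 = 14
client.close()
cluster.close()
```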

sjperkins (Member) commented Nov 5, 2020

@pseudotensor This is probably a good use case for task annotations. We've made some progress towards adding them already. The remaining work is to transmit and unpack them on the distributed scheduler.


pseudotensor (Author) commented Nov 5, 2020

@sjperkins Thanks. It seems (to me, naively) that just providing resources to Client() is more straightforward for the user than extra contextual annotations. That is, one can already use the client as a context manager, and if one could pass priority/resources/etc. to Client(), isn't that enough? Why does there need to be yet another context manager?

I'll check out the docs you pointed to and see if I can understand. It's not clear to me what is available now vs. still in development.
