
Finer control over dask workers. #6344

Closed
pseudotensor opened this issue Nov 4, 2020 · 11 comments
pseudotensor (Contributor) commented Nov 4, 2020

Exception in gpu_hist: [12:40:37] /root/repo/xgboost/src/common/device_helpers.cu:64: Check failed: n_uniques == world (1 vs. 10) : Multiple processes within communication group running on same CUDA device is not supported
Stack trace:
  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dh::AllReducer::Init(int)+0x9b9) [0x1500ac26d9a9]
  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x12a) [0x1500ac3b10ea]
  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x26b) [0x1500ac3ba01b]
  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0xc0d) [0x1500ac0f980d]
  [bt] (4) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x10c) [0x1500ac0fb0ec]
  [bt] (5) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x39d) [0x1500ac13337d]
  [bt] (6) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x54) [0x1500ac02cac4]
  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x1501c379c630]
  [bt] (8) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x1501c379bfed]

Stack trace:
  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6a) [0x1500ac028b3a]
  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0xaae) [0x1500ac3ba85e]
  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0xc0d) [0x1500ac0f980d]
  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x10c) [0x1500ac0fb0ec]
  [bt] (4) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x39d) [0x1500ac13337d]
  [bt] (5) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x54) [0x1500ac02cac4]
  [bt] (6) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x1501c379c630]
  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x1501c379bfed]
  [bt] (8) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x1501c37b2f9e]

In a single-node setting one can use:

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

with LocalCUDACluster() as cluster:
    with Client(cluster) as client:
        ...  # run xgboost.dask training against this client

In a multi-node setting, NVIDIA/dask recommend using the CLI binary to launch one worker per GPU. That's fine, but in general with dask one might have other workers on the same node managing other resources.

However, xgboost always consumes all workers, and there appears to be no way to control this.

From the dask perspective, one can use resources to specify what each worker has, but these can only be requested for things like .compute() or client.submit(); they cannot be set on Client(), which only accesses the entire cluster.
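
For example, a minimal sketch of how per-task resource requests work (the scheduler address and process_partition are placeholders, not from this issue):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def process_partition(x):
    return x  # placeholder for real GPU work

# Only runs on workers that were started with a matching resource label,
# e.g. a worker launched with --resources "GPU=1".
future = client.submit(process_partition, 42, resources={"GPU": 1})
result = future.result()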

I'm not sure whether this problem belongs to xgboost or dask, but either way it's hard to manage having both CPUs and GPUs with dask and xgboost.

One can have two schedulers and groups of workers per node, but then this loses the point of "resources" in dask and of having a single scheduler manage them.

Any ideas?

pseudotensor (Contributor, Author) commented Nov 5, 2020

Other relevant context: even if one has a large GPU cluster, that doesn't mean one wants xgboost to use the entire system. This can be quite inefficient. When using .compute and .submit one can have dask limit resources, but xgboost offers no such thing, so the entire GPU cluster will be consumed.

If I'm confused and there is a way to manage this, please let me know, thanks!

In this context Client does have a way to set up a LocalCluster, and options like n_workers and threads_per_worker are passed to it (https://examples.dask.org/machine-learning/xgboost.html).

However, this does not control how many workers are used for an existing cluster (and it's only relevant for a single node).
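
For reference, a minimal sketch of that pattern (these keyword arguments only apply when Client creates its own LocalCluster, not when connecting to an existing one):

from dask.distributed import Client

# Client() with no address starts a LocalCluster, so the worker count
# and threads per worker can be chosen here.
client = Client(n_workers=2, threads_per_worker=4)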

pseudotensor (Contributor, Author):

In the past I imagine we would have used n_gpus for this. n_gpus is now deprecated, but it could have been used to control how many workers are used.

trivialfis (Member) commented Nov 5, 2020

The interface in xgboost does not support fine-grained control of resources. This PR might help: #6343. It will make xgboost run only on the nodes where the data resides. But further control through xgboost might not come, as right now xgboost does not move or copy any data between workers, which makes things a lot simpler.

pseudotensor (Contributor, Author):

Thanks @trivialfis. Understood; it's just that once one goes beyond the single-node DGX-like setup and into multi-node GPU usage, it's not at all practical in general to use absolutely every GPU across a cluster.

Co-locating computation and data is good, but dask_cudf doesn't allow control over the number of workers either. It just inherits whatever the client can access and so still uses all of them.

I understand this can be seen as a dask issue, but NVIDIA might be interested in figuring it out on the dask or xgboost side, so that more general clusters of GPUs can be used.

pseudotensor (Contributor, Author) commented Nov 5, 2020

Related
#6008
#4735

trivialfis (Member):

I'm not sure why you are seeing this error. It means multiple xgboost workers are running on the same device. The discussion that followed doesn't seem to be related to this particular error, and using dask_cuda should not reach such a state. Could you please show me how to reproduce it?

pseudotensor (Contributor, Author):

It's not dask_cuda that leads to this error, only xgboost.

As described in dask/dask#6805, I'm using the CLI to launch GPU workers, one per node via dask-cuda-worker, plus another set of CPU workers. As said there, I'm doing this because a GPU worker has 1 process with the resource label GPU, which is bad for CPU operations, so I also have 1 worker per process as a separate worker with the resource label CPU.

But xgboost only sees the entire cluster, so it mixes up the GPU and CPU workers because there is no control over which workers are used.

But despite that, even if I have a GPU cluster, I can't even control how many of those workers are used.

@trivialfis trivialfis changed the title Check failed: n_uniques == world (1 vs. 10) : Multiple processes within communication group running on same CUDA device is not supported Finer control over dask workers. Nov 5, 2020
trivialfis (Member):

I changed the title; feel free to change it again if you think it's not suitable.

But despite that, even if I have a GPU cluster, I can't even control how many of those workers are used.

That's a real issue; we need to come up with something we can recommend. cc @JohnZed

JohnZed (Contributor) commented Nov 5, 2020

The easiest Dask approach is usually to spin up a cluster with just the workers you want to use together. Fine-grained control of which workers run which tasks is possible by passing worker lists to compute and submit, but it would take a LOT of bookkeeping in all user and library code, so it's not something we support today. (Other Dask libraries like dask-ml also don't support this.)

Ultimately, I think this is more of a Dask question. If Dask provides a nice mechanism for partitioning clusters, I think XGBoost would be happy to support that, but it's not a well-supported paradigm right now.
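
As an illustrative sketch of what that bookkeeping looks like with plain dask.distributed (assuming workers were launched with a GPU resource label as above and that scheduler_info() reports each worker's resources; gpu_task and the scheduler address are placeholders):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Keep only the workers that advertise a GPU resource.
workers = client.scheduler_info()["workers"]
gpu_workers = [addr for addr, info in workers.items()
               if "GPU" in info.get("resources", {})]

def gpu_task(x):
    return x  # placeholder for real GPU work

# Pin the task to that subset of workers explicitly.
future = client.submit(gpu_task, 42, workers=gpu_workers)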

pseudotensor (Contributor, Author):

Spinning up just what one needs works for one-off single-user solutions in some Jupyter notebook, but it doesn't scale to general multi-user applications.

trivialfis (Member):

I will close this for now. As a downstream project, xgboost is largely orthogonal to the dask cluster; the client is the way xgboost talks to workers. Feel free to continue the discussion on dask/distributed and please keep me in the loop. ;-)
