
Finer control over dask workers. #6344

Closed
pseudotensor opened this issue Nov 4, 2020 · 11 comments
pseudotensor (Contributor) commented Nov 4, 2020

Exception in gpu_hist: [12:40:37] /root/repo/xgboost/src/common/device_helpers.cu:64: Check failed: n_uniques == world (1 vs. 10) : Multiple processes within communication group running on same CUDA device is not supported
Stack trace:
  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dh::AllReducer::Init(int)+0x9b9) [0x1500ac26d9a9]
  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::InitDataOnce(xgboost::DMatrix*)+0x12a) [0x1500ac3b10ea]
  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0x26b) [0x1500ac3ba01b]
  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0xc0d) [0x1500ac0f980d]
  [bt] (4) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x10c) [0x1500ac0fb0ec]
  [bt] (5) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x39d) [0x1500ac13337d]
  [bt] (6) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x54) [0x1500ac02cac4]
  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x1501c379c630]
  [bt] (8) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x1501c379bfed]

Stack trace:
  [bt] (0) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6a) [0x1500ac028b3a]
  [bt] (1) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::tree::GPUHistMakerSpecialised<xgboost::detail::GradientPairInternal<double> >::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, std::vector<xgboost::RegTree*, std::allocator<xgboost::RegTree*> > const&)+0xaae) [0x1500ac3ba85e]
  [bt] (2) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::DMatrix*, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_delete<xgboost::RegTree> > > >*)+0xc0d) [0x1500ac0f980d]
  [bt] (3) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix*, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*, xgboost::PredictionCacheEntry*)+0x10c) [0x1500ac0fb0ec]
  [bt] (4) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(xgboost::LearnerImpl::UpdateOneIter(int, std::shared_ptr<xgboost::DMatrix>)+0x39d) [0x1500ac13337d]
  [bt] (5) /home/jon/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x54) [0x1500ac02cac4]
  [bt] (6) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x1501c379c630]
  [bt] (7) /home/jon/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x1501c379bfed]
  [bt] (8) /home/jon/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x1501c37b2f9e]

In a single-node setting one can use:

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

with LocalCUDACluster() as cluster:
    with Client(cluster) as client:
        ...  # run xgboost.dask training against this client

In a multi-node setting, NVIDIA/dask recommend using the CLI binary to launch one worker per GPU. That's fine, but in general with dask one might have other workers on the same node managing other resources.

However, xgboost always consumes all workers, and there appears to be no way to control this.

From the dask perspective, one can use resources to specify what each worker has, but these can only be requested for things like .compute() or client.submit(); they cannot be set on Client(), which only accesses the entire cluster.
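
For example, a minimal sketch of how per-task resource requests work (the scheduler address and process_partition are placeholders, not from this issue):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def process_partition(x):
    return x  # placeholder for real GPU work

# Only runs on workers that were started with a matching resource label,
# e.g. a worker launched with --resources "GPU=1".
future = client.submit(process_partition, 42, resources={"GPU": 1})
result = future.result()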

I'm not sure whether this problem belongs to xgboost or dask, but either way it's hard to manage having both CPUs and GPUs with dask and xgboost.

One can have two schedulers and groups of workers per node, but then this loses the point of "resources" in dask and of having a single scheduler manage them.

Any ideas?

pseudotensor (Contributor, Author) commented Nov 5, 2020

Other relevant context: even if one has a large GPU cluster, that doesn't mean one wants xgboost to use the entire system. This can be quite inefficient. When using .compute and .submit one can have dask limit resources, but xgboost offers no such thing, so the entire GPU cluster will be consumed.

If I'm confused and there is a way to manage this, please let me know, thanks!

In this context Client does have a way to set up a LocalCluster, and options like n_workers and threads_per_worker are passed to it (https://examples.dask.org/machine-learning/xgboost.html).

However, this does not control how many workers are used for an existing cluster (and it's only relevant for a single node).
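
For reference, a minimal sketch of that pattern (these keyword arguments only apply when Client creates its own LocalCluster, not when connecting to an existing one):

from dask.distributed import Client

# Client() with no address starts a LocalCluster, so the worker count
# and threads per worker can be chosen here.
client = Client(n_workers=2, threads_per_worker=4)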

pseudotensor (Contributor, Author):

In the past I imagine we would have used n_gpus for this. n_gpus is now deprecated, but it could have been used to control how many workers are used.

trivialfis (Member) commented Nov 5, 2020

The interface in xgboost does not support fine-grained control of resources. This PR might help: #6343. It will make xgboost run only on the nodes where the data resides. But further control through xgboost might not come, as right now xgboost does not move or copy any data between workers, which makes things a lot simpler.

pseudotensor (Contributor, Author):

Thanks @trivialfis. Understood; it's just that once one goes beyond the single-node DGX-like setup and into multi-node GPU usage, it's not at all practical in general to use absolutely every GPU across a cluster.

Co-locating computation and data is good, but dask_cudf doesn't allow control over the number of workers either. It just inherits whatever the client can access and so still uses all of them.

I understand this can be seen as a dask issue, but NVIDIA might be interested in figuring it out on the dask or xgboost side, so that more general clusters of GPUs can be used.

pseudotensor (Contributor, Author) commented Nov 5, 2020

Related
#6008
#4735

trivialfis (Member):

I'm not sure why you are seeing this error. It means multiple xgboost workers are running on the same device. The discussion that followed doesn't seem to be related to this particular error, and using dask_cuda should not reach such a state. Could you please show me how to reproduce it?

pseudotensor (Contributor, Author):

It's not dask_cuda that leads to this error, only xgboost.

As described in dask/dask#6805, I'm using the CLI to launch GPU workers, one per node via dask-cuda-worker, plus another set of CPU workers. As said there, I'm doing this because a GPU worker has 1 process with the resource label GPU, which is bad for CPU operations, so I also have 1 worker per process as a separate worker with the resource label CPU.

But xgboost only sees the entire cluster, so it mixes up the GPU and CPU workers because there is no control over which workers are used.

But despite that, even if I have a GPU cluster, I can't even control how many of those workers are used.

@trivialfis trivialfis changed the title Check failed: n_uniques == world (1 vs. 10) : Multiple processes within communication group running on same CUDA device is not supported Finer control over dask workers. Nov 5, 2020
trivialfis (Member):

I changed the title; feel free to change it again if you think it's not suitable.

But despite that, even if I have a GPU cluster, I can't even control how many of those workers are used.

That's a real issue; we need to come up with something we can recommend. cc @JohnZed

JohnZed (Contributor) commented Nov 5, 2020

The easiest Dask approach is usually to spin up a cluster with just the workers you want to use together. Fine-grained control of which workers run which tasks is possible by passing worker lists to compute and submit, but it would take a LOT of bookkeeping in all user and library code, so it's not something we support today. (Other Dask libraries like dask-ml also don't support this.)

Ultimately, I think this is more of a Dask question. If Dask provides a nice mechanism for partitioning clusters, I think XGBoost would be happy to support that, but it's not a well-supported paradigm right now.
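
As an illustrative sketch of what that bookkeeping looks like with plain dask.distributed (assuming workers were launched with a GPU resource label as above and that scheduler_info() reports each worker's resources; gpu_task and the scheduler address are placeholders):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Keep only the workers that advertise a GPU resource.
workers = client.scheduler_info()["workers"]
gpu_workers = [addr for addr, info in workers.items()
               if "GPU" in info.get("resources", {})]

def gpu_task(x):
    return x  # placeholder for real GPU work

# Pin the task to that subset of workers explicitly.
future = client.submit(gpu_task, 42, workers=gpu_workers)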

pseudotensor (Contributor, Author):

Spinning up just what one needs works for one-off single-user solutions in some Jupyter notebook, but it doesn't scale to general multi-user applications.

trivialfis (Member):

I will close this for now. As a downstream project, xgboost is largely orthogonal to the dask cluster; the client is the way xgboost talks to workers. Feel free to continue the discussion on dask/distributed and please keep me in the loop. ;-)
