[dask] xgb.dask.train() fails on dask-kubernetes cluster #6390

Closed
jameslamb opened this issue Nov 13, 2020 · 4 comments

@jameslamb
Contributor

I tried tonight to test the recent xgboost.dask changes on a dask-kubernetes cluster on EKS (per #6343 (comment)).

Unfortunately, I ran into this error:

AttributeError: /opt/conda/envs/saturn/lib/libxgboost.so: undefined symbol: XGDMatrixSetDenseInfo

Reproduction Information

training code

I omitted the code I used to create my client (...[CLIENT CODE]...) because it uses a dask-kubernetes cluster provisioned with a commercial product. I can see that work is getting scheduled onto that cluster when the DaskDMatrix is set up and when training starts, so I'm confident that that isn't the issue.

import os
import time

import dask.array as da
import xgboost as xgb

from dask.distributed import Client, wait
from dask_ml.metrics import mean_absolute_error
from dask_saturn import SaturnCluster

.....[CLIENT CODE].....

num_obs = int(1e5)  # array shapes should be integers, not floats
num_features = 50

X = da.random.random(
    size=(num_obs, num_features),
    chunks=(1000, num_features)
)
y = da.random.random(
    size=(num_obs, 1),
    chunks=(1000, 1)
)

X = X.persist()
_ = wait(X)

y = y.persist()
_ = wait(y)

dtrain = xgb.dask.DaskDMatrix(
    client=client,
    data=X,
    label=y
)

bst = xgb.dask.train(
    client=client,
    params={
        "verbosity": 2,
        "tree_method": "hist",
        "objective": "reg:squarederror"
    },
    dtrain=dtrain,
    num_boost_round=10,
)

I installed xgboost by cloning from latest master (https://github.com/dmlc/xgboost/tree/fcfeb4959c6e361f2fd1cd18c3b61b598dc205ae).

sudo apt update
sudo apt-get install -y cmake build-essential
git clone https://github.com/dmlc/xgboost.git /tmp/xgboost
pushd /tmp/xgboost/python-package
    git submodule init
    git submodule update
    python setup.py install
popd
full stacktrace
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-462f078606cf> in <module>
      8     dtrain=dtrain,
      9     num_boost_round=10,
---> 10     evals=[(dtrain, 'train')]
     11 )

/srv/conda/envs/saturn/lib/python3.7/site-packages/xgboost/dask.py in train(client, params, dtrain, evals, early_stopping_rounds, *args, **kwargs)
    742     return client.sync(
    743         _train_async, client, params, dtrain=dtrain, *args, evals=evals,
--> 744         early_stopping_rounds=early_stopping_rounds, **kwargs)
    745 
    746 

/srv/conda/envs/saturn/lib/python3.7/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    831         else:
    832             return sync(
--> 833                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    834             )
    835 

/srv/conda/envs/saturn/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    338     if error[0]:
    339         typ, exc, tb = error[0]
--> 340         raise exc.with_traceback(tb)
    341     else:
    342         return result[0]

/srv/conda/envs/saturn/lib/python3.7/site-packages/distributed/utils.py in f()
    322             if callback_timeout is not None:
    323                 future = asyncio.wait_for(future, callback_timeout)
--> 324             result[0] = yield future
    325         except Exception as exc:
    326             error[0] = sys.exc_info()

/srv/conda/envs/saturn/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/srv/conda/envs/saturn/lib/python3.7/site-packages/xgboost/dask.py in _train_async(client, params, dtrain, evals, early_stopping_rounds, *args, **kwargs)
    705         futures.append(f)
    706 
--> 707     results = await client.gather(futures)
    708     return list(filter(lambda ret: ret is not None, results))[0]
    709 

/srv/conda/envs/saturn/lib/python3.7/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1849                             exc = CancelledError(key)
   1850                         else:
-> 1851                             raise exception.with_traceback(traceback)
   1852                         raise exc
   1853                     if errors == "skip":

/srv/conda/envs/saturn/lib/python3.7/site-packages/xgboost/dask.py in dispatched_train()
    654         worker = distributed.get_worker()
    655         with RabitContext(rabit_args):
--> 656             local_dtrain = _dmatrix_from_list_of_parts(**dtrain_ref)
    657             local_evals = []
    658             if evals_ref:

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/dask.py in _dmatrix_from_list_of_parts()
    605     if is_quantile:
    606         return _create_device_quantile_dmatrix(**kwargs)
--> 607     return _create_dmatrix(**kwargs)
    608 
    609 

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/dask.py in _create_dmatrix()
    595                       feature_names=feature_names,
    596                       feature_types=feature_types,
--> 597                       nthread=worker.nthreads)
    598     dmatrix.set_info(base_margin=base_margin, weight=weights,
    599                      label_lower_bound=label_lower_bound,

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/core.py in __init__()
    506         self.handle = handle
    507 
--> 508         self.set_info(label=label, weight=weight, base_margin=base_margin)
    509 
    510         self.feature_names = feature_names

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/core.py in inner_f()
    419         for k, arg in zip(sig.parameters, args):
    420             kwargs[k] = arg
--> 421         return f(**kwargs)
    422 
    423     return inner_f

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/core.py in set_info()
    527         '''Set meta info for DMatrix.'''
    528         if label is not None:
--> 529             self.set_label(label)
    530         if weight is not None:
    531             self.set_weight(weight)

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/core.py in set_label()
    656         """
    657         from .data import dispatch_meta_backend
--> 658         dispatch_meta_backend(self, label, 'label', 'float')
    659 
    660     def set_weight(self, weight):

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/data.py in dispatch_meta_backend()
    663         return
    664     if _is_numpy_array(data):
--> 665         _meta_from_numpy(data, name, dtype, handle)
    666         return
    667     if _is_pandas_df(data):

/opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/data.py in _meta_from_numpy()
    597     ptr = interface['data'][0]
    598     ptr = ctypes.c_void_p(ptr)
--> 599     _check_call(_LIB.XGDMatrixSetDenseInfo(
    600         handle,
    601         c_str(field),

/opt/conda/envs/saturn/lib/python3.7/ctypes/__init__.py in __getattr__()
    375         if name.startswith('__') and name.endswith('__'):
    376             raise AttributeError(name)
--> 377         func = self.__getitem__(name)
    378         setattr(self, name, func)
    379         return func

/opt/conda/envs/saturn/lib/python3.7/ctypes/__init__.py in __getitem__()
    380 
    381     def __getitem__(self, name_or_ordinal):
--> 382         func = self._FuncPtr((name_or_ordinal, self))
    383         if not isinstance(name_or_ordinal, int):
    384             func.__name__ = name_or_ordinal

AttributeError: /opt/conda/envs/saturn/lib/libxgboost.so: undefined symbol: XGDMatrixSetDenseInfo
output of conda info
     active environment : saturn
    active env location : /opt/conda/envs/saturn
            shell level : 2
       user config file : /home/jovyan/.condarc
 populated config files : /opt/conda/.condarc
          conda version : 4.8.2
    conda-build version : not installed
         python version : 3.7.7.final.0
       virtual packages : __glibc=2.28
       base environment : /opt/conda  (writable)
           channel URLs : https://conda.saturncloud.io/pkgs/linux-64
                          https://conda.saturncloud.io/pkgs/noarch
                          https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /opt/conda/pkgs
                          /home/jovyan/.conda/pkgs
       envs directories : /opt/conda/envs
                          /home/jovyan/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.7 Linux/4.14.193-149.317.amzn2.x86_64 debian/10 glibc/2.28
                UID:GID : 1000:100
             netrc file : None
           offline mode : False

Other Notes

I'll try to come up with an example using dask-cloudprovider so that it's 100% reproducible (no redacted code).

@trivialfis
Member

I think you have an outdated libxgboost.so. Could you please check your image?
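
One way to check this diagnosis is to ask the shared library directly whether it exports the symbol named in the error. The sketch below uses only the standard-library ctypes module. The helper name exports_symbol is hypothetical (it is not part of xgboost), and the path in the usage comment is the one from the traceback above. Loading a library has side effects, so this is probably best run in a fresh interpreter rather than one that has already imported xgboost.

```python
import ctypes


def exports_symbol(lib_path, symbol):
    """Return True if the shared library at lib_path exports symbol.

    Hypothetical helper, not part of xgboost. ctypes resolves symbols lazily,
    and a missing symbol raises AttributeError, which hasattr() reports as False.
    """
    lib = ctypes.CDLL(lib_path)  # raises OSError if the file cannot be loaded
    return hasattr(lib, symbol)


# Example (path taken from the traceback; adjust for your environment):
# exports_symbol("/opt/conda/envs/saturn/lib/libxgboost.so", "XGDMatrixSetDenseInfo")
```

If this returns False for a symbol that the installed Python package calls, the shared library is likely older than the Python code that loads it.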

@jameslamb
Contributor Author

Yep, you're right. I ran find / -name "libxgboost.so" and found that even the following doesn't remove an old libxgboost.so that I have on PATH:

conda uninstall -y xgboost
pip uninstall -y xgboost

I'll remove the old library from my image and try again, thanks.
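
For anyone who wants to run the same search from Python (for example inside a notebook where a shell is awkward), here is a minimal sketch of the find command above. The function name find_library_copies is hypothetical, and searching from "/" can be slow and may skip directories the process cannot read.

```python
import os


def find_library_copies(root, filename="libxgboost.so"):
    """Walk root and return the sorted paths of every file named filename."""
    matches = []
    # os.walk silently skips directories it cannot list
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


# Example: list every copy under the filesystem root, like `find / -name ...`
# for path in find_library_copies("/"):
#     print(path)
```

More than one result means an uninstall may have left a copy behind, and any leftover copy is a candidate for the stale library.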

@trivialfis
Member

Thanks for testing! Feel free to let me know if there's anything I can help with. I will close this one now as this specific issue is resolved.

@jameslamb
Contributor Author

jameslamb commented Nov 13, 2020

@trivialfis I'm very happy to tell you that after I was able to clear out my old libxgboost.so files, training worked on dask-kubernetes + EKS! (using the code snippet I shared above).

Thanks for all the great work!!! 🎉
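
Since the stale library in this thread lived in the worker image, it can help to run the same kind of check on every worker. The following is a rough sketch, assuming a dask.distributed Client: Client.run executes a function on each worker and returns a dict keyed by worker address. The function names describe_xgboost_install and describe_cluster are hypothetical, and the report only lists files; it does not load any library.

```python
import importlib.util
import os
import sys


def describe_xgboost_install():
    """Report where this process would import xgboost from and which
    libxgboost.so files sit under its environment prefix."""
    spec = importlib.util.find_spec("xgboost")  # locates the package without importing it
    libs = []
    for dirpath, _dirnames, filenames in os.walk(sys.prefix):
        if "libxgboost.so" in filenames:
            libs.append(os.path.join(dirpath, "libxgboost.so"))
    return {
        "prefix": sys.prefix,
        "package": spec.origin if spec is not None else None,
        "libraries": sorted(libs),
    }


def describe_cluster(client):
    """Run the report on every worker; returns {worker_address: report}."""
    return client.run(describe_xgboost_install)
```

Comparing the per-worker reports with the result of calling describe_xgboost_install() on the client machine should show whether any worker still carries an extra or differently located libxgboost.so.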
