This library adds a new backend for XGBoost utilizing Ray.
Please note that this is an early version and both the API and the behavior can change without prior notice.
You can install XGBoost on Ray (``xgboost_ray``) like this:

.. code-block:: bash

    git clone https://github.com/ray-project/xgboost_ray.git
    cd xgboost_ray
    pip install -e .
After installation, you can import XGBoost on Ray in one of two ways:

.. code-block:: python

    import xgboost_ray
    # or
    import ray.util.xgboost
``xgboost_ray`` provides a drop-in replacement for XGBoost's ``train``
function. To pass data, instead of using ``xgb.DMatrix`` you will
have to use ``ray.util.xgboost.RayDMatrix``.
Here is a simplified example:
.. literalinclude:: /../../python/ray/util/xgboost/simple_example.py
    :language: python
    :start-after: __xgboost_begin__
    :end-before: __xgboost_end__
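As a rough sketch, such a training call can look like the following. This assumes the ``train`` function mirrors ``xgboost.train`` and accepts the ``num_actors`` argument described later in this document; refer to ``simple_example.py`` above for the authoritative version:

.. code-block:: python

    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    # Load an in-memory dataset and wrap it in a RayDMatrix.
    data, labels = load_breast_cancer(return_X_y=True)
    dtrain = RayDMatrix(data, labels)

    evals_result = {}
    # Hypothetical call: signature assumed to mirror xgboost.train,
    # plus the num_actors argument described below.
    bst = train(
        {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        dtrain,
        evals=[(dtrain, "train")],
        evals_result=evals_result,
        num_actors=2,
        num_boost_round=10)

    bst.save_model("model.xgb")
    print("Final training error:", evals_result["train"]["error"][-1])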
Data is passed to ``xgboost_ray`` via a ``RayDMatrix`` object.

The ``RayDMatrix`` lazily loads data and stores it sharded in the
Ray object store. The Ray XGBoost actors then access these
shards to run their training.

A ``RayDMatrix`` supports various data and file types, such as
Pandas DataFrames, NumPy arrays, CSV files, and Parquet files.
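For in-memory data, you can construct a ``RayDMatrix`` directly from a Pandas DataFrame, selecting the label column by name. A minimal sketch (the column names are hypothetical):

.. code-block:: python

    import pandas as pd
    from ray.util.xgboost import RayDMatrix

    # Hypothetical feature and label columns for illustration.
    df = pd.DataFrame({
        "feature_a": [0.1, 0.2, 0.3, 0.4],
        "feature_b": [1.0, 2.0, 3.0, 4.0],
        "label": [0, 1, 0, 1],
    })

    # Select the "label" column as the training label.
    dtrain = RayDMatrix(df, label="label")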
Example loading multiple Parquet files:

.. code-block:: python

    import glob
    from ray.util.xgboost import RayDMatrix, RayFileType

    # We can also pass a list of files
    path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet")))

    # This argument will be passed to `pd.read_parquet()`
    columns = [
        "passenger_count",
        "trip_distance", "pickup_longitude", "pickup_latitude",
        "dropoff_longitude", "dropoff_latitude",
        "fare_amount", "extra", "mta_tax", "tip_amount",
        "tolls_amount", "total_amount"
    ]

    dtrain = RayDMatrix(
        path,
        label="passenger_count",  # Will select this column as the label
        columns=columns,
        filetype=RayFileType.PARQUET)
``xgboost_ray`` integrates with Ray Tune (:ref:`tune-main`) to provide distributed hyperparameter tuning for your
distributed XGBoost models. You can run multiple ``xgboost_ray`` training runs in parallel, each with a different
hyperparameter configuration, with each individual training run itself parallelized.
First, move your training code into a function. This function should take in a ``config`` argument which specifies the
hyperparameters for the XGBoost model.
.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
    :language: python
    :start-after: __train_begin__
    :end-before: __train_end__
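As a rough sketch, such a training function could look like the one below. The function body and the reported metric name are assumptions modeled on the patterns above; see ``simple_tune.py`` for the authoritative version:

.. code-block:: python

    from ray import tune
    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    def train_model(config):
        # Hypothetical training function: Tune passes the resolved
        # hyperparameters in `config`.
        data, labels = load_breast_cancer(return_X_y=True)
        dtrain = RayDMatrix(data, labels)

        evals_result = {}
        train(
            config,
            dtrain,
            evals=[(dtrain, "train")],
            evals_result=evals_result,
            num_actors=2,
            num_boost_round=10)

        # Report the final evaluation metric back to Tune.
        tune.report(**{"train-error": evals_result["train"]["error"][-1]})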
Then, you import ``tune`` and use Tune's search primitives to define a hyperparameter search space.
.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
    :language: python
    :start-after: __tune_begin__
    :end-before: __tune_end__
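For instance, a search space could be defined like this (the specific hyperparameters and ranges are illustrative):

.. code-block:: python

    from ray import tune

    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        # Sample the learning rate, subsampling ratio, and tree depth.
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9),
    }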
Finally, you call ``tune.run``, passing in the training function and the ``config``. Internally, Tune will resolve the
hyperparameter search space and invoke the training function multiple times, each with different hyperparameters.

.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
    :language: python
    :start-after: __tune_run_begin__
    :end-before: __tune_run_end__
Make sure you set the ``extra_cpu`` field appropriately so Tune is aware of the total number of resources each trial
requires.
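Putting it together, a hypothetical ``tune.run`` invocation could look like this. The ``resources_per_trial`` values assume two XGBoost actors per training run, matching the ``train_model`` sketch above:

.. code-block:: python

    from ray import tune

    analysis = tune.run(
        train_model,
        config=config,
        num_samples=4,
        # One CPU for the trial driver, plus two extra CPUs
        # for the two remote XGBoost actors per trial.
        resources_per_trial={"cpu": 1, "extra_cpu": 2})

    print("Best config:", analysis.get_best_config(
        metric="train-error", mode="min"))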
By default, ``xgboost_ray`` tries to determine the number of CPUs
available and distributes them evenly across actors.

In the case of very large clusters or clusters with many different
machine sizes, it makes sense to limit the number of CPUs per actor
by setting the ``cpus_per_actor`` argument. Consider always
setting this explicitly.

The number of XGBoost actors always has to be set manually with
the ``num_actors`` argument.
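For example, to run training with eight actors and four CPUs each (a hypothetical invocation, assuming the ``train`` arguments described above):

.. code-block:: python

    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    data, labels = load_breast_cancer(return_X_y=True)
    dtrain = RayDMatrix(data, labels)

    # Explicitly pin resources: 8 actors with 4 CPUs each,
    # i.e. 32 CPUs in total across the cluster.
    bst = train(
        {"objective": "binary:logistic"},
        dtrain,
        num_boost_round=100,
        num_actors=8,
        cpus_per_actor=4)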
For complete end-to-end examples, please have a look at the examples folder:

- Simple sklearn breast cancer dataset example (requires ``sklearn``)
- Simple sklearn breast cancer dataset example with Ray Tune (requires ``sklearn``)
- HIGGS classification example ([download dataset (2.6 GB)])
- HIGGS classification example with Parquet (uses the same dataset)
- Test data classification (uses a self-generated dataset)