This library adds a new backend for XGBoost utilizing Ray.
Please note that this is an early version and both the API and the behavior can change without prior notice.
You can install XGBoost on Ray (``xgboost_ray``) like this:

.. code-block:: bash

    git clone https://github.com/ray-project/xgboost_ray.git
    cd xgboost_ray
    pip install -e .
After installation, you can import XGBoost on Ray in one of two ways:

.. code-block:: python

    import xgboost_ray
    # or
    import ray.util.xgboost
``xgboost_ray`` provides a drop-in replacement for XGBoost's ``train``
function. To pass data, instead of using ``xgb.DMatrix`` you will
have to use ``ray.util.xgboost.RayDMatrix``.
Here is a simplified example:
.. literalinclude:: /../../python/ray/util/xgboost/simple_example.py
    :language: python
    :start-after: __xgboost_begin__
    :end-before: __xgboost_end__
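As a rough sketch, such a training call can look like the following. This assumes the ``train`` function mirrors ``xgboost.train`` and accepts the ``num_actors`` argument described later in this document; refer to ``simple_example.py`` above for the authoritative version:

.. code-block:: python

    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    # Load an in-memory dataset and wrap it in a RayDMatrix.
    data, labels = load_breast_cancer(return_X_y=True)
    dtrain = RayDMatrix(data, labels)

    evals_result = {}
    # Hypothetical call: signature assumed to mirror xgboost.train,
    # plus the num_actors argument described below.
    bst = train(
        {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        dtrain,
        evals=[(dtrain, "train")],
        evals_result=evals_result,
        num_actors=2,
        num_boost_round=10)

    bst.save_model("model.xgb")
    print("Final training error:", evals_result["train"]["error"][-1])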
Data is passed to ``xgboost_ray`` via a ``RayDMatrix`` object.

The ``RayDMatrix`` lazily loads data and stores it sharded in the
Ray object store. The Ray XGBoost actors then access these
shards to run their training.

A ``RayDMatrix`` supports various data and file types, such as
Pandas DataFrames, NumPy arrays, CSV files, and Parquet files.
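For in-memory data, you can construct a ``RayDMatrix`` directly from a Pandas DataFrame, selecting the label column by name. A minimal sketch (the column names are hypothetical):

.. code-block:: python

    import pandas as pd
    from ray.util.xgboost import RayDMatrix

    # Hypothetical feature and label columns for illustration.
    df = pd.DataFrame({
        "feature_a": [0.1, 0.2, 0.3, 0.4],
        "feature_b": [1.0, 2.0, 3.0, 4.0],
        "label": [0, 1, 0, 1],
    })

    # Select the "label" column as the training label.
    dtrain = RayDMatrix(df, label="label")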
Example loading multiple Parquet files:

.. code-block:: python

    import glob
    from ray.util.xgboost import RayDMatrix, RayFileType

    # We can also pass a list of files
    path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet")))

    # This argument will be passed to `pd.read_parquet()`
    columns = [
        "passenger_count",
        "trip_distance", "pickup_longitude", "pickup_latitude",
        "dropoff_longitude", "dropoff_latitude",
        "fare_amount", "extra", "mta_tax", "tip_amount",
        "tolls_amount", "total_amount"
    ]

    dtrain = RayDMatrix(
        path,
        label="passenger_count",  # Will select this column as the label
        columns=columns,
        filetype=RayFileType.PARQUET)
``xgboost_ray`` integrates with Ray Tune (:ref:`tune-main`) to provide distributed hyperparameter tuning for your
distributed XGBoost models. You can run multiple ``xgboost_ray`` training runs in parallel, each with a different
hyperparameter configuration, with each individual training run itself parallelized.
First, move your training code into a function. This function should take in a ``config`` argument which specifies the
hyperparameters for the XGBoost model.
.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
    :language: python
    :start-after: __train_begin__
    :end-before: __train_end__
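As a rough sketch, such a training function could look like the one below. The function body and the reported metric name are assumptions modeled on the patterns above; see ``simple_tune.py`` for the authoritative version:

.. code-block:: python

    from ray import tune
    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    def train_model(config):
        # Hypothetical training function: Tune passes the resolved
        # hyperparameters in `config`.
        data, labels = load_breast_cancer(return_X_y=True)
        dtrain = RayDMatrix(data, labels)

        evals_result = {}
        train(
            config,
            dtrain,
            evals=[(dtrain, "train")],
            evals_result=evals_result,
            num_actors=2,
            num_boost_round=10)

        # Report the final evaluation metric back to Tune.
        tune.report(**{"train-error": evals_result["train"]["error"][-1]})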
Then, you import ``tune`` and use Tune's search primitives to define a hyperparameter search space.
.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
    :language: python
    :start-after: __tune_begin__
    :end-before: __tune_end__
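For instance, a search space could be defined like this (the specific hyperparameters and ranges are illustrative):

.. code-block:: python

    from ray import tune

    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        # Sample the learning rate, subsampling ratio, and tree depth.
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": tune.randint(1, 9),
    }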
Finally, you call ``tune.run``, passing in the training function and the ``config``. Internally, Tune will resolve the
hyperparameter search space and invoke the training function multiple times, each with different hyperparameters.

.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
    :language: python
    :start-after: __tune_run_begin__
    :end-before: __tune_run_end__
Make sure you set the ``extra_cpu`` field appropriately so Tune is aware of the total number of resources each trial
requires.
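Putting it together, a hypothetical ``tune.run`` invocation could look like this. The ``resources_per_trial`` values assume two XGBoost actors per training run, matching the ``train_model`` sketch above:

.. code-block:: python

    from ray import tune

    analysis = tune.run(
        train_model,
        config=config,
        num_samples=4,
        # One CPU for the trial driver, plus two extra CPUs
        # for the two remote XGBoost actors per trial.
        resources_per_trial={"cpu": 1, "extra_cpu": 2})

    print("Best config:", analysis.get_best_config(
        metric="train-error", mode="min"))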
By default, ``xgboost_ray`` tries to determine the number of CPUs
available and distributes them evenly across actors.

In the case of very large clusters or clusters with many different
machine sizes, it makes sense to limit the number of CPUs per actor
by setting the ``cpus_per_actor`` argument. Consider always
setting this explicitly.

The number of XGBoost actors always has to be set manually with
the ``num_actors`` argument.
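For example, to run training with eight actors and four CPUs each (a hypothetical invocation, assuming the ``train`` arguments described above):

.. code-block:: python

    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    data, labels = load_breast_cancer(return_X_y=True)
    dtrain = RayDMatrix(data, labels)

    # Explicitly pin resources: 8 actors with 4 CPUs each,
    # i.e. 32 CPUs in total across the cluster.
    bst = train(
        {"objective": "binary:logistic"},
        dtrain,
        num_boost_round=100,
        num_actors=8,
        cpus_per_actor=4)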
For complete end-to-end examples, please have a look at the examples folder:

- Simple sklearn breast cancer dataset example (requires ``sklearn``)
- Simple sklearn breast cancer dataset example with Ray Tune (requires ``sklearn``)
- HIGGS classification example ([download dataset (2.6 GB)])
- HIGGS classification example with Parquet (uses the same dataset)
- Test data classification (uses a self-generated dataset)