Documentation: train + RayDMatrix vs XGBoostTrainer + Dataset #309

Open
andreostrovsky opened this issue Apr 1, 2024 · 0 comments
Could you please provide some clarification on the differences and/or how to choose between using xgboost_ray.train + xgboost_ray.RayDMatrix or ray.train.xgboost.XGBoostTrainer + ray.data.Dataset?

My use case is running Ray Tune on Azure Databricks, which operates on Spark. According to the Databricks docs, one creates a Ray Cluster using the Ray on Spark API, and creates a Ray Dataset from Parquet files.

Below are the questions I would like clarification on. Any help you could provide would be greatly appreciated.

Data

According to the README.md, one can create a RayDMatrix from either Parquet files or a Ray Dataset:

xgboost_ray/README.md

Lines 450 to 465 in e904925

### Data sources
The following data sources can be used with a `RayDMatrix` object.
| Type | Centralized loading | Distributed loading |
|------------------------------------------------------------------|---------------------|---------------------|
| Numpy array | Yes | No |
| Pandas dataframe | Yes | No |
| Single CSV | Yes | No |
| Multi CSV | Yes | Yes |
| Single Parquet | Yes | No |
| Multi Parquet | Yes | Yes |
| [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html) | Yes | Yes |
| [Petastorm](https://github.com/uber/petastorm) | Yes | Yes |
| [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes | Yes |
| [Modin dataframe](https://modin.readthedocs.io/en/latest/) | Yes | Yes |

So if using xgboost_ray, should I:

  • create a Ray Dataset from Parquet files, then create a RayDMatrix from that Dataset, or
  • create the RayDMatrix directly from Parquet files?

Training

Should I use Ray Tune with XGBoostTrainer or with xgboost_ray.train, running on this Ray on Spark Cluster?

I also intend to implement cross-validation (CV) with early stopping. Since tune-sklearn is now deprecated, I understand that I'll need to implement this myself. As explained in ray-project/ray#21848 (comment), this can be done with ray.tune.stopper.TrialPlateauStopper. But according to #301, we can also use XGBoost's native xgb.callback.EarlyStopping. Which approach would you recommend? Can TrialPlateauStopper be used with xgboost_ray?

Thank you very much for any help you can offer.
