Documentation: train + RayDMatrix vs XGBoostTrainer + Dataset #309

Open
andreostrovsky opened this issue Apr 1, 2024 · 0 comments
Could you please provide some clarification on the differences and/or how to choose between using xgboost_ray.train + xgboost_ray.RayDMatrix or ray.train.xgboost.XGBoostTrainer + ray.data.Dataset?

My use case is running Ray Tune on Azure Databricks, which operates on Spark. According to the Databricks docs, one creates a Ray Cluster using the Ray on Spark API, and creates a Ray Dataset from Parquet files.

Below are the questions I would like clarification on. Any help you could provide would be greatly appreciated.

Data

According to the README.md, one can create a RayDMatrix from either Parquet files or a Ray Dataset:

xgboost_ray/README.md

Lines 450 to 465 in e904925

### Data sources
The following data sources can be used with a `RayDMatrix` object.
| Type | Centralized loading | Distributed loading |
|------------------------------------------------------------------|---------------------|---------------------|
| Numpy array | Yes | No |
| Pandas dataframe | Yes | No |
| Single CSV | Yes | No |
| Multi CSV | Yes | Yes |
| Single Parquet | Yes | No |
| Multi Parquet | Yes | Yes |
| [Ray Dataset](https://docs.ray.io/en/latest/data/dataset.html) | Yes | Yes |
| [Petastorm](https://github.com/uber/petastorm) | Yes | Yes |
| [Dask dataframe](https://docs.dask.org/en/latest/dataframe.html) | Yes | Yes |
| [Modin dataframe](https://modin.readthedocs.io/en/latest/) | Yes | Yes |

So if using xgboost_ray, should I:

  • create a Ray Dataset from Parquet files, then create a RayDMatrix from that Dataset, or
  • create the RayDMatrix directly from Parquet files?

Training

Should I use Ray Tune with XGBoostTrainer or with xgboost_ray.train, running on this Ray on Spark Cluster?

I also intend to implement cross-validation (CV) with early stopping. Since tune-sklearn is now deprecated, I understand that I'll need to implement this myself. As explained in ray-project/ray#21848 (comment), this can be done with ray.tune.stopper.TrialPlateauStopper. But according to #301, we can also use XGBoost's native xgb.callback.EarlyStopping. Which approach would you recommend? Can TrialPlateauStopper be used with xgboost_ray?

Thank you very much for any help you can offer.
