[pyspark][doc] add more doc for pyspark #8271
@@ -1,12 +1,19 @@

################################
Distributed XGBoost with PySpark
################################

Starting from version 2.0, xgboost supports pyspark estimator APIs.
The feature is still experimental and not yet ready for production use.

.. contents::
   :backlinks: none
   :local:

*************************
XGBoost PySpark Estimator
*************************

SparkXGBRegressor
=================

SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost regression
algorithm based on the XGBoost python library, and it can be used in a PySpark Pipeline
@@ -17,10 +24,14 @@ We can create a `SparkXGBRegressor` estimator like:

.. code-block:: python

    from xgboost.spark import SparkXGBRegressor

    spark_reg_estimator = SparkXGBRegressor(
        features_col="features",
        label_col="label",
        num_workers=2,
    )

The above snippet creates a spark estimator which can fit on a spark dataset and
return a spark model that can transform a spark dataset and generate a dataset
with a prediction column. We can set almost all of the xgboost sklearn estimator
parameters as `SparkXGBRegressor` parameters, but some parameters such as `nthread` are forbidden
@@ -30,8 +41,9 @@ such as `weight_col`, `validation_indicator_col`, `use_gpu`, for details please

The following code snippet shows how to train a spark xgboost regressor model.
First we need to prepare a training dataset as a spark dataframe containing a
"label" column and "features" column(s); the "features" column(s) must be
`pyspark.ml.linalg.Vector` type or spark array type, or a list of feature column names.

.. code-block:: python
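    # NOTE: the body of this code block is elided by the diff view. The lines
    # below are a hedged, illustrative sketch only: they assume an existing
    # SparkSession named `spark` and the `spark_reg_estimator` defined above,
    # and the tiny in-memory dataframe is made up for demonstration.
    from pyspark.ml.linalg import Vectors

    train_spark_dataframe = spark.createDataFrame(
        [
            (Vectors.dense(1.0, 2.0, 3.0), 0.0),
            (Vectors.dense(4.0, 5.0, 6.0), 1.0),
        ],
        ["features", "label"],
    )

    # fit returns a SparkXGBRegressorModel
    model = spark_reg_estimator.fit(train_spark_dataframe)

    # transform appends a "prediction" column to the input dataframe
    transformed_test_spark_dataframe = model.transform(train_spark_dataframe)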
@@ -51,17 +63,141 @@ type or spark array type.

The above code snippet returns a `transformed_test_spark_dataframe` that contains the input
dataset columns and an appended column "prediction" representing the prediction results.
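For example, a quick way to peek at the appended column (an illustrative snippet; it assumes the dataframe produced above contains a "label" column):

.. code-block:: python

    # show a few rows of the label next to the model's prediction
    transformed_test_spark_dataframe.select("label", "prediction").show(5)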
SparkXGBClassifier
==================

The `SparkXGBClassifier` estimator has a similar API to `SparkXGBRegressor`, but it has some
pyspark classifier specific params, e.g. the `raw_prediction_col` and `probability_col` parameters.
Correspondingly, by default, a `SparkXGBClassifierModel` transforming a test dataset will
generate a result dataset with 3 new columns:

- "prediction": represents the predicted label.
- "raw_prediction": represents the output margin values.
- "probability": represents the prediction probability on each label.
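A minimal sketch of the classifier API, assuming a training dataframe with "features" and "label" columns as in the regressor example (the dataframe name here is a placeholder):

.. code-block:: python

    from xgboost.spark import SparkXGBClassifier

    spark_clf_estimator = SparkXGBClassifier(
        features_col="features",
        label_col="label",
        num_workers=2,
    )

    clf_model = spark_clf_estimator.fit(train_spark_dataframe)

    # by default the result gains "prediction", "raw_prediction" and
    # "probability" columns
    result_df = clf_model.transform(train_spark_dataframe)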
***************************
XGBoost PySpark GPU support
***************************

XGBoost PySpark supports GPU training and prediction. To enable GPU support, you first need
to install the xgboost and cudf packages. Then you can set the `use_gpu` parameter to `True`.

The tutorial below shows how to train a model with XGBoost PySpark GPU on a Spark
standalone cluster.

Write your PySpark application
==============================
.. code-block:: python

    from pyspark.sql import SparkSession
    from xgboost.spark import SparkXGBRegressor

    spark = SparkSession.builder.getOrCreate()

    # read data into spark dataframes
    train_data_path = "xxxx/train"
    train_df = spark.read.parquet(train_data_path)

    test_data_path = "xxxx/test"
    test_df = spark.read.parquet(test_data_path)

    # assume the label column is named "class"
    label_name = "class"

    # get a list with feature column names
    feature_names = [x.name for x in train_df.schema if x.name != label_name]

    # create a xgboost pyspark regressor estimator and set use_gpu=True
    regressor = SparkXGBRegressor(
        features_col=feature_names,
        label_col=label_name,
        num_workers=2,
        use_gpu=True,
    )

    # train and return the model
    model = regressor.fit(train_df)

    # predict on test data
    predict_df = model.transform(test_df)
    predict_df.show()
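To sanity-check the trained model you can score the predictions with a standard PySpark evaluator. This is a small illustrative addition (the metric choice is arbitrary); it reuses `label_name` and `predict_df` from the snippet above:

.. code-block:: python

    from pyspark.ml.evaluation import RegressionEvaluator

    # compute RMSE between the label column and the appended "prediction" column
    evaluator = RegressionEvaluator(
        labelCol=label_name, predictionCol="prediction", metricName="rmse"
    )
    print("test RMSE:", evaluator.evaluate(predict_df))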
Prepare the necessary packages
==============================

We recommend using Conda or Virtualenv to manage python dependencies
in PySpark. Please refer to
`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.

.. code-block:: bash

    conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
    conda activate xgboost-env
    pip install xgboost
    pip install cudf
    conda pack -f -o xgboost-env.tar.gz
Submit the PySpark application
==============================

This assumes you have configured your Spark cluster with GPU support. If not yet, please
refer to `spark standalone configuration with GPU support <https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_.

.. code-block:: bash

    export PYSPARK_DRIVER_PYTHON=python
    export PYSPARK_PYTHON=./environment/bin/python

    spark-submit \
      --master spark://<master-ip>:7077 \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \
      --archives xgboost-env.tar.gz#environment \
      xgboost_app.py
Model Persistence
=================

.. code-block:: python

    # save the model
    model.save("/tmp/xgboost-pyspark-model")

    # load the model
    model2 = SparkXGBRegressorModel.load("/tmp/xgboost-pyspark-model")

The above code snippet shows how to save/load an xgboost pyspark model. You can also
load the model with the xgboost python package directly, without involving spark.

Reviewer: I strongly prefer using the booster attribute and would like to keep this special file name as a workaround that should be used sparingly.

Author: Hmm, this is the standard spark way to save/load a model. I can also mention the booster attribute.
.. code-block:: python

    import xgboost as xgb

    bst = xgb.Booster()
    bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")
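As the review comment above suggests, an alternative that avoids relying on the internal part file name is to go through the model's booster. This is a hedged sketch; it assumes the trained `model` from the earlier example and that the spark model exposes a `get_booster()` accessor:

.. code-block:: python

    # extract the underlying xgboost Booster from the pyspark model and save it
    # in the native xgboost format (the path is just an example)
    booster = model.get_booster()
    booster.save_model("/tmp/xgboost-native-model.json")

    # the saved file can then be loaded with the plain xgboost python package
    import xgboost as xgb
    bst = xgb.Booster()
    bst.load_model("/tmp/xgboost-native-model.json")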
Accelerate the whole pipeline of xgboost pyspark
================================================

With the `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_,
you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
without any code change by leveraging GPU.

You only need to add some configurations to enable the RAPIDS plugin when submitting.
Below is a simple example submit command for enabling GPU acceleration:
.. code-block:: bash

    export PYSPARK_DRIVER_PYTHON=python
    export PYSPARK_PYTHON=./environment/bin/python

    spark-submit \
      --master spark://<master-ip>:7077 \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \
      --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin \
      --archives xgboost-env.tar.gz#environment \
      xgboost_app.py
Reviewer: This varies between different providers of pyspark environments. On Dataproc we can't submit the environment through spark-submit.

Author: I mentioned this tutorial is for spark standalone mode; I didn't want to involve other CSPs in xgboost.