[pyspark][doc] add more doc for pyspark #8271

Merged (5 commits, Sep 29, 2022)
doc/tutorials/spark_estimator.rst (150 additions, 14 deletions)

################################
Distributed XGBoost with PySpark
################################
Starting from version 2.0, xgboost supports pyspark estimator APIs.
The feature is still experimental and not yet ready for production use.

.. contents::
  :backlinks: none
  :local:

*************************
XGBoost PySpark Estimator
*************************

SparkXGBRegressor
=================

`SparkXGBRegressor` is a PySpark ML estimator. It implements the XGBoost regression
algorithm based on the XGBoost python library, and it can be used in PySpark Pipelines.

We can create a `SparkXGBRegressor` estimator like:
.. code-block:: python

  from xgboost.spark import SparkXGBRegressor
  spark_reg_estimator = SparkXGBRegressor(
      features_col="features",
      label_col="label",
      num_workers=2,
  )


The above snippet creates a spark estimator which can fit on a spark dataset and
return a spark model that can transform a spark dataset and generate a dataset
with a prediction column. We can set almost all of the xgboost sklearn estimator parameters
as `SparkXGBRegressor` parameters, but some parameters such as `nthread` are forbidden
in the spark estimator, and some are replaced with pyspark-specific parameters
such as `weight_col`, `validation_indicator_col`, and `use_gpu`; for details please refer
to the `SparkXGBRegressor` API documentation.

The following code snippet shows how to train a spark xgboost regressor model.
First we need to prepare a training dataset as a spark dataframe containing a
"label" column and "features" column(s). The "features" column(s) must be of
`pyspark.ml.linalg.Vector` type or spark array type, or be a list of feature column names.

.. code-block:: python

  from pyspark.ml.linalg import Vectors

  # a minimal sketch; `spark` is an existing SparkSession, and real data
  # would normally be loaded from storage rather than created inline
  train_spark_dataframe = spark.createDataFrame(
      [(Vectors.dense(1.0, 2.0, 3.0), 0.0),
       (Vectors.dense(4.0, 5.0, 6.0), 1.0)],
      ["features", "label"],
  )
  test_spark_dataframe = spark.createDataFrame(
      [(Vectors.dense(1.0, 2.0, 3.0),), (Vectors.dense(4.0, 5.0, 6.0),)],
      ["features"],
  )

  # fit the estimator created above and transform the test dataset
  model = spark_reg_estimator.fit(train_spark_dataframe)
  transformed_test_spark_dataframe = model.transform(test_spark_dataframe)
The above code snippet returns a `transformed_test_spark_dataframe` that contains the input
dataset columns and an appended column "prediction" representing the prediction results.


SparkXGBClassifier
==================

The `SparkXGBClassifier` estimator has a similar API to `SparkXGBRegressor`, but it has some
pyspark classifier specific params, e.g. the `raw_prediction_col` and `probability_col` parameters.
Correspondingly, by default, transforming a test dataset with `SparkXGBClassifierModel` will
generate a result dataset with 3 new columns:

- "prediction": represents the predicted label.
- "raw_prediction": represents the output margin values.
- "probability": represents the prediction probability on each label.
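
A minimal, self-contained sketch of training and transforming with `SparkXGBClassifier`
(the toy dataset and its values are hypothetical, for illustration only):

.. code-block:: python

  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from xgboost.spark import SparkXGBClassifier

  spark = SparkSession.builder.master("local[2]").getOrCreate()

  # a tiny two-class toy dataset (hypothetical values)
  df = spark.createDataFrame(
      [(Vectors.dense(1.0, 2.0), 0), (Vectors.dense(3.0, 4.0), 1),
       (Vectors.dense(1.5, 2.5), 0), (Vectors.dense(3.5, 4.5), 1)],
      ["features", "label"],
  )

  classifier = SparkXGBClassifier(features_col="features", label_col="label")
  model = classifier.fit(df)

  # the transformed dataset gains the new columns described above
  result = model.transform(df)
  result.show()

The output column names can be customized through the corresponding
`raw_prediction_col` and `probability_col` constructor parameters.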


***************************
XGBoost PySpark GPU support
***************************

XGBoost PySpark supports GPU training and prediction. To enable GPU support, you first need
to install the xgboost and cudf packages. Then you can set the `use_gpu` parameter to `True`.

The tutorial below shows how to train a model with XGBoost PySpark GPU support on a Spark
standalone cluster.


Write your PySpark application
==============================

.. code-block:: python

  from pyspark.sql import SparkSession
  from xgboost.spark import SparkXGBRegressor

  spark = SparkSession.builder.getOrCreate()

  # read data into spark dataframes
  train_data_path = "xxxx/train"
  train_df = spark.read.parquet(train_data_path)

  test_data_path = "xxxx/test"
  test_df = spark.read.parquet(test_data_path)

  # assume the label column is named "class"
  label_name = "class"

  # get a list with the feature column names
  feature_names = [x.name for x in train_df.schema if x.name != label_name]

  # create a xgboost pyspark regressor estimator and set use_gpu=True
  regressor = SparkXGBRegressor(
      features_col=feature_names,
      label_col=label_name,
      num_workers=2,
      use_gpu=True,
  )

  # train and return the model
  model = regressor.fit(train_df)

  # predict on test data
  predict_df = model.transform(test_df)
  predict_df.show()

Prepare the necessary packages
==============================

We recommend using Conda or Virtualenv to manage python dependencies
in PySpark. Please refer to
`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.
.. code-block:: bash

  conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
  conda activate xgboost-env
  pip install xgboost
  pip install cudf
  conda pack -f -o xgboost-env.tar.gz


Submit the PySpark application
==============================

Assuming you have configured your Spark cluster with GPU support (if not, please refer to
`spark standalone configuration with GPU support <https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_),
you can submit the application as follows:

.. code-block:: bash

  export PYSPARK_DRIVER_PYTHON=python
  export PYSPARK_PYTHON=./environment/bin/python

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --archives xgboost-env.tar.gz#environment \
    xgboost_app.py


Model Persistence
=================

.. code-block:: python

  # save the model
  model.save("/tmp/xgboost-pyspark-model")

  # load the model
  model2 = SparkXGBRegressorModel.load("/tmp/xgboost-pyspark-model")

The above code snippet shows how to save and load an xgboost pyspark model. You can also
load the model with the xgboost python package directly, without involving spark.

.. code-block:: python

  import xgboost as xgb
  bst = xgb.Booster()
  bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")
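
Alternatively, as a sketch (assuming `model` is the trained spark model from the sections
above, and that the spark model exposes its native booster via a `get_booster()` accessor),
you can save the booster in the regular XGBoost format and reload it directly, without
depending on the spark file layout:

.. code-block:: python

  import xgboost as xgb

  # `model` is assumed to be the trained spark model from above;
  # `get_booster()` is assumed to return the underlying native xgboost.Booster
  booster = model.get_booster()
  booster.save_model("/tmp/xgboost-native-model.json")

  bst = xgb.Booster()
  bst.load_model("/tmp/xgboost-native-model.json")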


Accelerate the whole pipeline of xgboost pyspark
================================================

With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_,
you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
without any code change by leveraging GPU.

You only need to add some configurations to enable the RAPIDS plugin when submitting.
Below is a simple example submit command for enabling GPU acceleration:

.. code-block:: bash

  export PYSPARK_DRIVER_PYTHON=python
  export PYSPARK_PYTHON=./environment/bin/python

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --archives xgboost-env.tar.gz#environment \
    xgboost_app.py