[pyspark][doc] add more doc for pyspark #8271
Conversation
@WeichenXu123 @trivialfis please help to review this PR. Thx
We recommend using Conda or Virtualenv to manage python dependencies
in PySpark. Please refer to
`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.
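For the Spark standalone mode this tutorial targets, the linked Databricks post boils down to packing a Conda environment and shipping it with the job. The sketch below follows that post; the environment name, archive name, and application script are placeholders, and the package list is only illustrative:

```shell
# Sketch only: pack a Conda environment and ship it via spark-submit.
# "pyspark_env" and "xgboost_pyspark_app.py" are placeholder names.
conda create -y -n pyspark_env -c conda-forge python=3.9 numpy pyarrow xgboost
conda activate pyspark_env
pip install conda-pack
conda pack -f -o pyspark_env.tar.gz

# Executors unpack the archive under ./environment and use its interpreter.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_env.tar.gz#environment xgboost_pyspark_app.py
```

This approach only works where `spark-submit` accepts `--archives`, which is the caveat raised below for managed environments.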
This varies between different providers of PySpark environments. On Dataproc we can't submit the environment through `spark-submit`.
I mentioned that this tutorial is for Spark standalone mode; I didn't want to couple the XGBoost docs to other CSPs.
doc/tutorials/spark_estimator.rst
Outdated
# load the model
model2 = SparkXGBRankerModel.load("/tmp/xgboost-pyspark-model")

The above code snippet shows how to save/load an XGBoost PySpark model. And you can also
I strongly prefer using the booster attribute and would like to keep this special file name as a workaround that should be used sparingly.
Hmm, this is the standard spark way to save/load model. I can also mention the booster attribute.
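To make the two options concrete, here is a sketch of both paths. It assumes a running Spark session and a fitted model, so it won't run as-is outside a cluster; `train_df` is a placeholder DataFrame, and `get_booster()` is the accessor the `xgboost.spark` estimator models expose for the underlying single-node `Booster`:

    from pyspark.sql import SparkSession
    from xgboost.spark import SparkXGBRanker

    spark = SparkSession.builder.getOrCreate()

    ranker = SparkXGBRanker(qid_col="qid")
    model = ranker.fit(train_df)  # train_df: placeholder training DataFrame

    # Option 1: standard Spark ML persistence (what the tutorial shows).
    model.save("/tmp/xgboost-pyspark-model")

    # Option 2: extract the single-node Booster and save it in XGBoost's
    # native format, which is usable outside of Spark entirely.
    booster = model.get_booster()
    booster.save_model("/tmp/xgboost-native-model.json")

The native file from option 2 can then be loaded by any XGBoost binding, which is why the reviewer prefers mentioning the booster attribute alongside the Spark-style save/load.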
doc/tutorials/spark_estimator.rst
Outdated
you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
without any code change by leveraging GPU.

You only need to add some configurations to enable RAPIDS plugin when submitting.
Suggested change:
You only need to add some configurations to enable RAPIDS plugin when submitting.
Below is a simple example submit command for enabling GPU acceleration:
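As a hedged illustration of what such a submit command tends to contain: the config keys below follow the RAPIDS Accelerator documentation as I understand it, while the jar name, its `<version>` placeholder, and the application script are assumptions, not references to a specific release:

```python
# Spark configs commonly used to enable the RAPIDS Accelerator plugin.
rapids_confs = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "1",
}

def build_submit_command(app, confs, jars=()):
    """Assemble a spark-submit argv list from a conf dict."""
    cmd = ["spark-submit"]
    for jar in jars:
        cmd += ["--jars", jar]
    for key, value in sorted(confs.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd

cmd = build_submit_command(
    "xgboost_pyspark_app.py",                    # placeholder app script
    rapids_confs,
    jars=["rapids-4-spark_2.12-<version>.jar"],  # placeholder jar name
)
print(" ".join(cmd))
```

The point of the helper is only to show which knobs change; the unchanged application script is what "without any code change" refers to in the doc text above.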
Done
LGTM
@trivialfis please help to review it again. Thx
@wbo4958 I made some modifications to your PR; do you have any concerns?
It reads much better than my original wording. Thx
Add doc for pyspark gpu support.