[pyspark][doc] add more doc for pyspark #8271

merged 5 commits into from Sep 29, 2022

@wbo4958 wbo4958 commented Sep 26, 2022

Add doc for pyspark gpu support.

@wbo4958 wbo4958 marked this pull request as ready for review September 26, 2022 03:40

wbo4958 commented Sep 26, 2022

@WeichenXu123 @trivialfis could you please review this PR? Thanks.


We recommend using Conda or Virtualenv to manage python dependencies
in PySpark. Please refer to
`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.
Member

This varies between providers of PySpark environments. On Dataproc, we can't submit the environment through spark-submit.

Contributor Author

I mentioned that this tutorial is for Spark standalone mode; I didn't want to bring other CSPs into the XGBoost docs.
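For readers following this thread, here is a minimal sketch of what shipping a packed Conda environment through `spark-submit` could look like in standalone mode. It only builds the environment variables and the argument list; it does not run anything. The helper name `conda_submit`, the archive name, the `environment` alias, and the master URL are all hypothetical placeholders, and, as noted above, managed providers such as Dataproc may require a different mechanism.

```python
import os


def conda_submit(app, archive, alias="environment",
                 master="spark://<master-ip>:7077"):
    """Build (env, argv) for launching spark-submit with a packed Conda env.

    Assumptions (not from the PR): `archive` was produced by a tool such as
    conda-pack, and the cluster accepts `--archives`. Nothing is executed.
    """
    env = dict(os.environ)
    # Executors unpack the archive under `alias` and use the Python inside it.
    env["PYSPARK_PYTHON"] = f"./{alias}/bin/python"
    argv = [
        "spark-submit",
        "--master", master,
        "--archives", f"{archive}#{alias}",
        app,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = conda_submit("xgboost_app.py", "xgboost_env.tar.gz")
    print(env["PYSPARK_PYTHON"])
    print(argv)
```

The returned pair can be passed to `subprocess.run(argv, env=env)` on a machine where `spark-submit` is on the `PATH`.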

# load the model
model2 = SparkXGBRankerModel.load("/tmp/xgboost-pyspark-model")

The above code snippet shows how to save/load xgboost pyspark model. And you can also
Member

I strongly prefer using the booster attribute and would like to keep this special file name as a workaround that should be used sparingly.

Contributor Author

Hmm, this is the standard Spark way to save and load a model. I can also mention the booster attribute.
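To make the two options in this thread concrete, the sketch below persists a fitted model both ways: through Spark ML persistence (`save` / `load`, as in the snippet under review) and through the underlying booster. It assumes that "the booster attribute" refers to the model's `get_booster()` method and that the returned object exposes `save_model`. The helper name `persist_both_ways` and the paths are hypothetical; the function is duck-typed so that it does not import `xgboost` or `pyspark` itself.

```python
def persist_both_ways(model, spark_dir, booster_file):
    """Save a fitted xgboost.spark model in two forms and reload the Spark one.

    `model` is assumed to be a fitted model such as SparkXGBRankerModel.
    """
    # 1. Standard Spark ML persistence: writes a directory at `spark_dir`.
    model.save(spark_dir)

    # 2. Booster route: extract the underlying booster and save it as a
    #    plain XGBoost model file, usable outside Spark.
    booster = model.get_booster()
    booster.save_model(booster_file)

    # Reload through the model's own class, mirroring
    # SparkXGBRankerModel.load("/tmp/xgboost-pyspark-model") above.
    reloaded = type(model).load(spark_dir)
    return reloaded, booster
```

A call might look like `persist_both_ways(model, "/tmp/xgboost-pyspark-model", "/tmp/model.json")`, where `model` comes from fitting an estimator on a training DataFrame.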

you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
without any code change by leveraging GPU.

You only need to add some configurations to enable RAPIDS plugin when submitting.
Member

Suggested change
You only need to add some configurations to enable RAPIDS plugin when submitting.
You only need to add some configurations to enable RAPIDS plugin when submitting.

Below is a simple example submit command for enabling GPU acceleration:

Contributor Author

Done
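As an illustration of the kind of submit command the suggestion refers to, the following sketch assembles a `spark-submit` argument list with GPU resource settings and the RAPIDS plugin enabled. The configuration keys shown (`spark.executor.resource.gpu.amount`, `spark.task.resource.gpu.amount`, `spark.plugins`) are standard Spark and RAPIDS settings, but the values, the helper name `gpu_submit_argv`, the master URL, and the package coordinate with its `<version>` placeholder are assumptions for illustration, not the exact command merged in this PR.

```python
import shlex

# Assumed values for illustration; tune them for the actual cluster.
RAPIDS_CONF = {
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "1",
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
}


def gpu_submit_argv(app, master, rapids_package, conf=None):
    """Return the spark-submit argument list; nothing is executed here."""
    conf = RAPIDS_CONF if conf is None else conf
    argv = ["spark-submit", "--master", master, "--packages", rapids_package]
    for key, value in conf.items():
        # Each setting becomes its own `--conf key=value` pair.
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)
    return argv


if __name__ == "__main__":
    argv = gpu_submit_argv(
        "xgboost_app.py",
        "spark://<master-ip>:7077",
        "com.nvidia:rapids-4-spark_2.12:<version>",  # placeholder version
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Keeping the settings in a dictionary makes it straightforward to add or override individual options without touching the application code, which matches the "no code change" claim in the text under review.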

Contributor

@WeichenXu123 WeichenXu123 left a comment

LGTM


wbo4958 commented Sep 27, 2022

@trivialfis could you please review it again? Thanks.

@trivialfis trivialfis added this to In progress in 1.7 Roadmap Sep 28, 2022
@wbo4958 wbo4958 mentioned this pull request Sep 28, 2022
1.7 Roadmap automation moved this from In progress to Reviewer approved Sep 28, 2022
@trivialfis
Member

@wbo4958 I made some modifications to your PR, do you have any concerns?


wbo4958 commented Sep 29, 2022

It reads much better than my original wording. Thanks.

@trivialfis trivialfis merged commit cbf3a5f into dmlc:master Sep 29, 2022
1.7 Roadmap automation moved this from Reviewer approved to Done Sep 29, 2022
@wbo4958 wbo4958 deleted the pyspark-doc branch April 23, 2024 07:43
3 participants