[PySpark] change the returning model type to string from binary #8085

Merged
merged 3 commits into from Jul 19, 2022

Contributor

@wbo4958 wbo4958 commented Jul 17, 2022

RAPIDS Accelerator is a Spark plugin that uses GPUs to accelerate the Spark SQL and DataFrame APIs. However, it does not support every data type. For example, it does not yet support the BINARY type in mapInPandas:

22/07/17 09:06:41 WARN GpuOverrides: 
!Exec <MapInPandasExec> cannot run even partially on the GPU because unsupported data types in output: BinaryType [booster_bytes#31]; not all expressions can be replaced
  !Expression <AttributeReference> booster_bytes#31 cannot run on GPU because expression AttributeReference booster_bytes#31 produces an unsupported type BinaryType
  !Expression <PythonUDF> _train_booster(values#15, label#14) blocks running on GPU because expression PythonUDF _train_booster(values#15, label#14) produces an unsupported type StructType(StructField(booster_bytes,BinaryType,true))
    @Expression <AttributeReference> values#15 could run on GPU
    @Expression <AttributeReference> label#14 could run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <ProjectExec> will run on GPU
      *Expression <Alias> cast(array(sepal_length#0, sepal_width#1, petal_length#2, petal_width#3) as array<float>) AS values#15 will run on GPU
        *Expression <Cast> cast(array(sepal_length#0, sepal_width#1, petal_length#2, petal_width#3) as array<float>) will run on GPU
          *Expression <CreateArray> array(sepal_length#0, sepal_width#1, petal_length#2, petal_width#3) will run on GPU
      *Expression <Alias> cast(class#4 as float) AS label#14 will run on GPU
        *Expression <Cast> cast(class#4 as float) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

So this PR changes the returned model type from "bytes" to "string", which lets RAPIDS Accelerator speed up XGBoost PySpark seamlessly. The change has no side effects on XGBoost PySpark itself.

*Exec <MapInPandasExec> will partially run on GPU
  *Expression <PythonUDF> _train_booster(values#15, label#14) will not block GPU acceleration
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <ProjectExec> will run on GPU
      *Expression <Alias> cast(array(sepal_length#0, sepal_width#1, petal_length#2, petal_width#3) as array<float>) AS values#15 will run on GPU
        *Expression <Cast> cast(array(sepal_length#0, sepal_width#1, petal_length#2, petal_width#3) as array<float>) will run on GPU
          *Expression <CreateArray> array(sepal_length#0, sepal_width#1, petal_length#2, petal_width#3) will run on GPU
      *Expression <Alias> cast(class#4 as float) AS label#14 will run on GPU
        *Expression <Cast> cast(class#4 as float) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU
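
The conversion at the heart of this change can be sketched without Spark. The snippet below is a minimal illustration, not the PR's actual code: it assumes the raw model is UTF-8 JSON held in a bytearray, as `booster.save_raw("json")` would produce, and it uses an invented stand-in payload instead of a real booster. The helper names `model_bytes_to_str` and `model_str_to_bytes` are hypothetical.

```python
import json


def model_bytes_to_str(raw: bytearray) -> str:
    """Decode JSON model bytes into a str, suitable for a string-typed column."""
    return raw.decode("utf-8")


def model_str_to_bytes(text: str) -> bytearray:
    """Encode the string back into a raw buffer for loading the model."""
    return bytearray(text.encode("utf-8"))


# Stand-in for the buffer a real booster would return from save_raw("json").
# The payload content is invented purely for illustration.
stand_in_raw = bytearray(json.dumps({"stand_in": "model"}).encode("utf-8"))

as_text = model_bytes_to_str(stand_in_raw)
round_tripped = model_str_to_bytes(as_text)

# The round trip is lossless: the original bytes are recovered exactly.
assert round_tripped == stand_in_raw
```

Because JSON is plain text, decoding and re-encoding returns identical bytes, which is consistent with the statement that the change has no side effects on XGBoost PySpark itself.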

In a rough performance test, XGBoost PySpark showed a ~35% improvement when using RAPIDS Accelerator.

XGBoost PySpark can be accelerated by RAPIDS Accelerator seamlessly by
changing the returned model type from binary to string.
@wbo4958
Contributor Author

wbo4958 commented Jul 18, 2022

@trivialfis, it looks like booster.save_raw("json").decode("utf-8") doesn't save the configuration?

@trivialfis
Member

No, it doesn't save the configuration. Is it required? Because the configuration is not stable across versions.

@wbo4958
Contributor Author

wbo4958 commented Jul 19, 2022

> No, it doesn't save the configuration. Is it required? Because the configuration is not stable across versions.

I just found another way to handle the configuration.
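
Since `save_raw("json")` omits the configuration, one possible way to carry it alongside the model is to serialize it separately with `save_config()` and restore it with `load_config()`. The sketch below is an assumption about how this could be done, not necessarily the approach taken in this PR. The helper names are hypothetical, and the functions rely only on the booster object exposing `save_raw`, `save_config`, `load_model`, and `load_config`.

```python
from typing import Any, Callable, Tuple


def booster_to_strings(booster: Any) -> Tuple[str, str]:
    """Return (model_json, config_json) as plain strings."""
    # save_raw("json") yields UTF-8 JSON bytes that exclude the configuration.
    model_json = booster.save_raw("json").decode("utf-8")
    # save_config() yields the configuration as a JSON string.
    config_json = booster.save_config()
    return model_json, config_json


def strings_to_booster(
    model_json: str, config_json: str, new_booster: Callable[[], Any]
) -> Any:
    """Rebuild a booster from the two strings using a caller-supplied factory."""
    booster = new_booster()
    # Load the model from an in-memory buffer, then apply the configuration.
    booster.load_model(bytearray(model_json.encode("utf-8")))
    booster.load_config(config_json)
    return booster
```

With XGBoost installed, `new_booster` could be `xgboost.Booster`. The caveat raised above still applies: the configuration is not stable across versions, so restoring it is best kept to the same XGBoost version that saved it.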

@wbo4958
Contributor Author

wbo4958 commented Jul 19, 2022

@WeichenXu123 @trivialfis Could you help review this PR?

@trivialfis trivialfis merged commit f801d3c into dmlc:master Jul 19, 2022
@wbo4958 wbo4958 deleted the string-model branch April 23, 2024 09:27