Describe the bug
First observed when attempting to run pyspark's CrossValidator + VectorAssembler + the pyspark version of XGBoost under review in this PR: dmlc/xgboost#8020. Parts of this workload should fall back to the CPU due to the VectorUDT column injected by VectorAssembler. However, the running time of certain steps jumps from a few minutes to over an hour when ParquetCachedBatchSerializer is enabled versus disabled, with the spark-rapids plugin enabled in both cases. I attempted to reproduce this in a more self-contained manner with the code snippet below, which incorporates some of the relevant logic from CrossValidator and XGBoost.
Steps/Code to reproduce bug
In my environment, the code below takes a few seconds to run in spark-shell with ParquetCachedBatchSerializer disabled, but almost 2 minutes when it is enabled.
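The original snippet from the report is not captured in this copy of the issue. A minimal sketch along the lines it describes (a cached DataFrame, a VectorUDT column injected by VectorAssembler, and a random filter) might look like the following; all column and variable names here are assumptions, not the reporter's actual code:

```scala
// Hypothetical repro sketch for spark-shell, NOT the original snippet from the issue.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.rand
import spark.implicits._

val df = spark.range(1000000).selectExpr("id", "rand() as f1", "rand() as f2")
val df2 = df.cache()

// VectorAssembler produces a VectorUDT column, which the spark-rapids plugin
// does not support on the GPU, forcing a CPU fallback for this part of the plan.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// Mimics CrossValidator-style random splitting on top of the assembled features.
val df3 = assembler.transform(df2)
  .withColumn("filter", rand())
  .filter($"filter" < 0.5)

df3.count()
```

Timing `df3.count()` with ParquetCachedBatchSerializer enabled versus disabled (spark-rapids plugin on in both cases) should surface the slowdown the report describes.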
A second issue with this example: if the line val df3 = ... is replaced with val df3 = df2.withColumn("filter",rand()).filter($"filter" < 0.5) (i.e. no VectorUDT column is added), an ArrayIndexOutOfBoundsException is thrown with ParquetCachedBatchSerializer enabled, while no error occurs with it disabled.
A pyspark version of the above example shows similar behavior.
Expected behavior
A much smaller performance penalty with ParquetCachedBatchSerializer enabled in this example, which should resolve the main slowdown encountered with the pyspark CrossValidator.
Environment details (please complete the following information)
Environment location: Standalone, local server, single gpu
Spark configuration settings related to the issue:
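The exact settings from the report are not captured in this copy of the issue. A typical invocation enabling both the spark-rapids plugin and ParquetCachedBatchSerializer might look like the following sketch (the plugin class name is the standard spark-rapids one; everything else is assumed):

```shell
# Hypothetical spark-shell invocation; only the cache.serializer conf is
# confirmed by the report, the rest is a plausible spark-rapids setup.
spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer
```

Dropping the spark.sql.cache.serializer line falls back to Spark's default cache serializer while keeping the plugin enabled.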
I then remove --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer to disable ParquetCachedBatchSerializer.