[BUG] significant slow down with ParquetCachedBatchSerializer and pyspark CrossValidator #5975

Open
eordentlich opened this issue Jul 8, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@eordentlich
Contributor

Describe the bug
First observed when attempting to run pyspark's CrossValidator + VectorAssembler + the pyspark version of XGBoost under review in this PR: dmlc/xgboost#8020. Parts of this should fall back to the CPU due to the VectorUDT column injected by VectorAssembler. However, the running time of certain steps jumps from a few minutes to over an hour when ParquetCachedBatchSerializer is enabled versus disabled, with the spark-rapids plugin enabled in both cases. I attempted to reproduce this in a more self-contained manner with the code snippet below, which incorporates some of the relevant logic from CrossValidator and XGBoost.
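For reference, the pyspark workload where this was first observed had roughly the following shape. This is only a sketch of the pattern, not the actual code: a stock pyspark.ml LogisticRegression stands in for the XGBoost pyspark estimator from dmlc/xgboost#8020, the data and column names are made up, and it assumes a pyspark shell where spark is already defined.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import functions as F

# Made-up numeric feature columns plus a binary label.
df = (spark.range(0, 100000).toDF("id")
      .withColumn("f1", F.rand()).withColumn("f2", F.rand()).withColumn("f3", F.rand())
      .withColumn("label", (F.rand() > 0.5).cast("double")))

# VectorAssembler produces a VectorUDT column, which forces parts of the plan
# back onto the CPU when the spark-rapids plugin is enabled.
assembled = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features").transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# CrossValidator caches the per-fold train/validation splits internally, so with
# spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer those
# cached DataFrames go through the serializer; fit() is where the slowdown appears.
model = cv.fit(assembled)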

Steps/Code to reproduce bug

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Build a 10M-row DataFrame with three random feature columns.
val df = spark.range(0, 10000000).toDF("col_1")
val df2 = df.withColumn("rand1", rand()).withColumn("rand2", rand()).withColumn("rand3", rand())

// VectorAssembler injects a VectorUDT column, which causes CPU fallback.
val va = new VectorAssembler().setInputCols(Array("rand1", "rand2", "rand3")).setOutputCol("vector")
val df3 = va.transform(df2).withColumn("filter", rand()).filter($"filter" < 0.5)

// Cache, then force materialization through the cache serializer.
df3.cache()
val df4 = df3.repartition(2)
df4.count

In my environment, this bit of code takes a few seconds to run in spark-shell with ParquetCachedBatchSerializer disabled but almost 2 min when enabled.

Another issue with this example: if the line val df3 = ... is replaced with
val df3 = df2.withColumn("filter",rand()).filter($"filter" < 0.5) (i.e. no VectorUDT column added), an ArrayIndexOutOfBoundsException is thrown with ParquetCachedBatchSerializer enabled, while no error occurs with it disabled.

A pyspark version of the above example shows similar behavior.
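Such a pyspark version might look roughly like this (my sketch, assuming a pyspark shell where spark is defined; not the exact code that was run):

from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F

df = spark.range(0, 10000000).toDF("col_1")
df2 = (df.withColumn("rand1", F.rand())
         .withColumn("rand2", F.rand())
         .withColumn("rand3", F.rand()))

va = VectorAssembler(inputCols=["rand1", "rand2", "rand3"], outputCol="vector")

# With the VectorUDT column: very slow when ParquetCachedBatchSerializer is enabled.
df3 = va.transform(df2).withColumn("filter", F.rand()).filter(F.col("filter") < 0.5)

# Variant without the VectorUDT column; per the report above this instead hits an
# ArrayIndexOutOfBoundsException when ParquetCachedBatchSerializer is enabled:
# df3 = df2.withColumn("filter", F.rand()).filter(F.col("filter") < 0.5)

df3.cache()
df4 = df3.repartition(2)
df4.count()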

Expected behavior
A much smaller performance penalty with ParquetCachedBatchSerializer enabled in this example; that should also resolve the main issue encountered with pyspark CrossValidator.

Environment details (please complete the following information)

  • Environment location: Standalone, local server, single gpu
  • Spark configuration settings related to the issue:
$SPARK_HOME/bin/pyspark --master ${SPARK_URL} --deploy-mode client \
                            --driver-memory 10G --executor-memory 60G \
                            --num-executors 1 --executor-cores 12 \
                            --conf spark.cores.max=96 \
                            --conf spark.task.cpus=1 \
                            --conf spark.locality.wait=0 \
                            --conf spark.yarn.maxAppAttempts=1 \
                            --conf spark.sql.files.maxPartitionBytes=1024m \
                            --conf spark.task.resource.gpu.amount=0.08 \
                            --conf spark.executor.resource.gpu.amount=1 \
                            --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
                            --conf spark.executorEnv.CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
                            --conf spark.plugins=com.nvidia.spark.SQLPlugin \
                            --conf spark.rapids.sql.enabled=true \
                            --conf spark.sql.files.maxPartitionBytes=1G \
                            --conf spark.sql.shuffle.partitions=192 \
                            --conf spark.rapids.sql.explain=ALL \
                            --conf spark.rapids.sql.incompatibleOps.enabled=true \
                            --conf spark.rapids.sql.batchSizeBytes=512M \
                            --conf spark.rapids.sql.reader.batchSizeBytes=768M \
                            --conf spark.rapids.sql.rowBasedUDF.enabled=true \
                            --conf spark.rapids.sql.variableFloatAgg.enabled=true \
                            --conf spark.rapids.sql.hasNans=false \
                            --conf spark.rapids.memory.gpu.minAllocFraction=0.0001 \
                            --conf spark.rapids.memory.gpu.maxAllocFraction=0.5 \
                            --conf spark.rapids.memory.gpu.allocFraction=0.5 \
                            --conf spark.sql.adaptive.enabled=false \
                            --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
                            --conf spark.executorEnv.CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log --files $SPARK_HOME/examples/src/main/scripts/getGpusResources.sh --jars ${SPARK_RAPIDS_PLUGIN_JAR}

I then remove --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer to disable ParquetCachedBatchSerializer.

eordentlich added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Jul 8, 2022
@WeichenXu123

I guess it is probably an issue in ParquetCachedBatchSerializer?
Is it related to the xgboost pyspark integration code?

@eordentlich
Contributor Author

It is not specific to the xgboost pyspark code; I just happened to encounter the issue when trying that.

@WeichenXu123

> It is not specific to the xgboost pyspark code; I just happened to encounter the issue when trying that.

But I'm happy to see you tried the xgboost pyspark code.
If you find any performance issues, please report them to me.
Thanks!
