SPARK-48380: SerDeUtil.javaToPython to support batchSize parameter #46691

zshao9 · 2024-05-21T18:57:12Z

What changes were proposed in this pull request?

This PR adds a batchSize parameter to the underlying javaToPython function. The same parameter is already available on pairRDDToPython.

Why are the changes needed?

With this change, a PySpark program can choose a different batch size when the default AutoBatchedPickler fails with a 2GB-exceeded error, as described in the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-48380
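To illustrate why an explicit batch size can help, here is a minimal pure-Python sketch. It is not Spark's actual code. The helper names `fixed_batches` and `auto_batches` are hypothetical, and the doubling/halving heuristic with 1 MiB and 10 MiB thresholds is an assumption about how the JVM-side auto-batching behaves. The data set is made up: many tiny rows followed by a run of large ones, so the adaptive batch grows on the tiny rows and then packs many large rows into one payload.

```python
import pickle
from itertools import islice
from typing import Any, Iterable, Iterator


def fixed_batches(rows: Iterable[Any], batch_size: int) -> Iterator[bytes]:
    """Pickle rows in groups of batch_size (the last group may be smaller)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield pickle.dumps(chunk)


def auto_batches(
    rows: Iterable[Any], low: int = 1 << 20, high: int = 10 << 20
) -> Iterator[bytes]:
    """Adaptive batching (assumed heuristic): double the batch while payloads
    stay under `low` bytes, halve it once a payload exceeds `high` bytes."""
    it = iter(rows)
    batch = 1
    while True:
        chunk = list(islice(it, batch))
        if not chunk:
            return
        payload = pickle.dumps(chunk)
        if len(payload) < low:
            batch *= 2
        elif len(payload) > high and batch > 1:
            batch //= 2
        yield payload


# Hypothetical skewed input: 1000 tiny rows, then 64 distinct rows of ~100 KB each.
rows = list(range(1000)) + [bytes([i]) * 100_000 for i in range(64)]

largest_auto = max(len(p) for p in auto_batches(rows))
largest_one = max(len(p) for p in fixed_batches(rows, 1))

print("largest auto-batched payload:", largest_auto, "bytes")
print("largest payload with batch size 1:", largest_one, "bytes")
```

With batch size 1, no payload is larger than a single pickled row, whereas the adaptive strategy can bundle many large rows into one byte array after growing its batch on small rows. At production scale the same effect could push a single payload past the limit of a JVM byte array, which is the failure mode the ticket describes.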

Does this PR introduce any user-facing change?

This introduces a new way to obtain a Python RDD from a Python DataFrame:

# python
# The existing way to convert dataframe to RDD with auto batching (NO CHANGES)
df.rdd
# The new way to convert dataframe to RDD with batchSize of 1 (New API)
df._jdf.javaToPython(1)

How was this patch tested?

mvn test
(still testing)

Was this patch authored or co-authored using generative AI tooling?

No.

core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala

HyukjinKwon · 2024-05-21T23:55:46Z

Let's:

1. Create a PR against the master branch
2. Add a test case
3. Fix the PR title (see also https://spark.apache.org/contributing.html)

zshao9 · 2024-05-22T03:51:07Z

Will move to #46697

SPARK-48380: SerDeUtil.javaToPython to support batchSize parameter

2a2f070

github-actions bot added SQL CORE PYTHON labels May 21, 2024

sunchao reviewed May 21, 2024

zshao9 closed this May 22, 2024
