SPARK-48380: SerDeUtil.javaToPython to support batchSize parameter #46691

Closed
wants to merge 1 commit into from


@zshao9 zshao9 commented May 21, 2024

What changes were proposed in this pull request?

This simple PR adds a batchSize parameter to the underlying javaToPython function. The same parameter is already available on pairRDDToPython.

Why are the changes needed?

With this change, a PySpark program can choose a different batch size when the default AutoBatchedPickler fails with a "2GB exceeded" error, as described in the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-48380
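
To illustrate the motivation, here is a minimal pure-Python sketch contrasting adaptive batching with a fixed batch size. It is a simplified model and not Spark's Scala implementation: the function names `fixed_batched_pickle` and `auto_batched_pickle` are hypothetical, and the 1 MB and 10 MB thresholds are assumptions about the adaptive strategy, exposed as parameters so they can be changed.

```python
import pickle
from itertools import islice
from typing import Any, Iterable, Iterator


def fixed_batched_pickle(rows: Iterable[Any], batch_size: int) -> Iterator[bytes]:
    """Pickle rows in lists of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch)


def auto_batched_pickle(
    rows: Iterable[Any],
    low: int = 1 << 20,        # assumed lower target: 1 MB per payload
    high: int = 10 << 20,      # assumed upper target: 10 MB per payload
) -> Iterator[bytes]:
    """Pickle rows in batches whose size adapts to the previous payload size.

    The batch size doubles while payloads stay under `low` and halves when a
    payload exceeds `high`. The adjustment only happens after a batch has been
    serialized, so a run of small rows followed by very large rows can still
    produce one oversized payload.
    """
    it = iter(rows)
    batch_size = 1
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        payload = pickle.dumps(batch)
        if len(payload) < low:
            batch_size *= 2
        elif len(payload) > high and batch_size > 1:
            batch_size //= 2
        yield payload
```

With ten small rows, the adaptive version groups them into batches of 1, 2, 4, and 3 rows, while `fixed_batched_pickle(rows, 1)` emits exactly one payload per row. A fixed batch size of 1 trades throughput for a bounded payload size, which is the option this PR aims to expose.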

Does this PR introduce any user-facing change?

This introduces a new way to access the Python RDD from a Python DataFrame:

# python
# The existing way to convert dataframe to RDD with auto batching (NO CHANGES)
df.rdd
# The new way to convert dataframe to RDD with batchSize of 1 (New API)
df._jdf.javaToPython(1)

How was this patch tested?

mvn test
(still testing)

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Member

Let's:

  1. Create a PR against the master branch
  2. Add a test case
  3. Fix the PR title (see also https://spark.apache.org/contributing.html)

@zshao9
Author

zshao9 commented May 22, 2024

Will move to #46697

@zshao9 zshao9 closed this May 22, 2024