
Expose some Parquet per-column configuration options via the python API #15613

Merged — 47 commits into rapidsai:branch-24.06 on May 22, 2024

Conversation

@etseidl (Contributor) commented Apr 29, 2024

Description

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was suggested that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a list<int32> column 'b', the fully qualified column names would be 'a' and 'b.list.element'.
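For illustration, a minimal sketch of what the new per-column options could look like from Python. Only `column_encoding` is named verbatim later in this thread; the `skip_compression` keyword and the exact dict/set shapes here are assumptions about the API this PR adds:

```python
import cudf

df = cudf.DataFrame({
    "a": [1, 2, 3],              # integer column -> qualified name 'a'
    "b": [[1, 2], [3], [4, 5]],  # list<int32> column -> leaf is 'b.list.element'
})

# Per-column options are keyed by fully qualified Parquet column names.
df.to_parquet(
    "out.parquet",
    column_encoding={
        "a": "DELTA_BINARY_PACKED",
        "b.list.element": "PLAIN",
    },
    skip_compression={"b.list.element"},  # assumed keyword: leave this leaf uncompressed
)
```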

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot (bot) commented Apr 29, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions (bot) added labels: libcudf (Affects libcudf (C++/CUDA) code.), cuDF (Python) (Affects Python cuDF API.) — Apr 29, 2024
@etseidl (Contributor, Author) commented Apr 29, 2024

@GregoryKimball does this do what you wanted?

@vuule added the 3 - Ready for Review (Ready for review by team) label — May 10, 2024
@etseidl (Contributor, Author) commented May 10, 2024

> I think we would benefit from some tests using the write_parquet function and the ParquetWriter to write data, and then using the pyarrow parquet reader to read it back and verify the correct encodings for each column.

@mhaseeb123 I added some parameterization to at least verify that all the valid Parquet encodings work on the write side. There is already testing elsewhere to verify encoding interoperability with pyarrow/pandas.[1] I also added a check that the column that should still be compressed actually is.

The option set for ParquetWriter has fallen far behind the DataFrame.to_parquet() path. Bringing the former up to date is beyond the scope of this PR, IMHO.

[1] With the exception of BYTE_STREAM_SPLIT, which was only recently fully implemented in arrow 16. This too can be addressed in a follow up PR.
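As a reference for the testing approach quoted above, a sketch of the write/read-back check, assuming the `column_encoding` keyword added by this PR; the pyarrow side uses standard footer-metadata accessors:

```python
import cudf
import pyarrow.parquet as pq

df = cudf.DataFrame({"a": [1, 2, 3], "s": ["aaa", "aab", "abc"]})
df.to_parquet(
    "encodings.parquet",
    column_encoding={"a": "DELTA_BINARY_PACKED", "s": "DELTA_BYTE_ARRAY"},
)

# Read the footer back with pyarrow and inspect the encodings actually written.
meta = pq.ParquetFile("encodings.parquet").metadata
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings)
    # e.g. assert "DELTA_BINARY_PACKED" in col.encodings for column 'a'
```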

@mhaseeb123 (Member) left a comment:

Partial review. Just a minor comment. Looks good otherwise

@mhaseeb123 (Member) left a comment:

Looks good to me!

@mhaseeb123 (Member) commented:

/ok to test

@vyasr (Contributor) left a comment:

Overall looks great, thanks! One change to make then this should be good to go.

Review thread on python/cudf/cudf/_lib/pylibcudf/libcudf/io/types.pxd (outdated; resolved)
@etseidl requested a review from vyasr — May 14, 2024 22:46
@vuule (Contributor) commented May 16, 2024

/ok to test

@GregoryKimball (Contributor) commented May 20, 2024

@galipremsagar Would you please check how cudf.pandas runs when column_encoding is specified?

The user command would be something like df.to_parquet(column_encoding='DELTA_BYTE_ARRAY').

By the way, what would cudf.pandas do if the user specified an engine, such as:
df.to_parquet(engine='pyarrow', column_encoding='DELTA_BYTE_ARRAY')?

@galipremsagar (Contributor) commented:

> @galipremsagar Would you please check how cudf.pandas runs when column_encoding is specified? The user command would be something like df.to_parquet(column_encoding='DELTA_BYTE_ARRAY').

This will invoke the cudf parquet writer.

> By the way, what would cudf.pandas do if the user specified an engine such as: df.to_parquet(engine='pyarrow', column_encoding='DELTA_BYTE_ARRAY')?

This will invoke pyarrow's parquet writer through cudf code, but ignore column_encoding.
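To make the two dispatch paths concrete, a small hedged sketch (run under cudf.pandas, e.g. `python -m cudf.pandas script.py`; the dict form of `column_encoding` is an assumption):

```python
import pandas as pd  # proxied by cudf.pandas when the accelerator is loaded

df = pd.DataFrame({"s": ["aaa", "aab", "abc"]})

# No engine specified: dispatches to the cudf parquet writer,
# so the per-column encoding is honored.
df.to_parquet("gpu.parquet", column_encoding={"s": "DELTA_BYTE_ARRAY"})

# engine='pyarrow': routed through cudf code to pyarrow's writer;
# per the answer above, column_encoding is ignored on this path.
df.to_parquet("cpu.parquet", engine="pyarrow",
              column_encoding={"s": "DELTA_BYTE_ARRAY"})
```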

@galipremsagar (Contributor) commented:

/okay to test

@vyasr (Contributor) left a comment:

LGTM

@vuule (Contributor) commented May 22, 2024

/merge

@rapids-bot (bot) merged commit b4ce6e4 into rapidsai:branch-24.06 — May 22, 2024
70 checks passed
@etseidl deleted the python_metadata branch — May 22, 2024 23:07
Labels: 3 - Ready for Review (Ready for review by team), cuDF (Python) (Affects Python cuDF API.), cuIO (cuIO issue), feature request (New feature or request), non-breaking (Non-breaking change)