Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DurationType in cudf parquet reader via arrow:schema #15617

Merged

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Apr 30, 2024

Description

This PR adds the support for reading and using the arrow:schema struct from the serialized arrow:ipc message written at the key-value metadata section of the Parquet file with ARROW:schema key. This allows cudf to read and interop with arrow for non-standard parquet types (DurationType in this PR).

Arrow uses Google flatbuffers (inside Schema.fbs) to serialize the arrow:Schema structure (containing column descriptors) and puts it (padded for 8 byte alignment) into the header of an empty ipc:Message (also a flatbuffer-serialized structure inside Message.fbs). The ipc:Message is prepended with two integers containing a validity message and the size of the header (the arrow:Schema + padding). The final message is endoded as a base64 string and written to Parquet file footer key-value metadata using "ARROW:schema" key.

In this PR, we base64-decode the ipc:Message, then we decode the validity message and the header size, and offset pointers to the arrow:Schema flatbuffer. We then use Flatbuffer structs to walk the arrow:Schema and collect information on columns of interest as an unordered_map (using column name as key). This unordered_map is used inside select_columns function to build cudf Table columns and get the correct dtype.

Closes #13410

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 30, 2024
@mhaseeb123
Copy link
Member Author

CC: @GregoryKimball @vuule

@mhaseeb123 mhaseeb123 added cuIO cuIO issue 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function breaking Breaking change Reliability labels Apr 30, 2024
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early comments. That's a lot of code for one data type 😄. Do we want to include all the google code, or add a library dependency? Or follow the thrift/protobuf model and roll our own parser?

Have you an idea of how long it takes to parse the schema? I'm wondering if it's better to make it optional (like add a use_arrow_schema reader option).

cpp/src/io/parquet/reader_impl.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_helpers.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Github ate some of my comments ☹️ A note about the column naming.

cpp/src/io/parquet/reader_impl_helpers.cpp Outdated Show resolved Hide resolved
@mhaseeb123
Copy link
Member Author

mhaseeb123 commented Apr 30, 2024

Some early comments. That's a lot of code for one data type 😄. Do we want to include all the google code, or add a library dependency? Or follow the thrift/protobuf model and roll our own parser?

Have you an idea of how long it takes to parse the schema? I'm wondering if it's better to make it optional (like add a use_arrow_schema reader option).

Thank you for looking at this @etseidl. I am also thinking of making it optional as well in the updates I am working on. I am also thinking of removing all the pushed flatbuffer code and adding it as dependency instead.

@mhaseeb123 mhaseeb123 changed the title Support DurationType in cudf parquet via arrow:schema Support DurationType in cudf parquet reader via arrow:schema May 1, 2024
@github-actions github-actions bot added the cuDF (Python) Affects Python cuDF API. label May 1, 2024
@github-actions github-actions bot added the CMake CMake build issue label May 2, 2024
@mhaseeb123
Copy link
Member Author

Thank you for the helpful input @etseidl and @galipremsagar. I am going to remove the use_arrow_schema from the public Python APIs as I also agree keeping our args consistent with other readers. I will keep a default=True use_arrow_schema option in the Cython code in case we need to expose it at a later point.

@galipremsagar
Copy link
Contributor

Thank you for the helpful input @etseidl and @galipremsagar. I am going to remove the use_arrow_schema from the public Python APIs as I also agree keeping our args consistent with other readers. I will keep a default=True use_arrow_schema option in the Cython code in case we need to expose it at a later point.

Sounds good to me 👍 Thanks for helping me understand @etseidl !

@mhaseeb123
Copy link
Member Author

@mroeschke

Thank you for the helpful input @etseidl and @galipremsagar. I am going to remove the use_arrow_schema from the public Python APIs as I also agree keeping our args consistent with other readers. I will keep a default=True use_arrow_schema option in the Cython code in case we need to expose it at a later point.

@mroeschke Apologies for the back and forth but since we have decided to remove the option from Python side altogether (above discussion) , I am going to remove the 2nd part of the tests (which was using the assert_neq) altogether.

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just do this.

python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/parquet.pyx Outdated Show resolved Hide resolved
mhaseeb123 and others added 2 commits May 13, 2024 19:19
@mhaseeb123 mhaseeb123 requested a review from mroeschke May 14, 2024 02:24
@galipremsagar
Copy link
Contributor

/okay to test

@mhaseeb123 mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 14, 2024
@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123 mhaseeb123 added 4 - Needs Review Waiting for reviewer to review or respond and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels May 14, 2024
.pre-commit-config.yaml Outdated Show resolved Hide resolved
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@galipremsagar
Copy link
Contributor

/okay to test

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond labels May 14, 2024
@mhaseeb123
Copy link
Member Author

/ok to test

@mhaseeb123
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit c5c95b7 into rapidsai:branch-24.06 May 15, 2024
75 checks passed
@mhaseeb123 mhaseeb123 deleted the arrow-schema-support-pq-reader branch May 15, 2024 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change CMake CMake build issue cuDF (Python) Affects Python cuDF API. cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Reliability
Projects
Status: Done
Status: Landed
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet reader unable to read duration types written by pyarrow
7 participants