[Rust] Parquet data source does not support complex types #83

Closed
alamb opened this issue Apr 26, 2021 · 12 comments
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-4863

Once ARROW-4466 is merged, I would like to add support for reading parquet files that contain LIST and STRUCT.

 

alamb added the datafusion label on Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Wes McKinney(wesm) @ 2019-03-14T22:28:39.984+0000:

This is a fairly tricky task (we still don't have this fully done in C++). I'm moving to 0.14 as I expect it to take a little time

Comment from Neville Dipale(nevi_me) @ 2020-11-28T12:41:50.557+0000:

[~andygrove] I'm going through old PRs and closing them. The writer will support nested types to our heart's content, would we need to do anything further to enable this in DataFusion, or can we close this?

Comment from Andy Grove(andygrove) @ 2020-11-28T17:42:28.962+0000:

Thanks [~nevi_me] I filed https://issues.apache.org/jira/browse/ARROW-10761 for the work we need to do in DataFusion

Comment from Andrew Lamb(alamb) @ 2021-04-26T11:23:22.697+0000:

Migrated to github: https://github.com/apache/arrow-rs/issues/39

@Igosuki
Contributor

Igosuki commented Aug 22, 2021

Started hacking here https://github.com/Igosuki/arrow-datafusion/tree/map_access
It works for arrays; I haven't gotten around to doing the physical plan for dictionaries yet because of generics.

@lexi-sh

lexi-sh commented Nov 25, 2021

Is it expected that DataFusion currently cannot read Parquet files with nested objects at all, even if the nested column is never used? When I try to read a Parquet file that has a nested object, I get an error: arrow_reader::get_schema returns a num_fields that counts each nested object as a single field, while row_groups.columns is flattened and has more columns than num_fields reports.

Should the ability to read Parquet files with nested objects be implemented (only panicking if a transformation uses that field), or would it be better to just work on this issue as a whole?
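To make the mismatch described above concrete, here is a minimal sketch using a toy schema model. `ToyType`, `leaf_columns`, `field_and_leaf_counts`, and `example_schema` are hypothetical names invented for this illustration; they are not part of arrow-rs or the parquet crate, and the real reader code is more involved. The sketch only assumes that each primitive leaf of a nested type becomes its own flattened column.

```rust
/// Toy stand-in for a nested schema type (hypothetical; not the arrow-rs `DataType`).
#[derive(Debug, Clone)]
enum ToyType {
    Primitive,
    List(Box<ToyType>),
    Struct(Vec<(String, ToyType)>),
}

/// Number of flattened leaf columns that a type expands into.
fn leaf_columns(t: &ToyType) -> usize {
    match t {
        ToyType::Primitive => 1,
        ToyType::List(item) => leaf_columns(item),
        ToyType::Struct(fields) => fields
            .iter()
            .map(|(_, f)| leaf_columns(f))
            .sum::<usize>(),
    }
}

/// Returns (top-level field count, flattened leaf-column count) for a schema.
fn field_and_leaf_counts(schema: &[(String, ToyType)]) -> (usize, usize) {
    let leaves: usize = schema.iter().map(|(_, t)| leaf_columns(t)).sum();
    (schema.len(), leaves)
}

/// Hypothetical example: `id`, plus a struct `user { name, tags: list<primitive> }`.
fn example_schema() -> Vec<(String, ToyType)> {
    vec![
        ("id".to_string(), ToyType::Primitive),
        (
            "user".to_string(),
            ToyType::Struct(vec![
                ("name".to_string(), ToyType::Primitive),
                ("tags".to_string(), ToyType::List(Box::new(ToyType::Primitive))),
            ]),
        ),
    ]
}
```

In this toy model, `field_and_leaf_counts(&example_schema())` yields 2 top-level fields but 3 leaf columns, so any code that indexes the flattened columns by top-level field position runs out of sync as soon as one field is nested.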

@alamb
Contributor Author

alamb commented Nov 27, 2021

> Should the ability to read Parquet files with nested objects be implemented (only panicking if a transformation uses that field)

This seems like a valuable addition to me (allowing queries on Parquet files that contain nested objects, as long as the nested columns are not read).

> or would it be better to just work on this issue as a whole?

Well, of course, supporting queries on the data would be better than just not crashing/erroring when the nested columns aren't read :) I think the choice of approach is probably best determined by whoever implements this feature.

@houqp
Member

houqp commented Dec 2, 2021

Perhaps we could use #1383 to track the issue of handling stats for nested types in Parquet. As for queries on nested fields, didn't @Igosuki already add support for this? What is left to be done here?

@houqp
Member

houqp commented Dec 2, 2021

Perhaps one of the remaining items would be supporting nested columns in physical_plan::Statistics.

@Igosuki
Contributor

Igosuki commented Dec 2, 2021

The indexed map access code will work on the plan, so the only thing the parquet reader has to do is deserialize nested structures recursively.
As long as there is a struct for string keys, or a list for int keys, at the corresponding level, it will return the proper column.
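The access rule described above (a struct for string keys, a list for int keys, applied level by level) can be sketched with a toy value model. `Value`, `Key`, `resolve`, and `sample` are hypothetical names made up for this example; this is not the DataFusion implementation, which operates on Arrow arrays in the physical plan rather than on single values.

```rust
/// Toy nested value (hypothetical; not an arrow-rs array type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
    List(Vec<Value>),
    Struct(Vec<(String, Value)>),
}

/// One step of an access path: a string key or an integer index.
#[derive(Debug, Clone, PartialEq)]
enum Key {
    Name(String),
    Index(usize),
}

/// Walks `path` through `root`. Returns `None` on a missing key, an
/// out-of-range index, or a key kind that does not match the container
/// found at that level (for example an index applied to a struct).
fn resolve<'a>(root: &'a Value, path: &[Key]) -> Option<&'a Value> {
    let mut current = root;
    for key in path {
        current = match (current, key) {
            // String keys select a named field of a struct.
            (Value::Struct(fields), Key::Name(name)) => {
                fields.iter().find(|(n, _)| n == name).map(|(_, v)| v)?
            }
            // Integer keys select an element of a list.
            (Value::List(items), Key::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(current)
}

/// Hypothetical sample row: `{ id: 7, tags: ["a", "b"] }`.
fn sample() -> Value {
    Value::Struct(vec![
        ("id".to_string(), Value::Int(7)),
        (
            "tags".to_string(),
            Value::List(vec![
                Value::Text("a".to_string()),
                Value::Text("b".to_string()),
            ]),
        ),
    ])
}
```

With this sketch, resolving the path `tags`, then index `1`, against `sample()` reaches the text value `"b"`, while applying an index directly to the top-level struct returns `None`.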

@houqp I see that support was added in parquet2 (jorgecarleitao/parquet2#64; I have not tested the arrow2 branch yet), so it's only a matter of adding it to the reader.
As for arrow-rs, it looks like it's still not done: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader.rs#L803

@tustvold
Contributor

I believe this issue can now be closed: as of apache/arrow-rs#2500, parquet has full support for arbitrarily nested types. Feel free to reopen if I have missed something.

@alamb
Contributor Author

alamb commented Oct 20, 2022

@ShraddhaKishan

I believe it still cannot process everything. I was reading a parquet file through ArrowRecordBatchReader and when trying to collect to Vec<RecordBatch> I still get the error that says data type Json not supported in nested map for json writer.

I looked into it further and found that the error originates in arrow-json/src/writer.rs:351, where we compare the data type of the keys with Utf8 and subsequently return an error.
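The kind of check described above can be sketched roughly as follows. This is a toy model, not the actual arrow-json code: `KeyType` and `check_map_key_type` are hypothetical names, and the sketch assumes the error is returned whenever the map's key type is anything other than Utf8. The error text mirrors the wording quoted in the report.

```rust
/// Toy stand-in for the data type of a map's keys (hypothetical; not arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum KeyType {
    Utf8,
    Int32,
    Binary,
}

/// Accepts only Utf8 keys, since JSON object keys must be strings;
/// any other key type is rejected with an error message.
fn check_map_key_type(key_type: &KeyType) -> Result<(), String> {
    if *key_type != KeyType::Utf8 {
        return Err(format!(
            "data type {:?} not supported in nested map for json writer",
            key_type
        ));
    }
    Ok(())
}
```

Under this assumption, a map keyed by strings passes the check, while a map keyed by integers or binary values produces the error, which would explain why some nested Parquet files can be read but not serialized to JSON.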

@alamb
Contributor Author

alamb commented Apr 5, 2023

Thank you for the report @ShraddhaKishan -- would it be possible to file a ticket in https://github.com/apache/arrow-rs with a reproducer (or at least the parquet file that can not be read)?

@ShraddhaKishan

Sure thing.
