Cannot read Parquet files that do not specify Map keys as required #5606
Comments
It was hard to say whether this should be regarded as a bug or feature request. It's a bug from the perspective that we'd expect broad compatibility.
The conclusion of apache/arrow#37389 appears to be that we are correct to refuse to read such malformed files. Am I missing something here?
It was discussed, but I don't think that was the conclusion. The creator's issue was resolved by rewriting a file. In order to operate with precious Parquet files from huge data lakes (e.g. DataFusion probably would want to support files produced by other systems), I'm of the opinion that it should tolerate this like most of the other implementations do (e.g. DuckDB, parquet-tools, and probably many more). I'm all for correctness, but in this particular case you need to consider the intention and purpose. There is no way that an optional key can be intentional. Being compatible with a vast amount of data is the purpose of Parquet integration.
We could probably just ignore the malformed map logical type and decode such columns as a regular list of structs. This would allow the data to be read, without needing to implement custom dremel shredding logic to handle the case of malformed MapArray, and allowing users to determine how they wish to handle this situation. Tagging @mapleFU who may have further thoughts on this.
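To illustrate what "letting users decide" could look like once such a column is decoded as a list of structs, here is a minimal, std-only sketch. It does not use the arrow-rs API: `Entry` and `entries_to_map` are hypothetical names, and the key and value types (`String`, `i64`) are assumed purely for illustration. Each row is modelled as a slice of entries whose key is nullable because the file declared it optional, and the caller chooses to reject null keys.

```rust
use std::collections::HashMap;

/// One decoded key/value entry (hypothetical type, not part of arrow-rs).
/// The key is an `Option` because the file's schema declared it optional.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub key: Option<String>,
    pub value: Option<i64>,
}

/// Convert one row (a list of entries) into a map.
/// This caller's policy is to reject null keys; another caller could
/// skip them or substitute a default instead.
pub fn entries_to_map(row: &[Entry]) -> Result<HashMap<String, Option<i64>>, String> {
    let mut map = HashMap::with_capacity(row.len());
    for (i, entry) in row.iter().enumerate() {
        match &entry.key {
            Some(k) => {
                map.insert(k.clone(), entry.value);
            }
            None => return Err(format!("null map key at entry {}", i)),
        }
    }
    Ok(map)
}
```

With this shape, a well-formed row converts cleanly, and a row that really does contain a null key surfaces as an error at the point where the user builds the map rather than at schema conversion time.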
With the check removed or relaxed, reading works for all files I tested. I have not been able to produce any files that have invalid data to match such a schema, but I'd assume it would error as it would for any invalid data.
Describe the bug
The Parquet Format specifies that:
I.e.
However, most implementations of the format do not appear to enforce this in the Thrift schema, and producers such as Hive/Spark/Presto/Trino/AWS Athena do not mark map keys as required in the Parquet files they write. A huge number of such files exist on data lakes everywhere, and rewriting them in order to comply with this does not seem feasible.
To Reproduce
Results in:
Expected behavior
Map keys are assumed to be required, regardless of explicit specification in the Thrift schema, and data is read accordingly.
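As a rough sketch of that behavior, the following toy model contrasts the current strict check with a lenient policy that treats an optional key as required. It is not the code in arrow-rs: `Repetition`, `MapKeyPolicy`, and `effective_key_repetition` are hypothetical names defined here only to make the rule concrete, and rejecting repeated keys under both policies is an assumption of the sketch.

```rust
/// Field repetition as declared in a Parquet schema (simplified model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// How a reader treats a map key that is not declared required (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MapKeyPolicy {
    /// Reject the schema, as the enforced check does today.
    Strict,
    /// Treat an optional key as if it had been declared required.
    AssumeRequired,
}

/// Resolve the repetition the reader should use for a map key.
pub fn effective_key_repetition(
    declared: Repetition,
    policy: MapKeyPolicy,
) -> Result<Repetition, String> {
    match (declared, policy) {
        (Repetition::Required, _) => Ok(Repetition::Required),
        (Repetition::Optional, MapKeyPolicy::AssumeRequired) => Ok(Repetition::Required),
        (Repetition::Optional, MapKeyPolicy::Strict) => {
            Err("map keys must be required".to_string())
        }
        // Assumption: a repeated key is never meaningful, so both policies reject it.
        (Repetition::Repeated, _) => Err("map keys cannot be repeated".to_string()),
    }
}
```

Under `AssumeRequired`, a file that truly contains a null key would still be expected to fail later, during decoding, like any other data that contradicts its schema.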
Additional context
This has come up in a PyArrow issue: apache/arrow#37389
Enforced here:
arrow-rs/parquet/src/arrow/schema/complex.rs
Lines 289 to 291 in 9fda7ea