Cannot read Parquet files that do not specify Map keys as required #5606

Closed · jupiter opened this issue Apr 8, 2024 · 5 comments · Fixed by #5630
Labels: question (Further information is requested)

Comments

jupiter (Contributor) commented Apr 8, 2024

Describe the bug

The Parquet Format specifies that:

The key field encodes the map's key type. This field must have repetition required and must always be present.

I.e.

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

However, most implementations of the format do not appear to enforce this in the Thrift schema, and producers such as Hive/Spark/Presto/Trino/AWS Athena do not produce Parquet files like this. A huge number of such files exist in data lakes everywhere, and rewriting them to comply does not seem feasible.
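
For illustration, the shape such producers typically emit declares the key as optional rather than required (the angle-bracketed names are placeholders, as in the snippet above):

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    optional <key-type> key;
    <value-repetition> <value-type> value;
  }
}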

To Reproduce

        let url_str = "https://overturemaps-us-west-2.s3.us-west-2.amazonaws.com/release/2023-07-26-alpha.0/theme%3Dtransportation/type%3Dconnector/20230726_134827_00007_dg6b6_01b086fc-f35b-487c-8d4e-5cdbbdc1785d";
        let url = Url::parse(url_str).unwrap();
        let storage_container = Arc::new(HttpBuilder::new().with_url(url).build().unwrap());
        let location = Path::from("");
        let meta = storage_container.head(&location).await.unwrap();
        let mut reader = ParquetObjectReader::new(storage_container, meta);

        // Parquet schema can be printed
        let mut p_metadata = reader.get_metadata().await.unwrap();
        print_schema(&mut std::io::stdout(), p_metadata.file_metadata().schema());

        // Metadata cannot be loaded
        let metadata = ArrowReaderMetadata::load_async(&mut reader, Default::default())
            .await
            .unwrap();

Results in:

thread 'main' panicked at src/main.rs:79:10:
called `Result::unwrap()` on an `Err` value: ArrowError("Map keys must be required")

Expected behavior

Map keys are assumed to be required, regardless of whether the Thrift schema explicitly marks them as such, and the data is read accordingly.

Additional context

This has come up in a PyArrow issue: apache/arrow#37389
Enforced here:

if map_key.get_basic_info().repetition() != Repetition::REQUIRED {
    return Err(arrow_err!("Map keys must be required"));
}
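
The same error can be reproduced without the remote file by converting a hand-written schema whose map key is declared optional (a minimal sketch; the column names and string types are just for illustration):

use std::sync::Arc;

use parquet::arrow::parquet_to_arrow_schema;
use parquet::schema::parser::parse_message_type;
use parquet::schema::types::SchemaDescriptor;

fn main() {
    // A map whose key is (incorrectly) declared optional, mimicking what
    // Hive/Spark/Trino-produced files contain.
    let message = "
        message schema {
            optional group my_map (MAP) {
                repeated group key_value {
                    optional binary key (UTF8);
                    optional binary value (UTF8);
                }
            }
        }";
    let parquet_schema = SchemaDescriptor::new(Arc::new(parse_message_type(message).unwrap()));

    // Currently this fails with "Map keys must be required"; a more lenient
    // reader would instead return a usable Arrow schema.
    match parquet_to_arrow_schema(&parquet_schema, None) {
        Ok(schema) => println!("converted: {schema:?}"),
        Err(e) => println!("rejected: {e}"),
    }
}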

jupiter added the bug label Apr 8, 2024
jupiter (Contributor, Author) commented Apr 8, 2024

It was hard to say whether this should be regarded as a bug or feature request. It's a bug from the perspective that we'd expect broad compatibility.

tustvold (Contributor) commented Apr 8, 2024

The conclusion of apache/arrow#37389 appears to be that we are correct to refuse to read such malformed files, am I missing something here?

tustvold added the question label and removed the bug label Apr 8, 2024
jupiter (Contributor, Author) commented Apr 8, 2024

It was discussed, but I don't think that was the conclusion. The creator's issue was resolved by rewriting a file.

In order to work with the precious Parquet files already sitting in huge data lakes (e.g. DataFusion would presumably want to support files produced by other systems), I'm of the opinion that it should tolerate this, like most of the other implementations do (e.g. DuckDB, parquet-tools, and probably many more).

I'm all for correctness, but in this particular case you need to consider the intention and purpose.

There is no way that an optional key can be intentional. Being compatible with a vast amount of data is the purpose of Parquet integration.

Also consider: https://en.wikipedia.org/wiki/Robustness_principle#:~:text=In%20computing%2C%20the%20robustness%20principle,what%20you%20accept%20from%20others%22.

In other words, programs that send messages to other machines (or to other programs on the same machine) should conform completely to the specifications, but programs that receive messages should accept non-conformant input as long as the meaning is clear.

tustvold (Contributor) commented Apr 8, 2024

We could probably just ignore the malformed map logical type and decode such columns as a regular list of structs. This would allow the data to be read without needing to implement custom Dremel shredding logic to handle malformed MapArray, and would let users determine how they wish to handle this situation.
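
For illustration, a sketch of the Arrow shape such a fallback might expose (arrow_schema types; the string key/value types and field names are just assumptions for the example):

use std::sync::Arc;

use arrow_schema::{DataType, Field, Fields};

fn main() {
    // Instead of DataType::Map, a malformed map column could surface as a plain
    // list of key/value structs, keeping the key's (incorrect) nullability as
    // declared in the file, so no repetition-level rewriting is needed.
    let entries = DataType::Struct(Fields::from(vec![
        Field::new("key", DataType::Utf8, true), // optional in the file, so nullable here
        Field::new("value", DataType::Utf8, true),
    ]));
    let fallback = Field::new(
        "key_value",
        DataType::List(Arc::new(Field::new("element", entries, false))),
        true,
    );
    println!("{fallback:?}");
}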

Tagging @mapleFU who may have further thoughts on this

jupiter (Contributor, Author) commented Apr 11, 2024

Removing/reducing the check works with all the files I tested. I have not been able to produce a file whose data is actually invalid under such a schema (e.g. a genuinely null key), but I'd assume reading it would error as it would for any other invalid data.
