Cannot read Parquet files that do not specify Map keys as required #5606

Closed · jupiter opened this issue Apr 8, 2024 · 5 comments · Fixed by #5630
Labels: question (Further information is requested)

Comments

jupiter (Contributor) commented Apr 8, 2024

Describe the bug

The Parquet Format specifies that:

The key field encodes the map's key type. This field must have repetition required and must always be present.

I.e.

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

However, most implementations of the format do not appear to enforce this in the Thrift schema, and producers such as Hive/Spark/Presto/Trino/AWS Athena do not produce Parquet files like this. A huge number of such files exist in data lakes everywhere, and rewriting them to comply does not seem feasible.
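
For illustration, the shape such producers typically emit declares the key as optional rather than required (the angle-bracketed names are placeholders, as in the snippet above):

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    optional <key-type> key;
    <value-repetition> <value-type> value;
  }
}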

To Reproduce

        let url_str = "https://overturemaps-us-west-2.s3.us-west-2.amazonaws.com/release/2023-07-26-alpha.0/theme%3Dtransportation/type%3Dconnector/20230726_134827_00007_dg6b6_01b086fc-f35b-487c-8d4e-5cdbbdc1785d";
        let url = Url::parse(url_str).unwrap();
        let storage_container = Arc::new(HttpBuilder::new().with_url(url).build().unwrap());
        let location = Path::from("");
        let meta = storage_container.head(&location).await.unwrap();
        let mut reader = ParquetObjectReader::new(storage_container, meta);

        // Parquet schema can be printed
        let mut p_metadata = reader.get_metadata().await.unwrap();
        print_schema(&mut std::io::stdout(), p_metadata.file_metadata().schema());

        // Metadata cannot be loaded
        let metadata = ArrowReaderMetadata::load_async(&mut reader, Default::default())
            .await
            .unwrap();

Results in:

thread 'main' panicked at src/main.rs:79:10:
called `Result::unwrap()` on an `Err` value: ArrowError("Map keys must be required")

Expected behavior

Map keys are assumed to be required, regardless of whether the Thrift schema explicitly marks them as such, and the data is read accordingly.

Additional context

This has come up in a PyArrow issue: apache/arrow#37389
Enforced here:

if map_key.get_basic_info().repetition() != Repetition::REQUIRED {
    return Err(arrow_err!("Map keys must be required"));
}
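
The same error can be reproduced without the remote file by converting a hand-written schema whose map key is declared optional (a minimal sketch; the column names and string types are just for illustration):

use std::sync::Arc;

use parquet::arrow::parquet_to_arrow_schema;
use parquet::schema::parser::parse_message_type;
use parquet::schema::types::SchemaDescriptor;

fn main() {
    // A map whose key is (incorrectly) declared optional, mimicking what
    // Hive/Spark/Trino-produced files contain.
    let message = "
        message schema {
            optional group my_map (MAP) {
                repeated group key_value {
                    optional binary key (UTF8);
                    optional binary value (UTF8);
                }
            }
        }";
    let parquet_schema = SchemaDescriptor::new(Arc::new(parse_message_type(message).unwrap()));

    // Currently this fails with "Map keys must be required"; a more lenient
    // reader would instead return a usable Arrow schema.
    match parquet_to_arrow_schema(&parquet_schema, None) {
        Ok(schema) => println!("converted: {schema:?}"),
        Err(e) => println!("rejected: {e}"),
    }
}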

jupiter added the bug label Apr 8, 2024
jupiter (Contributor, Author) commented Apr 8, 2024

It was hard to say whether this should be regarded as a bug or feature request. It's a bug from the perspective that we'd expect broad compatibility.

tustvold (Contributor) commented Apr 8, 2024

The conclusion of apache/arrow#37389 appears to be that we are correct to refuse to read such malformed files, am I missing something here?

tustvold added the question label and removed the bug label Apr 8, 2024
jupiter (Contributor, Author) commented Apr 8, 2024

It was discussed, but I don't think that was the conclusion. The creator's issue was resolved by rewriting a file.

In order to work with the precious Parquet files already sitting in huge data lakes (e.g. DataFusion would presumably want to support files produced by other systems), I'm of the opinion that it should tolerate this, like most of the other implementations do (e.g. DuckDB, parquet-tools, and probably many more).

I'm all for correctness, but in this particular case you need to consider the intention and purpose.

There is no way that an optional key can be intentional. Being compatible with a vast amount of data is the purpose of Parquet integration.

Also consider: https://en.wikipedia.org/wiki/Robustness_principle#:~:text=In%20computing%2C%20the%20robustness%20principle,what%20you%20accept%20from%20others%22.

In other words, programs that send messages to other machines (or to other programs on the same machine) should conform completely to the specifications, but programs that receive messages should accept non-conformant input as long as the meaning is clear.

tustvold (Contributor) commented Apr 8, 2024

We could probably just ignore the malformed map logical type and decode such columns as a regular list of structs. This would allow the data to be read without needing to implement custom Dremel shredding logic to handle malformed MapArray, and would let users determine how they wish to handle this situation.
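
For illustration, a sketch of the Arrow shape such a fallback might expose (arrow_schema types; the string key/value types and field names are just assumptions for the example):

use std::sync::Arc;

use arrow_schema::{DataType, Field, Fields};

fn main() {
    // Instead of DataType::Map, a malformed map column could surface as a plain
    // list of key/value structs, keeping the key's (incorrect) nullability as
    // declared in the file, so no repetition-level rewriting is needed.
    let entries = DataType::Struct(Fields::from(vec![
        Field::new("key", DataType::Utf8, true), // optional in the file, so nullable here
        Field::new("value", DataType::Utf8, true),
    ]));
    let fallback = Field::new(
        "key_value",
        DataType::List(Arc::new(Field::new("element", entries, false))),
        true,
    );
    println!("{fallback:?}");
}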

Tagging @mapleFU who may have further thoughts on this

jupiter (Contributor, Author) commented Apr 11, 2024

Removing/reducing the check works with all the files I tested. I have not been able to produce a file whose data is actually invalid under such a schema (e.g. a genuinely null key), but I'd assume reading it would error as it would for any other invalid data.
