Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MapArray columns don't handle null values correctly #2484

Closed
d-willis opened this issue Aug 17, 2022 · 4 comments · Fixed by #2500
Closed

MapArray columns don't handle null values correctly #2484

d-willis opened this issue Aug 17, 2022 · 4 comments · Fixed by #2500
Labels
arrow Changes to the arrow crate bug

Comments

@d-willis
Copy link

Describe the bug

Given a parquet file with the following nested map content:

 +--------------------------------------------------------+
 |map                                                     |
 +--------------------------------------------------------+
 |null                                                    |
 |null                                                    |
 |{three -> 3, four -> 4, five -> 5, six -> 6, seven -> 7}|
 +--------------------------------------------------------+

Reading the keys values for the third record gives the wrong data. We expect ["three", "four", "five", "six", "seven"] but get [null, null, "three", "four", "five"].

To Reproduce

The following test will reproduce this:

#[test]
// This test writes a parquet file with the following data:
// +--------------------------------------------------------+
// |map                                                     |
// +--------------------------------------------------------+
// |null                                                    |
// |null                                                    |
// |{three -> 3, four -> 4, five -> 5, six -> 6, seven -> 7}|
// +--------------------------------------------------------+
//
// It then attempts to read the data back and checks that the third record
// contains the expected values.
fn read_map_array_column() {
    // Make sure generated parquet file is removed whether test passes or not
    let _file_remover = FileRemover {
        filename: "read_map_column_with_leading_nulls.parquet".to_string(),
    };

    // Schema for single map of string to int32
    let path = Path::new("read_map_column_with_leading_nulls.parquet");
    let schema = Schema::new(vec![Field::new(
        "map",
        DataType::Map(
            Box::new(Field::new(
                "entries",
                DataType::Struct(vec![
                    Field::new("keys", DataType::Utf8, false),
                    Field::new("values", DataType::Int32, true),
                ]),
                false,
            )),
            false, // Map field not sorted
        ),
        true,
    )]);

    // Create builders for map
    let string_builder = StringBuilder::new(5);
    let ints_builder: PrimitiveBuilder<Int32Type> = PrimitiveBuilder::new(1);
    let mut map_builder = MapBuilder::new(None, string_builder, ints_builder);

    // Add two null records and one record with five entries
    map_builder.append(false).expect("adding null map entry");
    map_builder.append(false).expect("adding null map entry");
    map_builder.keys().append_value("three");
    map_builder.keys().append_value("four");
    map_builder.keys().append_value("five");
    map_builder.keys().append_value("six");
    map_builder.keys().append_value("seven");

    map_builder.values().append_value(3);
    map_builder.values().append_value(4);
    map_builder.values().append_value(5);
    map_builder.values().append_value(6);
    map_builder.values().append_value(7);
    map_builder.append(true).expect("adding map entry");

    // Create record batch
    let batch =
        RecordBatch::try_new(Arc::new(schema), vec![Arc::new(map_builder.finish())])
            .expect("create record batch");

    // Write record batch to file
    let props = WriterProperties::builder().build();
    let file = fs::File::create(&path).expect("create file");
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))
        .expect("creat file writer");
    writer.write(&batch).expect("writing file");
    writer.close().expect("close writer");

    // Read file
    let file = fs::File::open(&Path::new("read_map_column_with_leading_nulls.parquet"))
        .expect("open file");
    let record_batch_reader = ParquetRecordBatchReaderBuilder::try_new(file).expect("Trying to read parquet file").build().expect("Getting batch iterator");
    for maybe_record_batch in record_batch_reader {
        let record_batch = maybe_record_batch.expect("Getting current batch");
        let col = record_batch.column(0);
        let map_entry = array::as_map_array(col).value(2);
        let struct_col = array::as_struct_array(&map_entry);
        let key_col = array::as_string_array(struct_col.column(0)); // Key column
        assert_eq!(key_col.value(0), "three");
        assert_eq!(key_col.value(1), "four");
        assert_eq!(key_col.value(2), "five");
        assert_eq!(key_col.value(3), "six");
        assert_eq!(key_col.value(4), "seven");
    }
    println!("finished reading");
    fs::remove_file("read_map_column_with_leading_nulls.parquet").expect("Removing file");
}

Expected behavior

Reading the keys values for the third record should give["three", "four", "five", "six", "seven"].

Additional context

Debugging the code shows that the offsets array for the MapArray's key column is this [0, 0, 0, 5, 9, 13] when I think it should be this [0, 5, 9, 13, 16, 21]. It looks like the issue is in parquet::arrow::array_reader::map_array::MapArrayReaderconsume_batch where the repetition and definition levels are used to determine the data offsets (starting at parquet\src\arrow\array_reader\map_array.rs line 123).

@d-willis d-willis added the bug label Aug 17, 2022
@d-willis
Copy link
Author

The above reproducer test can be found in the fork:
https://github.com/d-willis/arrow-rs/tree/d_willis/map_array_reader_issue

@tustvold
Copy link
Contributor

tustvold commented Aug 17, 2022

This is a duplicate of #1699, I will attempt to fix this in the coming days. It has been on my radar for a while

@d-willis
Copy link
Author

I had a look at that issue and thought it was more about MapArray nested in another Array rather than at the top level. Still, if you can get it to work as expected, that'd be great!

@tustvold
Copy link
Contributor

tustvold commented Aug 17, 2022

Yeah the ticket just covers what was clearly incorrect when I had a brief look at it, there are likely other issues with the logic. I fully intend to replace it wholesale with the ListArrayReader logic which is better tested, and actually correct (or at the very least more plausibly correct) 😅

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 18, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 18, 2022
tustvold added a commit that referenced this issue Aug 18, 2022
* Fix MapArrayReader (#2484) (#1699) (#1561)

* Fix comments

* Review feedback
@alamb alamb added the arrow Changes to the arrow crate label Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants