parquet::arrow::arrow_writer::ArrowWriter ignores page size properties #2853

Closed
Tracked by #3462
thinkharderdev opened this issue Oct 9, 2022 · 2 comments · Fixed by #2854 or #2890
Labels
bug, parquet (Changes to the parquet crate)

Comments

@thinkharderdev
Contributor

Describe the bug

ArrowWriter ignores the page size properties when writing to Parquet. It also seems to always write exactly two pages: a normal-sized first page, followed by a second page containing all the remaining data.

To Reproduce

    // Lives in the parquet crate's `arrow::arrow_writer::tests` module and
    // relies on that module's imports.
    #[test]
    fn arrow_writer_page_size() {
        let mut rng = thread_rng();
        let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));

        // Capacity hints sized for 100,000 strings of 200 bytes each.
        let mut builder = StringBuilder::with_capacity(100_000, 200 * 100_000);

        // Random lowercase strings, so the data does not compress away.
        for _ in 0..100_000 {
            let value = (0..200)
                .map(|_| rng.gen_range(b'a'..=b'z') as char)
                .collect::<String>();

            builder.append_value(value);
        }

        let array = Arc::new(builder.finish());

        let batch = RecordBatch::try_new(schema, vec![array]).unwrap();

        let file = tempfile::tempfile().unwrap();

        // A single row group, with small limits that should force many pages.
        let props = WriterProperties::builder()
            .set_max_row_group_size(usize::MAX)
            .set_data_pagesize_limit(512)
            .set_write_batch_size(512)
            .build();

        let mut writer = ArrowWriter::try_new(
            file.try_clone().unwrap(),
            batch.schema(),
            Some(props),
        )
        .expect("Unable to write file");
        writer.write(&batch).unwrap();
        writer.close().unwrap();

        // Read the offset index back and count the pages of the only column.
        let reader = SerializedFileReader::new(file.try_clone().unwrap()).unwrap();

        let columns = reader.metadata().row_group(0).columns();

        let page_locations = read_pages_locations(&file, columns).unwrap();

        let offset_index = page_locations[0].clone();

        assert!(
            offset_index.len() > 2,
            "Expected more than two pages but got {:#?}",
            offset_index
        );
    }

This outputs

thread 'arrow::arrow_writer::tests::arrow_writer_page_size' panicked at 'Expected more than two pages but got [
    PageLocation {
        offset: 1148953,
        compressed_page_size: 9595,
        first_row_index: 0,
    },
    PageLocation {
        offset: 1158548,
        compressed_page_size: 19251505,
        first_row_index: 5632,
    },
]'
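
To make the skew in this output concrete, the number of rows in each page can be derived from the `first_row_index` values. The sketch below is only an illustration: `PageLocation` here is a local stand-in with the same three fields as in the output (not the parquet crate's type), `rows_per_page` and `reported_pages` are hypothetical helpers, and the total of 100,000 rows is taken from the reproduction test.

```rust
/// Local stand-in mirroring the fields printed in the panic output above.
/// This is not the parquet crate's type; it only keeps the sketch self-contained.
#[derive(Debug, Clone, PartialEq)]
struct PageLocation {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}

/// Hypothetical helper: number of rows in each page, derived from consecutive
/// `first_row_index` values and the total row count of the column chunk.
fn rows_per_page(pages: &[PageLocation], total_rows: i64) -> Vec<i64> {
    pages
        .iter()
        .enumerate()
        .map(|(i, page)| {
            // A page ends where the next one starts, or at the last row.
            let next_start = pages
                .get(i + 1)
                .map(|next| next.first_row_index)
                .unwrap_or(total_rows);
            next_start - page.first_row_index
        })
        .collect()
}

/// The two page locations reported by the failing test.
fn reported_pages() -> Vec<PageLocation> {
    vec![
        PageLocation { offset: 1148953, compressed_page_size: 9595, first_row_index: 0 },
        PageLocation { offset: 1158548, compressed_page_size: 19251505, first_row_index: 5632 },
    ]
}
```

Calling `rows_per_page(&reported_pages(), 100_000)` gives 5,632 rows for the first page and 94,368 for the second, so the second page holds roughly 94% of the rows in about 19 MB, far above the configured 512-byte limit.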

Expected behavior

The writer should respect the page size properties and write similarly sized pages.
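
One way to picture the expected behavior: the writer consumes values in chunks of `write_batch_size` and, after each chunk, closes the current page once the buffered bytes reach the data page size limit. The sketch below is a simplified model under that assumption, not the parquet crate's implementation. `Page` and `split_into_pages` are hypothetical names, and the model ignores encoding, compression, and dictionary pages.

```rust
/// A page in this simplified model: where it starts and how much it holds.
#[derive(Debug, Clone, PartialEq)]
struct Page {
    first_row_index: usize,
    num_rows: usize,
    num_bytes: usize,
}

/// Hypothetical model of size-limited paging (not the parquet crate's code).
///
/// Values are consumed in chunks of `write_batch_size`; after each chunk the
/// current page is closed if it has buffered at least `page_size_limit` bytes.
fn split_into_pages(
    value_sizes: &[usize],
    page_size_limit: usize,
    write_batch_size: usize,
) -> Vec<Page> {
    assert!(write_batch_size > 0, "write_batch_size must be non-zero");

    let mut pages = Vec::new();
    let mut current = Page { first_row_index: 0, num_rows: 0, num_bytes: 0 };

    for chunk in value_sizes.chunks(write_batch_size) {
        current.num_rows += chunk.len();
        current.num_bytes += chunk.iter().sum::<usize>();

        // The size check happens only at chunk boundaries, so a page can
        // overshoot the limit by up to one chunk's worth of data.
        if current.num_bytes >= page_size_limit {
            let next_start = current.first_row_index + current.num_rows;
            pages.push(current);
            current = Page { first_row_index: next_start, num_rows: 0, num_bytes: 0 };
        }
    }

    // Flush whatever remains as the final page.
    if current.num_rows > 0 {
        pages.push(current);
    }
    pages
}
```

With the reproduction's parameters (100,000 values of 200 bytes each, and both limits set to 512), this model yields 196 pages of at most 512 rows each, in contrast to the two pages observed.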

Additional context

@alamb added the parquet (Changes to the parquet crate) label Oct 14, 2022
@alamb
Contributor

alamb commented Oct 14, 2022

label_issue.py automatically added labels {'parquet'} from #2854

@alamb
Contributor

alamb commented Oct 18, 2022

Reopening, as @tustvold says it is not yet fixed: #2890 (comment)

@alamb reopened this Oct 18, 2022
tustvold added a commit that referenced this issue Oct 24, 2022
* Respect Page Size Limits in ArrowWriter (#2853)

* Update tests

* Add test required features

* Fix strings

* Review feedback