Lack of examples on parquet file write #1745

Open
Uinelj opened this issue May 25, 2022 · 4 comments
Labels
documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

Uinelj commented May 25, 2022

Which part is this question about
Documentation and examples about writing Parquet files using the ColumnWriter API.

Describe your question
I'd like to know if there's a clear reason for the absence of examples of writing Parquet files using the ColumnWriter API, and if not, I'd be glad to provide such examples.

Additional context

When looking for information about how to write Parquet files, the first place you end up is the parquet::file page on docs.rs, where the sample code omits the part where the actual write is done:

// ...
while let Some(mut col_writer) = row_group_writer.next_column().unwrap() {
    // ... write values to a column writer
    row_group_writer.close_column(col_writer).unwrap();
}
// ...
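For reference, the omitted part boils down to a write_batch call on the typed column writer. Here is a rough sketch against the (pre-15.0.0) API shown above, assuming a made-up required INT32 column:

// ... inside the loop above; the variant to match depends on the column's physical type
if let ColumnWriter::Int32ColumnWriter(ref mut typed_writer) = col_writer {
    // Required, non-repeated column: no definition or repetition levels needed.
    typed_writer.write_batch(&[1, 2, 3], None, None).unwrap();
}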

While writing Parquet is obviously more complex than writing row-oriented formats, and the API is relatively low-level, more documentation, and especially examples covering both flat and simply nested structures, would benefit people, imho.

Uinelj added the question (Further information is requested) label May 25, 2022

tustvold commented May 25, 2022

The lack of documentation is not intentional and PRs to improve the situation would be most welcome. 😀

IMO the primary user-facing API is definitely intended to be arrow: it is both easier to understand (Dremel is utterly mind-bending) and significantly more performant for columnar decoding (although work still remains on the write path). That being said, there are valid use cases for the lower-level APIs and we should make sure they are well documented 👍

FWIW #1719 has just been merged, which hopefully makes the write API a bit easier to use.
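For anyone looking for a starting point with the arrow path, a rough sketch of writing a small batch through ArrowWriter might look something like the following (the file name and column names are made up, and module paths may differ slightly between releases):

use std::{fs::File, sync::Arc};

use arrow::array::{ArrayRef, BooleanArray, Float64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Nullability lives on the arrays themselves; no def/rep levels to compute.
    let schema = Arc::new(Schema::new(vec![
        Field::new("one", DataType::Float64, true),
        Field::new("two", DataType::Utf8, false),
        Field::new("three", DataType::Boolean, false),
    ]));

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Float64Array::from(vec![Some(-1.0), None, Some(2.5)])) as ArrayRef,
            Arc::new(StringArray::from(vec!["foo", "bar", "baz"])) as ArrayRef,
            Arc::new(BooleanArray::from(vec![true, false, true])) as ArrayRef,
        ],
    )?;

    let file = File::create("test.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}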

alamb added the documentation (Improvements or additions to documentation) label May 26, 2022

alamb commented May 26, 2022

@Uinelj thanks for the question -- what input format is your data in? Maybe we can try to implement some examples showing how to create Parquet files from that type of input (perhaps via Arrow, as @tustvold mentions)?


Uinelj commented May 30, 2022

@tustvold: I have not tried the Arrow-facing part, as my use case is centered around writing (huge) Parquet files. Do you think it would make sense to go through the Arrow API even if I'm only looking to write Parquet files?
The main gripe I have/had is that the whole Dremel logic is hard to grasp; even if a thorough tutorial/explanation might be out of scope, some pointers and an example could help people.
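For what it's worth, here is a generic worked example of those levels (not specific to my schema): a nullable list of nullable 32-bit integers, using the standard three-level LIST layout. Only non-null leaf values are stored; definition levels record how much of the path is defined, and repetition levels record where a new list entry starts.

message example {
    optional group xs (LIST) {
        repeated group list {
            optional int32 element;
        }
    }
}

The max definition level for element is 3 (optional xs, repeated list, optional element) and the max repetition level is 1:

record               values    def levels    rep levels
xs = [1, null, 3]    [1, 3]    [3, 2, 3]     [0, 1, 1]
xs = []              []        [1]           [0]
xs = null            []        [0]           [0]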

I have updated my code to the 15.0.0 API, and I feel that the writing, opening and closing parts are smoother now; thanks a lot for that 👍

@alamb Well, I originally had a very nested schema, involving maps, nullable lists, required lists with nullable elements, etc. I'm not yet settled on a format since I want to measure performance for a set of use cases, so I'll experiment with the format.

After failing with that, I tried to create simpler layouts, namely the one from the Python tutorial (https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files), as well as another one using a required list with nullable items.

I think that having two or three examples of increasing complexity, involving optionality and some amount of nesting, would be good (to show how to define lists/maps, how to compute def/rep levels, and how to manage Option<T>).

Here is the code that replicates the format from the Parquet Python tutorial. It may have too many data structures and too few comments, but if you feel like it, it could become one of the examples after being fixed!

use std::{fs::File, sync::Arc};

use parquet::{
    column::writer::ColumnWriter,
    data_type::ByteArray,
    errors::ParquetError,
    file::{properties::WriterProperties, writer::SerializedFileWriter},
    schema::{parser::parse_message_type, types::Type},
};

/// Simple example struct
struct Simple {
    one: Option<f64>,
    two: String,
    three: bool,
}

/// Column-oriented (struct-of-vectors) representation of a set of `Simple` rows.
struct SimpleRows {
    ones: Vec<Option<f64>>,
    twos: Vec<ByteArray>,
    threes: Vec<bool>,
}

/// Values for one column, along with their definition and repetition levels.
struct WriteData<T> {
    data: Vec<T>,
    def: Vec<i16>,
    rep: Vec<i16>,
}

impl SimpleRows {
    /// `one` is optional: only the non-null values go into `data`, and each
    /// row gets a definition level (1 = value present, 0 = null).
    fn ones_all(&self) -> WriteData<f64> {
        let mut data = Vec::with_capacity(self.ones.len());

        // The schema is flat (no repeated fields), so the max repetition level
        // is 0 and every repetition level is 0 (they could also be omitted by
        // passing `None` to `write_batch`).
        let rep = vec![0; self.ones.len()];

        let def = self
            .ones
            .iter()
            .map(|x| match x {
                Some(d) => {
                    data.push(*d);
                    1
                }
                None => 0,
            })
            .collect();

        WriteData { data, def, rep }
    }

    /// `two` is declared optional in the schema but always present here,
    /// so every definition level is 1.
    fn twos_all(&self) -> WriteData<ByteArray> {
        let def = vec![1; self.twos.len()];
        let rep = vec![0; self.twos.len()];
        let data = self.twos.to_vec();

        WriteData { data, def, rep }
    }

    /// Same as `two`: always present, so every definition level is 1.
    fn threes_all(&self) -> WriteData<bool> {
        let def = vec![1; self.threes.len()];
        let rep = vec![0; self.threes.len()];
        let data = self.threes.to_vec();

        WriteData { data, def, rep }
    }
}

fn to_simplerows(s: &[Simple]) -> SimpleRows {
    let mut ones = Vec::with_capacity(s.len());
    let mut twos = Vec::with_capacity(s.len());
    let mut threes = Vec::with_capacity(s.len());

    for row in s {
        ones.push(row.one);
        twos.push(row.two.as_str().into());
        threes.push(row.three);
    }

    SimpleRows { ones, twos, threes }
}

fn write(schema: Type, rows: SimpleRows) -> Result<(), ParquetError> {
    let buf = File::create("./test.parquet").unwrap();
    let props = WriterProperties::builder().build();
    let mut w = SerializedFileWriter::new(buf, Arc::new(schema), Arc::new(props)).unwrap();

    let mut rg = w.next_row_group().unwrap();
    let mut nb_col = 0;
    // Columns are yielded in the order they appear in the schema: one, two, three.
    while let Some(mut col_writer) = rg.next_column().unwrap() {
        match nb_col {
            0 => {
                if let ColumnWriter::DoubleColumnWriter(ref mut col_writer) = col_writer.untyped() {
                    let r = rows.ones_all();
                    col_writer
                        .write_batch(&r.data, Some(&r.def[..]), Some(&r.rep[..]))
                        .unwrap();
                } else {
                    panic!("wrong col type for nb col 0")
                }
            }
            1 => {
                if let ColumnWriter::ByteArrayColumnWriter(ref mut col_writer) =
                    col_writer.untyped()
                {
                    let r = rows.twos_all();
                    col_writer
                        .write_batch(&r.data, Some(&r.def[..]), Some(&r.rep[..]))
                        .unwrap();
                } else {
                    panic!("wrong col type for nb col 1")
                }
            }
            2 => {
                if let ColumnWriter::BoolColumnWriter(ref mut col_writer) = col_writer.untyped() {
                    let r = rows.threes_all();
                    col_writer
                        .write_batch(&r.data, Some(&r.def[..]), Some(&r.rep[..]))
                        .unwrap();
                } else {
                    panic!("wrong col type for nb col 2")
                }
            }
            _ => panic!("wrong col nb"),
        }
        nb_col += 1;
        col_writer.close()?;
    }
    rg.close()?;
    w.close()?;
    Ok(())
}

fn get_examples() -> Vec<Simple> {
    let a = Simple {
        one: Some(-1.0),
        two: "foo".to_string(),
        three: true,
    };
    let b = Simple {
        one: None,
        two: "bar".to_string(),
        three: false,
    };
    let c = Simple {
        one: Some(2.5),
        two: "baz".to_string(),
        three: true,
    };

    vec![a, b, c]
}

fn main() {
    // Flat schema: three columns, all declared optional (nullable).
    let schema = r#"
        message documents {
            optional double one;
            optional binary two (string);
            optional boolean three;
        }
    "#;

    let schema = parse_message_type(schema).expect("invalid schema");
    let simples = to_simplerows(&get_examples());
    write(schema, simples).unwrap();
}

@tustvold

I think that having two or three examples increasing in complexity and involving optionality and some amount of nesting would be good.

Yes, if you're happy to contribute such documentation that would be amazing 👍

Do you think that it would make sense to go through the Arrow API even if I'm only looking to write Parquet files?

I think this really depends on what the source of your data is, and whether it can be cheaply read into arrow. The selling point of arrow is as a columnar interchange format, allowing different systems to pass around buffers in a way they can all process efficiently. Assuming you can cheaply convert your input data to arrow, it should be faster...

That being said, currently the arrow writer has not had nearly as much attention paid to it as the reader side, and so will be slower in some cases than the row APIs. I've created a high level ticket #1764, but I'm not sure when I'll have time to get to it.

The main gripe I have/had is around the whole Dremel logic that is hard to grasp

Bit of an understatement here 😆. FWIW, I've found this to be one of the more useful guides: https://akshays-blog.medium.com/wrapping-head-around-repetition-and-definition-levels-in-dremel-powering-bigquery-c1a33c9695da

My point still stands: in theory, the promise of arrow is that someone else will have handled this for you, but your mileage may vary.
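To make that concrete, a rough sketch (the file and column names are made up, and depending on the release the arrow writer's nested support may still have gaps) of writing a nullable list of nullable integers without computing any levels by hand:

use std::{fs::File, sync::Arc};

use arrow::array::{Array, ArrayRef, ListArray};
use arrow::datatypes::{Field, Int32Type, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A nullable list of nullable int32s; the writer derives the Parquet
    // definition/repetition levels from the array's null and offset buffers.
    let xs = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
        Some(vec![Some(1), None, Some(3)]),
        Some(vec![]),
        None,
    ]);

    let schema = Arc::new(Schema::new(vec![Field::new(
        "xs",
        xs.data_type().clone(),
        true,
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(xs) as ArrayRef])?;

    let mut writer = ArrowWriter::try_new(File::create("nested.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}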

Well I originally had a very nested schema, involving maps, nullable lists, required lists with nullable elements, etc. I'm not yet fixed on a format since I want to measure performance for a set of usecases, so I'll experiment on the format.

My 2 cents is that even if tooling supports nested schemas, it often comes with unexpected caveats. For example Presto/Trino has had bugs in projection pushdown for nested schemas for years. I would strongly advise that if you can flatten your schemas, you will save yourself a lot of headaches down the line if you do so 😅
