
unable to write parquet file with UTC timestamp #1932

Closed
msalib opened this issue Jun 23, 2022 · 5 comments · Fixed by #1937 or #1953
Labels
bug parquet Changes to the parquet crate

Comments

@msalib
Contributor

msalib commented Jun 23, 2022

Describe the bug
I cannot figure out how to write a parquet file with a timestamp column that gets encoded as UTC. All my efforts produce files with naive timestamps and no UTC metadata.

To Reproduce

Consider this program: it writes a tiny parquet file to /tmp/q.parquet. But using both pqrs and pandas/pyarrow on the resulting file shows that there is no timezone present -- the metric_date column is a naive timestamp.

use std::sync::Arc;

use arrow::{
    array::{StringArray, TimestampMillisecondArray},
    datatypes::{DataType, Field, Schema, TimeUnit},
    record_batch::RecordBatch,
};
use parquet::{
    arrow::arrow_writer::ArrowWriter,
    file::properties::{WriterProperties, WriterVersion},
};

fn main() {
    // Tried both with and without a timezone; neither produces UTC metadata:
    //let tz = Some("UTC".to_owned());
    let tz = None;
    let fields = vec![
        Field::new(
            "metric_date",
            DataType::Timestamp(TimeUnit::Millisecond, tz.clone()),
            false,
        ),
        Field::new("my_id", DataType::Utf8, false),
    ];
    let schema = Arc::new(Schema::new(fields));

    let my_ids = Arc::new(StringArray::from(vec!["hi", "there"]));
    let dates = Arc::new(TimestampMillisecondArray::from_vec(
        vec![1234532523, 1234124],
        tz,
    ));
    let batch = RecordBatch::try_new(schema.clone(), vec![dates, my_ids]).unwrap();

    let f = std::fs::File::create("/tmp/q.parquet").unwrap();
    let props = WriterProperties::builder()
        .set_writer_version(WriterVersion::PARQUET_2_0)
        .build();

    let mut writer = ArrowWriter::try_new(f, schema, Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    println!("Hello, world!");
}
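For reference (not part of the original repro), the sample epoch-millisecond values decode to valid UTC instants in January 1970, which can be checked with a stdlib-only Python snippet:

```python
from datetime import datetime, timezone

# The repro writes these epoch-millisecond values; interpreted as UTC
# instants they fall in January 1970:
for ms in (1234532523, 1234124):
    print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat())
# 1234532523 ms -> 1970-01-15T06:55:32.523000+00:00
```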

Additional context
Tested using arrow="16.0.0" and parquet="16.0.0".

@msalib msalib added the bug label Jun 23, 2022
@msalib
Contributor Author

msalib commented Jun 23, 2022

Here are some things I've tried (none of them make any difference):

  • setting tz to None or Some("+00:00".to_owned())
  • using v1 vs v2
  • using milliseconds vs nanoseconds vs microseconds

@tustvold
Contributor

tustvold commented Jun 23, 2022

Could you expand a bit on what the expected behaviour is? I honestly cannot find any comprehensive document on how this is supposed to be handled. It's one of the many data model mismatches between arrow and parquet where what is "correct" isn't clearly defined - #1666.

Ultimately Parquet does not have a native mechanism to encode timezone information in its schema, instead opting for something slightly different - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp. The arrow schema is embedded in the parquet file, but as documented in #1663 it cannot be treated as authoritative.
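Concretely, Parquet's TIMESTAMP logical type carries only a time unit and a single isAdjustedToUTC boolean, so a writer has to collapse Arrow's optional timezone string down to that one flag. A minimal sketch of that mapping (function name hypothetical; this mirrors the approach of the eventual fix, "Set is_adjusted_to_utc if any timezone set"):

```python
# Sketch only: Arrow stores an optional timezone string per field, while
# Parquet's TIMESTAMP logical type stores just a unit plus one boolean.
def is_adjusted_to_utc(arrow_tz):
    # None -> naive/local-semantics timestamps;
    # any timezone ("UTC", "+00:00", a named zone) -> UTC-normalized instants.
    return arrow_tz is not None

assert is_adjusted_to_utc("UTC")
assert is_adjusted_to_utc("+00:00")
assert not is_adjusted_to_utc(None)
```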

What I can say is the following:

  • The timezone is being stored in the embedded schema
  • As of parquet 15.0.0, in particular Fix Parquet Reader's Arrow Schema Inference #1682, parquet-rs roundtrips timezones correctly
  • pqrs is on parquet 12.0.0, where timezones did not roundtrip correctly
  • pyarrow appears to ignore the timezone stored within the arrow schema; I don't understand why

@msalib
Contributor Author

msalib commented Jun 24, 2022

Sure! For me, expected behavior is that pandas will read a rust-produced parquet file with UTC timestamp columns and recognize that they're UTC. Like this:

import pandas as pd
assert str(pd.read_parquet("/tmp/q.parquet").dtypes.metric_date) == 'datetime64[ns, UTC]'
# and not 'datetime64[ns]'

Thank you so much for explaining that, given the nature of the specification, this might not be feasible; I was going crazy, in part because this used to work (I have a python unit test that invokes rust code and reads parquet files generated by rust). Up to version 14 of parquet+arrow this worked fine, but as of version 15 the behavior changed.

@msalib
Contributor Author

msalib commented Jun 24, 2022

This slightly simplified example shows different behavior when depending on arrow=14.0.0,parquet=14.0.0 versus arrow=15.0.0,parquet=15.0.0. The difference is visible both in pandas and in the pqrs schema output.

use std::sync::Arc;

use arrow::{
    array::{StringArray, TimestampMillisecondArray},
    datatypes::{DataType, Field, Schema, TimeUnit},
    record_batch::RecordBatch,
};
use parquet::arrow::arrow_writer::ArrowWriter;

fn main() {
    let tz = Some("UTC".to_owned());
    let fields = vec![
        Field::new(
            "metric_date",
            DataType::Timestamp(TimeUnit::Millisecond, tz.clone()),
            false,
        ),
        Field::new("my_id", DataType::Utf8, false),
    ];
    let schema = Arc::new(Schema::new(fields));

    let my_ids = Arc::new(StringArray::from(vec!["hi", "there"]));
    let dates = Arc::new(TimestampMillisecondArray::from_vec(
        vec![1234532523, 1234124],
        tz,
    ));
    let batch = RecordBatch::try_new(schema.clone(), vec![dates, my_ids]).unwrap();

    let f = std::fs::File::create("/tmp/q.parquet").unwrap();

    let mut writer = ArrowWriter::try_new(f, schema, None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    println!("Hello, world!");
}

Given the unfortunate state of the specification, I understand that the changes in version 15 might be better in many ways and fix all manner of issues, but in this regard, they constitute a regression.

@msalib
Contributor Author

msalib commented Jun 24, 2022

@tustvold Thank you so much for fixing this so quickly! I really appreciate it!

We're using rust+parquet+python+serverless for geospatial computing at work and arrow-rs' work has been incredibly helpful!

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jun 28, 2022
tustvold added a commit that referenced this issue Jun 29, 2022
* Set is_adjusted_to_utc if any timezone set (#1932)

* Fix roundtrip