Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format Timestamps as RFC3339 #2939

Merged
merged 4 commits into from Oct 28, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
33 changes: 16 additions & 17 deletions arrow-array/src/array/primitive_array.rs
Expand Up @@ -18,15 +18,18 @@
use crate::builder::{BooleanBufferBuilder, BufferBuilder, PrimitiveBuilder};
use crate::iterator::PrimitiveIter;
use crate::raw_pointer::RawPtrBox;
use crate::temporal_conversions::{as_date, as_datetime, as_duration, as_time};
use crate::temporal_conversions::{
as_date, as_datetime, as_datetime_with_timezone, as_duration, as_time,
};
use crate::timezone::Tz;
use crate::trusted_len::trusted_len_unzip;
use crate::types::*;
use crate::{print_long_array, Array, ArrayAccessor};
use arrow_buffer::{i256, ArrowNativeType, Buffer};
use arrow_data::bit_iterator::try_for_each_valid_idx;
use arrow_data::ArrayData;
use arrow_schema::{ArrowError, DataType};
use chrono::{Duration, FixedOffset, NaiveDate, NaiveDateTime, NaiveTime};
use chrono::{DateTime, Duration, NaiveDate, NaiveDateTime, NaiveTime};
use half::f16;
use std::any::Any;

Expand Down Expand Up @@ -116,40 +119,40 @@ pub type Float64Array = PrimitiveArray<Float64Type>;
/// # Example: UTC timestamps post epoch
/// ```
/// # use arrow_array::TimestampSecondArray;
/// use chrono::FixedOffset;
/// use arrow_array::timezone::Tz;
/// // Corresponds to single element array with entry 1970-05-09T14:25:11+0:00
/// let arr = TimestampSecondArray::from(vec![11111111]);
/// // OR
/// let arr = TimestampSecondArray::from(vec![Some(11111111)]);
/// let utc_offset = FixedOffset::east(0);
/// let utc_tz: Tz = "+00:00".parse().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to show an example of going from a chrono timezone (like FixedOffset) to a Arrow::tz (though maybe that is already available -- I didn't check)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
I can't add it here as inner data is private now

error[E0423]: cannot initialize a tuple struct which contains private fields
   --> src/array/primitive_array.rs:143:14
    |
12  | let utc_tz = Tz(FixedOffset::east(0));
    |              ^^
    |
note: constructor is not visible here due to private fields
   --> /Users/willy/willy/waitingkuo/arrow-rs/arrow-array/src/timezone.rs:246:19
    |
246 |     pub struct Tz(FixedOffset);
    |                   ^^^^^^^^^^^ private field

/// An Arrow [`TimeZone`]
#[derive(Debug, Copy, Clone)]
pub struct Tz(FixedOffset);

as #2909 is to add a timezone abstraction, do we still encourage user to use chrono api directly?

@tustvold do you have any comments?

///
/// assert_eq!(arr.value_as_datetime_with_tz(0, utc_offset).map(|v| v.to_string()).unwrap(), "1970-05-09 14:25:11")
/// assert_eq!(arr.value_as_datetime_with_tz(0, utc_tz).map(|v| v.to_string()).unwrap(), "1970-05-09 14:25:11 +00:00")
/// ```
///
/// # Example: UTC timestamps pre epoch
/// ```
/// # use arrow_array::TimestampSecondArray;
/// use chrono::FixedOffset;
/// use arrow_array::timezone::Tz;
/// // Corresponds to single element array with entry 1969-08-25T09:34:49+0:00
/// let arr = TimestampSecondArray::from(vec![-11111111]);
/// // OR
/// let arr = TimestampSecondArray::from(vec![Some(-11111111)]);
/// let utc_offset = FixedOffset::east(0);
/// let utc_tz: Tz = "+00:00".parse().unwrap();
///
/// assert_eq!(arr.value_as_datetime_with_tz(0, utc_offset).map(|v| v.to_string()).unwrap(), "1969-08-25 09:34:49")
/// assert_eq!(arr.value_as_datetime_with_tz(0, utc_tz).map(|v| v.to_string()).unwrap(), "1969-08-25 09:34:49 +00:00")
/// ```
///
/// # Example: With timezone specified
/// ```
/// # use arrow_array::TimestampSecondArray;
/// use chrono::FixedOffset;
/// use arrow_array::timezone::Tz;
/// // Corresponds to single element array with entry 1970-05-10T00:25:11+10:00
/// let arr = TimestampSecondArray::from(vec![11111111]).with_timezone("+10:00".to_string());
/// // OR
/// let arr = TimestampSecondArray::from(vec![Some(11111111)]).with_timezone("+10:00".to_string());
/// let sydney_offset = FixedOffset::east(10 * 60 * 60);
/// let sydney_tz: Tz = "+10:00".parse().unwrap();
///
/// assert_eq!(arr.value_as_datetime_with_tz(0, sydney_offset).map(|v| v.to_string()).unwrap(), "1970-05-10 00:25:11")
/// assert_eq!(arr.value_as_datetime_with_tz(0, sydney_tz).map(|v| v.to_string()).unwrap(), "1970-05-10 00:25:11 +10:00")
/// ```
///
pub type TimestampSecondArray = PrimitiveArray<TimestampSecondType>;
Expand Down Expand Up @@ -503,12 +506,8 @@ where
///
/// functionally it is same as `value_as_datetime`, however it adds
/// the passed tz to the to-be-returned NaiveDateTime
pub fn value_as_datetime_with_tz(
&self,
i: usize,
tz: FixedOffset,
) -> Option<NaiveDateTime> {
as_datetime::<T>(i64::from(self.value(i))).map(|datetime| datetime + tz)
pub fn value_as_datetime_with_tz(&self, i: usize, tz: Tz) -> Option<DateTime<Tz>> {
as_datetime_with_timezone::<T>(i64::from(self.value(i)), tz)
}

/// Returns value as a chrono `NaiveDate` by using `Self::datetime()`
Expand Down
80 changes: 71 additions & 9 deletions arrow/src/util/display.rs
Expand Up @@ -33,6 +33,7 @@ use crate::{array, datatypes::IntervalUnit};
use array::DictionaryArray;

use crate::error::{ArrowError, Result};
use arrow_array::timezone::Tz;

macro_rules! make_string {
($array_type:ty, $column: ident, $row: ident) => {{
Expand Down Expand Up @@ -190,14 +191,37 @@ macro_rules! make_string_datetime {
} else {
array
.value_as_datetime($row)
.map(|d| d.to_string())
.map(|d| format!("{:?}", d))
.unwrap_or_else(|| "ERROR CONVERTING DATE".to_string())
};

Ok(s)
}};
}

macro_rules! make_string_datetime_with_tz {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might also be worth commenting that date/times are formatted as rfc3339?

($array_type:ty, $tz_string: ident, $column: ident, $row: ident) => {{
let array = $column.as_any().downcast_ref::<$array_type>().unwrap();

let s = if array.is_null($row) {
"".to_string()
} else {
match $tz_string.parse::<Tz>() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as written this is going to parse the timezone string for every row -- perhaps we can parse it once per array 🤔

Copy link
Contributor

@tustvold tustvold Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this display code is very inefficient, I think there is a broader issue to clean it up.

See the disclaimer on https://docs.rs/arrow/latest/arrow/util/display/fn.array_value_to_string.html

Ok(tz) => array
.value_as_datetime_with_tz($row, tz)
.map(|d| format!("{}", d.to_rfc3339()))
.unwrap_or_else(|| "ERROR CONVERTING DATE".to_string()),
Err(_) => array
.value_as_datetime($row)
.map(|d| format!("{:?} (Unknown Time Zone '{}')", d, $tz_string))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right -- isn't this error for the case when $ts_string.parse() fails? As written I think it will be returned when the entry in array at $row is null (which should never happen).

I think we should remove the match statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic behind is:

while parsing TZ failed, parse datetime only

  1. if it works, show the datetime part with a unknown time zone warning
+--------------------------------------------------------+
| Timestamp(Second, Some("Asia/Taipei2"))                |
+--------------------------------------------------------+
| 1970-01-01T00:00:00 (Unknown Time Zone 'Asia/Taipei2') |
+--------------------------------------------------------+
  1. if it doesn't work, display something (similar as make_string_datetime) like this
+--------------------------------------------------------+
| Timestamp(Second, Some("Asia/Taipei2"))                |
+--------------------------------------------------------+
| ERROR CONVERTING DATE                                  |
+--------------------------------------------------------+

@alamb @tustvold
I wonder if it's preferred. we could also show followings directly while failed to parse timezone

+--------------------------------------------------------+
| Timestamp(Second, Some("Asia/Taipei2"))                |
+--------------------------------------------------------+
| Unknown Time Zone 'Asia/Taipei2'                       |
+--------------------------------------------------------+

i agree the logic here is confusing, i should've added some comments here

.unwrap_or_else(|| "ERROR CONVERTING DATE".to_string()),
}
};

Ok(s)
}};
}

// It's not possible to do array.value($row).to_string() for &[u8], let's format it as hex
macro_rules! make_string_hex {
($array_type:ty, $column: ident, $row: ident) => {{
Expand Down Expand Up @@ -334,17 +358,55 @@ pub fn array_value_to_string(column: &array::ArrayRef, row: usize) -> Result<Str
DataType::Float32 => make_string!(array::Float32Array, column, row),
DataType::Float64 => make_string!(array::Float64Array, column, row),
DataType::Decimal128(..) => make_string_from_decimal(column, row),
DataType::Timestamp(unit, _) if *unit == TimeUnit::Second => {
make_string_datetime!(array::TimestampSecondArray, column, row)
DataType::Timestamp(unit, tz_string_opt) if *unit == TimeUnit::Second => {
match tz_string_opt {
Some(tz_string) => make_string_datetime_with_tz!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to combine the make_string_datetime_with_tz and make_string_datetime macros together?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be awesome to avoid having to make a string at all (like return a impl Display) but that is definitely not something to do in this PR 😆

array::TimestampSecondArray,
tz_string,
column,
row
),
None => make_string_datetime!(array::TimestampSecondArray, column, row),
}
}
DataType::Timestamp(unit, _) if *unit == TimeUnit::Millisecond => {
make_string_datetime!(array::TimestampMillisecondArray, column, row)
DataType::Timestamp(unit, tz_string_opt) if *unit == TimeUnit::Millisecond => {
match tz_string_opt {
Some(tz_string) => make_string_datetime_with_tz!(
array::TimestampMillisecondArray,
tz_string,
column,
row
),
None => {
make_string_datetime!(array::TimestampMillisecondArray, column, row)
}
}
}
DataType::Timestamp(unit, _) if *unit == TimeUnit::Microsecond => {
make_string_datetime!(array::TimestampMicrosecondArray, column, row)
DataType::Timestamp(unit, tz_string_opt) if *unit == TimeUnit::Microsecond => {
match tz_string_opt {
Some(tz_string) => make_string_datetime_with_tz!(
array::TimestampMicrosecondArray,
tz_string,
column,
row
),
None => {
make_string_datetime!(array::TimestampMicrosecondArray, column, row)
}
}
}
DataType::Timestamp(unit, _) if *unit == TimeUnit::Nanosecond => {
make_string_datetime!(array::TimestampNanosecondArray, column, row)
DataType::Timestamp(unit, tz_string_opt) if *unit == TimeUnit::Nanosecond => {
match tz_string_opt {
Some(tz_string) => make_string_datetime_with_tz!(
array::TimestampNanosecondArray,
tz_string,
column,
row
),
None => {
make_string_datetime!(array::TimestampNanosecondArray, column, row)
}
}
}
DataType::Date32 => make_string_date!(array::Date32Array, column, row),
DataType::Date64 => make_string_date!(array::Date64Array, column, row),
Expand Down
129 changes: 125 additions & 4 deletions arrow/src/util/pretty.rs
Expand Up @@ -370,13 +370,134 @@ mod tests {
};
}

/// Generate an array with type $ARRAYTYPE with a numeric value of
/// $VALUE, and compare $EXPECTED_RESULT to the output of
/// formatting that array with `pretty_format_batches`
macro_rules! check_datetime_with_timezone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think passing an optional timezone to check_datetime and calling with_timezone_opt would avoid some duplication.

On a related note, I don't think either need to be macros and therefore probably shouldn't be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in "make them functions" 👍

($ARRAYTYPE:ident, $VALUE:expr, $TZ_STRING:expr, $EXPECTED_RESULT:expr) => {
let mut builder = $ARRAYTYPE::builder(10);
builder.append_value($VALUE);
builder.append_null();
let array = builder.finish();
let array = array.with_timezone($TZ_STRING);

let schema = Arc::new(Schema::new(vec![Field::new(
"f",
array.data_type().clone(),
true,
)]));
let batch = RecordBatch::try_new(schema, vec![Arc::new(array)]).unwrap();

let table = pretty_format_batches(&[batch])
.expect("formatting batches")
.to_string();

let expected = $EXPECTED_RESULT;
let actual: Vec<&str> = table.lines().collect();

assert_eq!(expected, actual, "Actual result:\n\n{:#?}\n\n", actual);
};
}

#[test]
#[cfg(features = "chrono-tz")]
fn test_pretty_format_timestamp_second_with_utc_timezone() {
let expected = vec![
"+---------------------------+",
"| f |",
"+---------------------------+",
"| 1970-05-09T14:25:11+00:00 |",
"| |",
"+---------------------------+",
];
check_datetime_with_timezone!(
TimestampSecondArray,
11111111,
"UTC".to_string(),
expected
);
}

#[test]
#[cfg(features = "chrono-tz")]
fn test_pretty_format_timestamp_second_with_non_utc_timezone() {
let expected = vec![
"+---------------------------+",
"| f |",
"+---------------------------+",
"| 1970-05-09T22:25:11+08:00 |",
"| |",
"+---------------------------+",
];
check_datetime_with_timezone!(
TimestampSecondArray,
11111111,
"Asia/Taipei".to_string(),
expected
);
}

#[test]
fn test_pretty_format_timestamp_second_with_fixed_offset_timezone() {
let expected = vec![
"+---------------------------+",
"| f |",
"+---------------------------+",
"| 1970-05-09T22:25:11+08:00 |",
"| |",
"+---------------------------+",
];
check_datetime_with_timezone!(
TimestampSecondArray,
11111111,
"+08:00".to_string(),
expected
);
}

#[test]
fn test_pretty_format_timestamp_second_with_incorrect_fixed_offset_timezone() {
let expected = vec![
"+-------------------------------------------------+",
"| f |",
"+-------------------------------------------------+",
"| 1970-05-09T14:25:11 (Unknown Time Zone '08:00') |",
"| |",
"+-------------------------------------------------+",
];
check_datetime_with_timezone!(
TimestampSecondArray,
11111111,
"08:00".to_string(),
expected
);
}

#[test]
fn test_pretty_format_timestamp_second_with_unknown_timezone() {
let expected = vec![
"+---------------------------------------------------+",
"| f |",
"+---------------------------------------------------+",
"| 1970-05-09T14:25:11 (Unknown Time Zone 'Unknown') |",
"| |",
"+---------------------------------------------------+",
];
check_datetime_with_timezone!(
TimestampSecondArray,
11111111,
"Unknown".to_string(),
expected
);
}

#[test]
fn test_pretty_format_timestamp_second() {
let expected = vec![
"+---------------------+",
"| f |",
"+---------------------+",
"| 1970-05-09 14:25:11 |",
"| 1970-05-09T14:25:11 |",
"| |",
"+---------------------+",
];
Expand All @@ -389,7 +510,7 @@ mod tests {
"+-------------------------+",
"| f |",
"+-------------------------+",
"| 1970-01-01 03:05:11.111 |",
"| 1970-01-01T03:05:11.111 |",
"| |",
"+-------------------------+",
];
Expand All @@ -402,7 +523,7 @@ mod tests {
"+----------------------------+",
"| f |",
"+----------------------------+",
"| 1970-01-01 00:00:11.111111 |",
"| 1970-01-01T00:00:11.111111 |",
"| |",
"+----------------------------+",
];
Expand All @@ -415,7 +536,7 @@ mod tests {
"+-------------------------------+",
"| f |",
"+-------------------------------+",
"| 1970-01-01 00:00:00.011111111 |",
"| 1970-01-01T00:00:00.011111111 |",
"| |",
"+-------------------------------+",
];
Expand Down