Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set adjusted to UTC if UTC timezone (#1932) #1937

Merged
merged 1 commit into from Jun 24, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #1932

Rationale for this change

Prior to https://github.com/apache/arrow-rs/pull/1682/files#diff-103f6c92559ee00d93eff806b411e0c8c8d249564fb3f8674da1da7693f86cb1L393 the writer would set is_adjusted_to_u_t_c to true if any timezone was present in the arrow datatype. This was incorrect as normalization to UTC would never actually occur.

The fix was to always set is_adjusted_to_u_t_c to be false, which whilst correct, is overly conservative and represents a regression for the case where the timezone actually is UTC.

What changes are included in this PR?

If the source array is already normalized to UTC, set is_adjusted_to_u_t_c to be true

Are there any user-facing changes?

Yes, not sure if counts as breaking though

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 24, 2022
@@ -300,10 +300,15 @@ fn arrow_to_parquet_type(field: &Field) -> Result<Type> {
.with_repetition(repetition)
.build()
}
DataType::Timestamp(time_unit, _) => {
DataType::Timestamp(time_unit, tz) => {
let is_utc = tz
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially played around with using chrono-tz, but this is a non-trivial additional dependency, and does not appear to be being actively maintained. I think this should be good enough, until such a time as we potentially add coerce types functionality as part of #1666

DataType::Timestamp(time_unit, tz) => {
let is_utc = tz
.as_ref()
.map(|tz| tz == "UTC" || tz == "+00:00" || tz == "-00:00")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically GMT is the timezone - https://www.timeanddate.com/time/gmt-utc-time.html
Should GMT be added in the checks too ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would rather keep this small and simple, rather than cover all timezones that happen to be equivalent. Longer term this may be handled as part of #1938

@tustvold tustvold merged commit d52be30 into apache:master Jun 24, 2022
@tustvold tustvold deleted the set-adjusted-to-utc branch June 24, 2022 13:04
@jorgecarleitao
Copy link
Member

While applying this fix in arrow2, I broke some of our integration tests against pyarrow. Coming to the specs, when the tz string is set in Arrow, it means that

  1. the values i64 are time-aware
  2. the values are stored in UTC
  3. the tz string represents the offset/timezone that needs to be taken into account when using the i64

In this PR's notation, the values are by definition normalized to be UTC, it is the user of the logical type that needs to de-normalize them if they need to apply offsets (e.g. to represent).

AFAIK this matches with what "is adjusted to UTC" in parquet means:

A TIMESTAMP with isAdjustedToUTC=true is defined as the number of milliseconds, microseconds or nanoseconds (depending on the unit parameter being MILLIS, MICROS or NANOS, respectively) elapsed since the Unix epoch, 1970-01-01 00:00:00 UTC. Each such value unambiguously identifies a single instant on the time-line.

In other words, I think that the previous behavior was correct (but I may be wrong)

@tustvold
Copy link
Contributor Author

I think you might be right, will re-read tomorrow and potentially file a PR. Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unable to write parquet file with UTC timestamp
4 participants