Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ignoring time zone info when import from external files #1341

Merged
merged 2 commits into from Apr 10, 2023

Conversation

v0y4g3r
Copy link
Contributor

@v0y4g3r v0y4g3r commented Apr 6, 2023

I hereby agree to the terms of the GreptimeDB CLA

What's changed and what's your intention?

This PR allows users to import parquet files with explicit time zone info in schema by ignoring the time zone info and treating the i64 value as time elapsed since UNIX epoch.

Why ignoring time zone in arrow array is correct?

According to Timezone Aware Timestamp Parsing - arrow-rs, the time zone info in schema is an indicator of the desired time zone, but the underlying value (the i64 value) is always adjusted to UTC time.

That is being to say, if we can a timestamp in secinds with value 28800 (86060) and time zone "+08:00", what is the UTC time point of this timestamp? Is it "1970-01:01 00:00:00"? Actually no, it "1970-01:01 08:00:00" in UTC and "1970-01:01 16:00:00" in CST.

use arrow_array::timezone::Tz;
#[test]
fn test_parsing_timestamp_to_arrow_array() {
    let arr = TimestampSecondArray::from(vec![28800]).with_timezone("+08:00".to_string());
    let tz: Tz = arr.timezone().unwrap().parse().unwrap();

    println!(
        "Timestamp with time zone: {:?}",
        arr.value_as_datetime_with_tz(0, tz).unwrap().to_string()
    );
}

// it prints: "1970-01-01 16:00:00 +08:00"

Why ignoring time zone in parquet schema is ok?

Parquet has an "isAdjustedToUTC" flag which indicates if the INT96 value in Parquet is already adjusted to UTC time, but arrow/parquet does not provide a method to read that flag. So we can only ignore the flag and treat all INT96 values in parquet's timestamp columns as timestamp relative to UTC time, just like

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.

Refer to a related PR or issue link (optional)

#1319

Copy link
Collaborator

@fengjiachun fengjiachun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

src/datanode/src/error.rs Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Apr 10, 2023

Codecov Report

Merging #1341 (acb839c) into develop (c850e96) will decrease coverage by 0.43%.
The diff coverage is 83.66%.

@@             Coverage Diff             @@
##           develop    #1341      +/-   ##
===========================================
- Coverage    85.90%   85.48%   -0.43%     
===========================================
  Files          509      509              
  Lines        76889    77032     +143     
===========================================
- Hits         66052    65849     -203     
- Misses       10837    11183     +346     

@waynexia waynexia merged commit b7bdee6 into GreptimeTeam:develop Apr 10, 2023
11 checks passed
paomian pushed a commit to paomian/greptimedb that referenced this pull request Oct 19, 2023
…meTeam#1341)

* feat: ignore timezone info when copy from external files

* chore: rebase onto develop
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants