Coerce parquet int96 timestamps to microsecond precision #5655

Open
ion-elgreco opened this issue Apr 16, 2024 · 7 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@ion-elgreco

ion-elgreco commented Apr 16, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Spark, annoyingly, writes the deprecated int96 timestamps by default, even though the source data is microsecond precision.

Arrow-rs coerces these to nanosecond precision during reads. PyArrow does the same by default, but its parquet reader exposes a flag that coerces int96 timestamps from nanosecond to microsecond precision.

Describe the solution you'd like
Add an option to the parquet reader that handles this coercion to microsecond precision for int96 timestamps.

Describe alternatives you've considered

Additional context

This would help delta-rs read Spark-created parquet tables and prevent schema mismatches.

@ion-elgreco ion-elgreco added the enhancement Any new improvement worthy of an entry in the changelog label Apr 16, 2024
@tustvold
Contributor

I presume the data is actually nanosecond precision in the parquet data, and you just want the parquet reader to cast it to microsecond precision?

If so, I think we can solve this and related issues with #5657

@ion-elgreco
Author

@tustvold to my understanding it's actually microsecond precision but it's saved as a logical int96

@liukun4515
Contributor

liukun4515 commented Apr 17, 2024

@tustvold to my understanding it's actually microsecond precision but it's saved as a logical int96

I think in parquet, the int96 physical type represents a timestamp with nanosecond unit.

cc @ion-elgreco
Is there any system that writes parquet files with the int96 physical type where the unit is not nanoseconds?

@liukun4515
Contributor

liukun4515 commented Apr 17, 2024

I found the definition of int96 in the deprecated doc: https://github.com/xhochy/parquet-format/blob/cb4727767823ae201fd567f67825cc22834c20e9/LogicalTypes.md#int96-timestamps-also-called-impala_timestamp

(deprecated) Timestamps saved as an int96 are made up of the nanoseconds in the day (first 8 byte) and the Julian day (last 4 bytes). No timezone is attached to this value. To convert the timestamp into nanoseconds since the Unix epoch, 00:00:00.000000 on 1 January 1970, the following formula can be used: (julian_day - 2440588) * (86400 * 1000 * 1000 * 1000) + nanoseconds. The magic number 2440588 is the julian day for 1 January 1970.

Note that these timestamps are the common usage of the int96 physical type and are not marked with a special logical type annotation.
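The formula in that quote, as a direct Python transcription (the function name is mine, not from the spec):

```python
JULIAN_DAY_UNIX_EPOCH = 2_440_588  # Julian day for 1 January 1970
NANOS_PER_DAY = 86_400 * 1_000 * 1_000 * 1_000

def int96_to_unix_nanos(julian_day: int, nanos_of_day: int) -> int:
    """Convert an int96 timestamp (Julian day + nanoseconds within the day)
    to nanoseconds since the Unix epoch, per the deprecated spec formula."""
    return (julian_day - JULIAN_DAY_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
```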


@ion-elgreco
Author

ion-elgreco commented Apr 17, 2024

@liukun4515 I rather meant the source data: in a Spark DataFrame it's microsecond precision, but the default Spark configuration saves it as int96. They still haven't changed that, unfortunately, so you see a lot of parquet data floating around that is stored as int96 (nanosecond precision, I guess) but is actually just microsecond precision.

Seems that Hive and Impala do the same thing.

@mapleFU
Member

mapleFU commented Apr 17, 2024

Can you provide a link for Spark's behavior? I found Int96 mentioned here, but I wonder how it defines it specifically: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

And if ConvertedType is defined for it, would it be better to choose the parsing mode from ConvertedType?

@ion-elgreco
Author

ion-elgreco commented Apr 21, 2024

@mapleFU, the default is INT96 because they want to keep compatibility with Hive:
https://spark.apache.org/docs/3.5.1/configuration.html#content
