Coerce parquet int96 timestamps to microsecond precision #5655

Open
ion-elgreco opened this issue Apr 16, 2024 · 7 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@ion-elgreco

ion-elgreco commented Apr 16, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Spark, annoyingly, writes the deprecated int96 timestamps by default, even though the source data is microsecond precision.

Arrow-rs coerces these to nanosecond precision during reads. PyArrow does the same by default, but its parquet reader exposes a flag that coerces int96 timestamps from nanosecond to microsecond precision.

Describe the solution you'd like
Add an option to the parquet reader that handles this coercion to microsecond precision for int96 timestamps.

Describe alternatives you've considered

Additional context

This would help delta-rs read Spark-created parquet tables and prevent schema mismatches.

@ion-elgreco ion-elgreco added the enhancement Any new improvement worthy of an entry in the changelog label Apr 16, 2024
@tustvold
Contributor

I presume the data is actually nanosecond precision in the parquet data, and you just want the parquet reader to cast it to microsecond precision?

If so, I think we can solve this and related issues with #5657

@ion-elgreco
Author

@tustvold to my understanding it's actually microsecond precision but it's saved as a logical int96

@liukun4515
Contributor

liukun4515 commented Apr 17, 2024

@tustvold to my understanding it's actually microsecond precision but it's saved as a logical int96

I think in parquet, the int96 physical type represents a timestamp with nanosecond unit.

cc @ion-elgreco
Is there any system that writes parquet files with the int96 physical type where the unit is not nanoseconds?

@liukun4515
Contributor

liukun4515 commented Apr 17, 2024

I found the definition of int96 in the deprecated doc: https://github.com/xhochy/parquet-format/blob/cb4727767823ae201fd567f67825cc22834c20e9/LogicalTypes.md#int96-timestamps-also-called-impala_timestamp

(deprecated) Timestamps saved as an int96 are made up of the nanoseconds in the day (first 8 byte) and the Julian day (last 4 bytes). No timezone is attached to this value. To convert the timestamp into nanoseconds since the Unix epoch, 00:00:00.000000 on 1 January 1970, the following formula can be used: (julian_day - 2440588) * (86400 * 1000 * 1000 * 1000) + nanoseconds. The magic number 2440588 is the julian day for 1 January 1970.

Note that these timestamps are the common usage of the int96 physical type and are not marked with a special logical type annotation.
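The formula in that quote, as a direct Python transcription (the function name is mine, not from the spec):

```python
JULIAN_DAY_UNIX_EPOCH = 2_440_588  # Julian day for 1 January 1970
NANOS_PER_DAY = 86_400 * 1_000 * 1_000 * 1_000

def int96_to_unix_nanos(julian_day: int, nanos_of_day: int) -> int:
    """Convert an int96 timestamp (Julian day + nanoseconds within the day)
    to nanoseconds since the Unix epoch, per the deprecated spec formula."""
    return (julian_day - JULIAN_DAY_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
```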


@ion-elgreco
Author

ion-elgreco commented Apr 17, 2024

@liukun4515 I rather meant the source data: in a Spark DataFrame it's microsecond precision, but the default Spark configuration saves it as int96. They still haven't changed that, unfortunately, so you see a lot of parquet data floating around that is stored as int96 (nanosecond precision, I guess) but is actually just microsecond precision.

Seems that Hive and Impala do the same thing.

@mapleFU
Member

mapleFU commented Apr 17, 2024

Can you provide a link for Spark's behavior? I found Int96 mentioned here, but I wonder how it defines it specifically: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

And if ConvertedType is defined for it, would it be better to choose the parsing mode from ConvertedType?

@ion-elgreco
Author

ion-elgreco commented Apr 21, 2024

@mapleFU, the default is INT96 because they want to keep compatibility with Hive:
https://spark.apache.org/docs/3.5.1/configuration.html#content
