Coerce parquet int96 timestamps to microsecond precision #5655
Comments
I presume the data is actually nanosecond precision in the parquet file, and you just want the parquet reader to cast it to microsecond precision? If so, I think we can solve this and related issues with #5657
@tustvold To my understanding it's actually microsecond precision, but it's saved as an int96.
I think in the parquet, the physical type of … cc @ion-elgreco
I found the definition of int96 in the deprecated doc: https://github.com/xhochy/parquet-format/blob/cb4727767823ae201fd567f67825cc22834c20e9/LogicalTypes.md#int96-timestamps-also-called-impala_timestamp
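For reference, per that doc an INT96 timestamp is 12 little-endian bytes: an 8-byte nanoseconds-of-day count followed by a 4-byte Julian day number, which is why readers naturally surface it as nanoseconds. A minimal decoding sketch (the helper `int96_to_nanos` is just for illustration):

```rust
/// Julian day number of 1970-01-01 (the Unix epoch).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Decode a raw 12-byte INT96 value into nanoseconds since the Unix epoch:
/// bytes 0..8 hold nanoseconds within the day, bytes 8..12 the Julian day.
fn int96_to_nanos(raw: [u8; 12]) -> i64 {
    let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(raw[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_of_day
}

fn main() {
    // Julian day 2440589 = 1970-01-02, plus 1 ns into the day.
    let mut raw = [0u8; 12];
    raw[0..8].copy_from_slice(&1i64.to_le_bytes());
    raw[8..12].copy_from_slice(&2_440_589i32.to_le_bytes());
    assert_eq!(int96_to_nanos(raw), NANOS_PER_DAY + 1);
}
```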
@liukun4515 I meant rather the source data: in a Spark DataFrame it's microsecond precision, but Spark's default conf settings save it as int96. They still haven't changed that, unfortunately, so you see a lot of parquet data floating around that is stored as int96 (nanosecond precision, I guess) but is actually just microsecond precision. It seems Hive and Impala do the same thing.
Can you provide a link for Spark's behavior? I found int96 mentioned here, but I wonder how it defines it specifically: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html And if …
@mapleFU The default is INT96 because they want to keep compatibility with Hive.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Spark annoyingly writes the deprecated int96 timestamps by default, even though the source data is microsecond precision.
arrow-rs coerces these to nanosecond precision during reads. PyArrow does the same by default, but its parquet reader has a flag (`coerce_int96_timestamp_unit`) that coerces int96 timestamps from nanosecond to microsecond precision.
Describe the solution you'd like
Add an option to the parquet reader that coerces int96 timestamps to microsecond precision.
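For illustration only, something along these lines; `with_int96_timestamp_unit` is a made-up name sketching the kind of option meant, not an existing arrow-rs API:

```rust
use std::fs::File;

use arrow::datatypes::TimeUnit;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> parquet::errors::Result<()> {
    let file = File::open("spark_table.parquet").expect("example file");

    // Hypothetical option (does not exist yet): ask the reader to surface
    // INT96 columns as Timestamp(Microsecond, _) instead of Nanosecond.
    let options = ArrowReaderOptions::new()
        .with_int96_timestamp_unit(TimeUnit::Microsecond);

    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    for batch in reader {
        println!("{:?}", batch?.schema());
    }
    Ok(())
}
```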
Describe alternatives you've considered
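Casting each column after reading works today, e.g. with arrow's `cast` kernel (a sketch; a real reader would pull the array out of every `RecordBatch`), but it means every consumer has to know to do this:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // Stand-in for a column the parquet reader produced from INT96 data.
    let ns: ArrayRef = Arc::new(TimestampNanosecondArray::from(vec![
        1_700_000_000_123_456_789,
    ]));

    // Truncate to microsecond precision after the fact.
    let us = cast(&ns, &DataType::Timestamp(TimeUnit::Microsecond, None)).unwrap();
    assert_eq!(us.data_type(), &DataType::Timestamp(TimeUnit::Microsecond, None));
}
```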
Additional context
This would help delta-rs read Spark-created parquet tables and prevent schema mismatches.