Provide Arrow Schema Hint to Parquet Reader #5657

tustvold · 2024-04-17T10:49:55Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The parquet reader automatically uses an embedded arrow schema, when present, as a hint for type inference during decode. In particular, if the hinted type is compatible with the underlying parquet type, it performs a cast.

Describe the solution you'd like

In situations where the writer was not an arrow writer, this schema is not available, and the arrow types are therefore inferred from the parquet schema alone. This is not always desirable:

Describe alternatives you've considered

Additional context

alamb · 2024-04-17T17:42:38Z

Here is one potential API:

let file = File::open("data.parquet").unwrap();

let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
  // specify column "time" should be UTC
  // will error if this type can not be read from parquet
  .with_column_type("time", DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())));

println!("Converted arrow schema is: {}", builder.schema());

I am not quite sure how to identify nested fields with a single column name.

For example, if the parquet file has:

{
  "my_object": { 
    "time": "12-01-02"
  }
}

Maybe we would refer to the time field as "my_object.time"?
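To make the dotted-path idea concrete, here is a minimal sketch of resolving a name like "my_object.time" against a nested schema. The `Field` enum and the `resolve` function are hypothetical stand-ins written for this illustration; they are not arrow-rs or parquet types, and the sketch does not address field names that themselves contain a dot.

```rust
// Toy model of a nested schema. `Field` is a hypothetical stand-in,
// not the arrow-rs `Field` type.
#[derive(Debug, PartialEq)]
enum Field {
    Leaf { name: String, type_name: String },
    Struct { name: String, children: Vec<Field> },
}

impl Field {
    fn name(&self) -> &str {
        match self {
            Field::Leaf { name, .. } | Field::Struct { name, .. } => name.as_str(),
        }
    }
}

/// Resolve a dotted path such as "my_object.time" against the top-level fields.
/// Returns `None` if a segment is missing or the path descends through a leaf.
fn resolve<'a>(fields: &'a [Field], path: &str) -> Option<&'a Field> {
    let mut segments = path.split('.');
    let first = segments.next()?;
    let mut field = fields.iter().find(|f| f.name() == first)?;
    for segment in segments {
        match field {
            // Only struct fields have children to descend into.
            Field::Struct { children, .. } => {
                field = children.iter().find(|f| f.name() == segment)?;
            }
            Field::Leaf { .. } => return None,
        }
    }
    Some(field)
}
```

With a schema holding a struct `my_object` that contains a leaf `time`, `resolve(&schema, "my_object.time")` returns the leaf, while `"my_object.missing"` and `"my_object.time.extra"` both return `None`.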

tustvold · 2024-04-17T19:11:57Z

I think my expectation would be for you to provide the SchemaRef for the entire file

Lordworms · 2024-04-17T21:06:30Z

Let me try the remaining part, if that is OK.

liukun4515 · 2024-04-18T05:36:59Z

> I think my expectation would be for you to provide the SchemaRef for the entire file

I basically agree with this idea.
In DataFusion, the FileScanConfig used by ParquetExec contains the schema for the parquet file, but I think the provided SchemaRef should be optional for the parquet reader when it infers the data types.

alamb · 2024-04-18T21:53:51Z

Something like:

let file = File::open("data.parquet").unwrap();

let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
  // specify the arrow schema to read from this parquet file
  // will error if the types in the parquet file can not be converted
  // into the specified types.
  // Will ignore any embedded metadata about types when written
  .schema(schema);

println!("Converted arrow schema is: {}", builder.schema());

liukun4515 · 2024-04-19T03:28:07Z

> Something like
>
> let file = File::open("data.parquet").unwrap();
>
> let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
>   // specify the arrow schema to read from this parquet file
>   // will error if the types in the parquet file can not be converted
>   // into the specified types.
>   // Will ignore any embedded metadata about types when written
>   .schema(schema);
>
> println!("Converted arrow schema is: {}", builder.schema());

Do we need to add a check in the schema function that compares the input schema with the schema inferred from the parquet file?

Compatibility is very important for the parquet reader.

tustvold · 2024-04-19T06:21:08Z

The inference logic is already set up to use the arrow schema as a hint rather than as authoritative; if you give it something invalid, it will just ignore it.
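As a rough illustration of "hint rather than authoritative", here is a toy model of that fallback. Everything in it is an assumption made for the sketch: the `PhysicalType` and `ArrowType` enums, the default mapping in `inferred`, and the rules in `compatible` are simplified stand-ins, not the actual arrow-rs inference code.

```rust
// Toy model only: hypothetical stand-ins, not the parquet or arrow-rs types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Int32,
    Int64,
    Int96,
    ByteArray,
}

#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Int32,
    Int64,
    Date32,
    TimestampNanos(Option<String>),  // optional timezone
    TimestampMicros(Option<String>), // optional timezone
    Binary,
    Utf8,
}

/// Type chosen when no usable hint exists (assumed mapping for illustration).
fn inferred(physical: PhysicalType) -> ArrowType {
    match physical {
        PhysicalType::Int32 => ArrowType::Int32,
        PhysicalType::Int64 => ArrowType::Int64,
        PhysicalType::Int96 => ArrowType::TimestampNanos(None),
        PhysicalType::ByteArray => ArrowType::Binary,
    }
}

/// Whether the hinted type can be produced from the physical type (assumed rules).
fn compatible(physical: PhysicalType, hint: &ArrowType) -> bool {
    use ArrowType::*;
    match physical {
        PhysicalType::Int32 => matches!(hint, Int32 | Date32),
        PhysicalType::Int64 => matches!(hint, Int64 | TimestampNanos(_) | TimestampMicros(_)),
        PhysicalType::Int96 => matches!(hint, TimestampNanos(_) | TimestampMicros(_)),
        PhysicalType::ByteArray => matches!(hint, Binary | Utf8),
    }
}

/// Use the hint when it is compatible; otherwise silently fall back to inference.
fn resolve_type(physical: PhysicalType, hint: Option<&ArrowType>) -> ArrowType {
    match hint {
        Some(h) if compatible(physical, h) => h.clone(),
        _ => inferred(physical),
    }
}
```

In this model a UTC timestamp hint on an `Int96` column is honoured, while the same hint on an `Int32` column is ignored and the column falls back to `Int32`. No error is raised in either case.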

liukun4515 · 2024-04-22T05:24:19Z

> The inference logic is already set up to use the arrow schema as a hint rather than as authoritative; if you give it something invalid, it will just ignore it.

Thanks, got it.

tustvold added the enhancement, good first issue, and help wanted labels on Apr 17, 2024

This was referenced Apr 17, 2024

Coerce parquet int96 timestamps to microsecond precision #5655 (Open)

Account for Timezone when Casting Timestamp to Date32 #5605 (Merged)

liukun4515 mentioned this issue Apr 19, 2024

get error value if timestamp represented by the INT96 in the parquet file apache/datafusion#9981 (Open)

Lordworms linked a pull request Apr 20, 2024 that will close this issue

Provide Arrow Schema Hint to Parquet Reader #5671 (Open)
