Provide Arrow Schema Hint to Parquet Reader #5657

Open
tustvold opened this issue Apr 17, 2024 · 8 comments · May be fixed by #5671
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The parquet reader automatically uses an embedded arrow schema, when one is present, as a hint for type inference during decode. In particular, if the hinted type is compatible with the underlying parquet type, it performs a cast.

Describe the solution you'd like

In situations where the writer was not an arrow writer this schema is not available, and therefore the arrow types are inferred from the parquet schema. This is not always desirable:

Describe alternatives you've considered

Additional context

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted labels Apr 17, 2024
@alamb
Contributor

alamb commented Apr 17, 2024

Here is one potential API

let file = File::open("data.parquet").unwrap();

// NOTE: `with_column_type` is the API being proposed here, not an existing method
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
  // specify column "time" should be UTC
  // will error if this type cannot be read from parquet
  .with_column_type("time", DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())));

println!("Converted arrow schema is: {}", builder.schema());

I am not quite sure how to identify fields of nested types with a single column name.

Like if the parquet file has

{
  "my_object": { 
    "time": "12-01-02"
  }
}

maybe we would refer to the time field like "my_object.time"?
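To make the dotted-path idea concrete, here is a minimal sketch of resolving a name like "my_object.time" against a nested schema. The `FieldType` enum and `resolve_path` function are hypothetical stand-ins invented for illustration; they are not arrow-rs or parquet types, and nothing in this thread says the reader would work this way.

```rust
/// Hypothetical, simplified stand-in for a nested schema (not the arrow-rs types).
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Utf8,
    Int64,
    /// A struct is a list of (child name, child type) pairs.
    Struct(Vec<(String, FieldType)>),
}

/// Resolve a dotted path such as "my_object.time" against the top-level fields.
/// Returns `None` if any path segment is missing or descends into a non-struct.
fn resolve_path<'a>(
    fields: &'a [(String, FieldType)],
    path: &str,
) -> Option<&'a FieldType> {
    let mut parts = path.split('.');
    let first = parts.next()?;
    // Look up the first segment among the top-level fields.
    let mut current = &fields.iter().find(|(name, _)| name == first)?.1;
    for part in parts {
        match current {
            // Only struct fields have children to descend into.
            FieldType::Struct(children) => {
                current = &children.iter().find(|(name, _)| name == part)?.1;
            }
            _ => return None,
        }
    }
    Some(current)
}
```

One caveat with this approach: a field whose own name contains a "." would be ambiguous, which is one argument for passing a whole schema rather than per-column paths.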

@tustvold
Contributor Author

I think my expectation would be for you to provide the SchemaRef for the entire file

@Lordworms
Contributor

Let me try the remaining part if it is ok

@liukun4515
Contributor

liukun4515 commented Apr 18, 2024

I think my expectation would be for you to provide the SchemaRef for the entire file

Basically agree with your idea
In DataFusion, the FileScanConfig of ParquetExec contains the schema for the parquet file, but I think the provided SchemaRef should be optional for the parquet reader when it infers the data types.

@alamb
Contributor

alamb commented Apr 18, 2024

Something like

let file = File::open("data.parquet").unwrap();

// NOTE: `.schema(schema)` as a setter is the API being proposed here
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
  // specify the arrow schema to read from this parquet file
  // will error if the types in the parquet file cannot be converted
  // into the specified types.
  // Will ignore any embedded metadata about types when written
  .schema(schema);

println!("Converted arrow schema is: {}", builder.schema());

@liukun4515
Contributor

liukun4515 commented Apr 19, 2024

Something like

let file = File::open("data.parquet").unwrap();

let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
  // specify the arrow schema to read from this parquet file
  // will error if the types in the parquet file cannot be converted
  // into the specified types.
  // Will ignore any embedded metadata about types when written
  .schema(schema);

println!("Converted arrow schema is: {}", builder.schema());

Do we need to add a check to the schema function that compares the input schema with the schema inferred from the parquet file?

Compatibility is very important for the parquet reader.

@tustvold
Contributor Author

The inference logic is already set up to use the arrow schema as a hint rather than as authoritative; if you give it something invalid, it will just ignore it.
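As a rough illustration of "hint, not authoritative", the sketch below uses a supplied type only when it is compatible with the physical type, and otherwise falls back to the inferred type. Every name here (`ParquetPhysical`, `ArrowType`, `inferred`, `compatible`, `resolve`) is hypothetical, and the small compatibility table is an assumption made for the example; it is not the actual rule set in the parquet crate.

```rust
/// Hypothetical, simplified stand-ins; NOT the arrow-rs or parquet types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetPhysical {
    Int64,
    ByteArray,
}

#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Int64,
    /// Nanosecond timestamp with an optional time zone.
    TimestampNanos(Option<String>),
    Binary,
    Utf8,
}

/// Type chosen when no usable hint is available.
fn inferred(physical: ParquetPhysical) -> ArrowType {
    match physical {
        ParquetPhysical::Int64 => ArrowType::Int64,
        ParquetPhysical::ByteArray => ArrowType::Binary,
    }
}

/// Illustrative compatibility table (an assumption, not the real rules).
fn compatible(physical: ParquetPhysical, hint: &ArrowType) -> bool {
    matches!(
        (physical, hint),
        (ParquetPhysical::Int64, ArrowType::Int64 | ArrowType::TimestampNanos(_))
            | (ParquetPhysical::ByteArray, ArrowType::Binary | ArrowType::Utf8)
    )
}

/// Use the hint if it is compatible; otherwise silently ignore it.
fn resolve(physical: ParquetPhysical, hint: Option<&ArrowType>) -> ArrowType {
    match hint {
        Some(h) if compatible(physical, h) => h.clone(),
        _ => inferred(physical),
    }
}
```

Under this model an incompatible hint never produces an error, so a caller who wants strictness would have to compare the resolved schema with the one they supplied.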

@Lordworms Lordworms linked a pull request Apr 20, 2024 that will close this issue
@liukun4515
Contributor

The inference logic is already set up to use the arrow schema as a hint rather than as authoritative; if you give it something invalid, it will just ignore it.

thanks, got it.


4 participants