write duration to parquet but read as int64 #5625

Open
Liyixin95 opened this issue Apr 11, 2024 · 0 comments · May be fixed by #5626
Describe the bug

As the title says, ParquetRecordBatchReader does not recognize the duration type written by pandas or polars: the column is read back as Int64 instead.

To Reproduce

First, prepare a parquet file with polars:

import polars as pl
from datetime import timedelta

df = pl.DataFrame({
    "a":  [timedelta(days=1) for _ in range(100)]
})

df.write_parquet("./test.parquet")

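Then read it in Rust with arrow-rs:

use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the parquet file written by the polars script.
    let path = "./test.parquet";
    let file = File::open(path)?;

    let parquet_reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;

    // Collect every record batch, propagating any read error.
    let mut batches = Vec::new();

    for batch in parquet_reader {
        batches.push(batch?);
    }

    // Print the Arrow schema the reader inferred for the file.
    println!("{:#?}", batches[0].schema());

    Ok(())
}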

Finally, we get the following schema, where column "a" is reported as Int64 rather than a duration:

Schema {
    fields: [
        Field {
            name: "a",
            data_type: Int64,
            nullable: true,
            dict_id: 0,
            dict_is_ordered: false,
            metadata: {},
        },
    ],
    metadata: {},
}
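
To make the symptom concrete, here is a minimal sketch of what the Int64 values would look like and how they map back to durations. It assumes the column stores raw microsecond counts, which matches the duration[μs] dtype polars reports for this file; the helper names duration_to_int64 and int64_to_duration are hypothetical and not part of polars, pandas, or arrow-rs.

```python
from datetime import timedelta

# Assumption: durations are stored as microsecond counts, matching the
# duration[μs] dtype that polars reports for this column.
MICROSECOND = timedelta(microseconds=1)


def duration_to_int64(d: timedelta) -> int:
    """Raw integer a reader would see if the duration type is lost."""
    return d // MICROSECOND


def int64_to_duration(raw: int) -> timedelta:
    """Reinterpret a raw microsecond count as a duration."""
    return timedelta(microseconds=raw)


raw = duration_to_int64(timedelta(days=1))
print(raw)                     # 86400000000
print(int64_to_duration(raw))  # 1 day, 0:00:00
```

Under that assumption, each row written as timedelta(days=1) would surface as 86400000000 in the Int64 column, so the data itself is intact and only the type information is missing.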

Expected behavior

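The column should be read back as a duration type, as both polars and pandas do when reading the same file.

polars result: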

shape: (100, 1)
┌──────────────┐
│ a            │
│ ---          │
│ duration[μs] │
╞══════════════╡
│ 1d           │
│ 1d           │
│ 1d           │
│ 1d           │
│ 1d           │
│ …            │
│ 1d           │
│ 1d           │
│ 1d           │
│ 1d           │
│ 1d           │
└──────────────┘

pandas result:

        a
0  1 days
1  1 days
2  1 days
3  1 days
4  1 days
..    ...
95 1 days
96 1 days
97 1 days
98 1 days
99 1 days

[100 rows x 1 columns]

Additional context

Liyixin95 added the bug label Apr 11, 2024
Liyixin95 linked a pull request Apr 11, 2024 that will close this issue