pandas.DeltaTableDataset depends on pyarrow.Table.from_pandas to work properly #610

julio-cmdr · 2024-02-29T17:14:57Z

Description

I had some problems with pandas.DeltaTableDataset when my nodes were returning a DataFrame. For example, running the code below results in the error "name 'sepal_width' present in the specified schema is not found in the columns or index", even with the column sepal_width defined as nullable.

from kedro_datasets.pandas import DeltaTableDataset
import pyarrow as pa

dataset = DeltaTableDataset(
    filepath='data/01_raw/delta_iris',
    save_args={
        'mode': 'overwrite',
        'schema': pa.schema([
            pa.field('sepal_length', pa.float64(), nullable=True),
            pa.field('sepal_width', pa.float64(), nullable=True),
            pa.field('petal_length', pa.float64(), nullable=True),
            pa.field('petal_width', pa.float64(), nullable=True),
            pa.field('species', pa.string(), nullable=False)
        ]),
        'overwrite_schema': True
    }
)

# iris is a pandas DataFrame loaded beforehand with pd.read_csv()
dataset.save(iris.drop(columns=['sepal_width']))

I also had some problems related to the __index_level_0__ column when no schema was specified (see this issue).

Returning pyarrow.Table.from_pandas(df) from the node fixed both problems. Could this function be embedded into pandas.DeltaTableDataset in the next release of kedro-datasets?

Possible Implementation

Embed the pyarrow.Table.from_pandas() call inside the pandas.DeltaTableDataset.save() function.

Possible Alternatives

Use the pyarrow.Table.from_pandas() function in every node return.


astrojuanlu · 2024-02-29T17:23:13Z

Thanks for opening this, @julio-cmdr. I see you mention the iris dataset; can this be reproduced with something like https://github.com/kedro-org/kedro-starters/tree/main/pandas-iris then?

julio-cmdr · 2024-02-29T17:42:41Z

Yes, I think so! In the example above I just used pd.read_csv() to get the iris dataframe.

noklam · 2024-03-05T13:35:53Z

Regarding __index_level_0__: I have seen a case where this column gets created when transcoding from pandas -> spark with parquet. By default pandas.CSVDataset uses index=False, but this is not consistent across the other pandas datasets (ParquetDataset etc.).

github-actions bot mentioned this issue Mar 1, 2024

Monthly issue metrics report kedro-org/kedro#3671

Open

julio-cmdr mentioned this issue Mar 6, 2024

[spike] Clarify status of various Delta Table datasets #542

Open

merelcht transferred this issue from kedro-org/kedro Mar 14, 2024

merelcht added this to the Individual dataset improvements milestone Mar 14, 2024

KrzysztofDoboszInpost mentioned this issue Mar 29, 2024

Pandas DataFrame index not preserved with pandas.DeltaTableDataset #431

Open
