Read a parquet and ensure certain columns are nullable int #8405

cliffplaysdrums · 2021-11-19T20:01:19Z


I'm reading a parquet file with a number of missing values. This results in my int-type columns getting converted to float. I want to force the read operation to produce pd.Int64Dtype() for those columns, but haven't had any luck.
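To illustrate the dtype difference in plain pandas (no parquet involved), here is a minimal sketch. The sample values are made up for illustration; only the column name comes from my data.

```python
import pandas as pd

# Hypothetical sample data: an integer column with one missing value.
values = [1, None, 3]

# Default inference: the missing value forces an upcast to float64 (NaN).
upcast = pd.Series(values, name="my_int_field")

# Nullable integer dtype: values stay integers, the gap becomes <NA>.
nullable = pd.Series(values, name="my_int_field", dtype=pd.Int64Dtype())

print(upcast.dtype)    # float64
print(nullable.dtype)  # Int64
```

The second form is what I want to get straight out of the parquet read.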

When reading a csv, it's as simple as:
dd.read_csv(urlpath=my_glob, dtype={'my_int_field': pd.Int64Dtype()})

For parquet, I'm using pyarrow as my engine. I've tried passing all combinations of the kwargs dict below to dd.read_parquet:

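import dask.dataframe as dd
import pandas as pd
import pyarrow as pa

def type_mapper(t):
    # Map every Arrow integer type to pandas' nullable Int64.
    if pa.types.is_integer(t):
        return pd.Int64Dtype()
    # Returning None tells pyarrow to use its default conversion.
    return None

kwargs = {
    # Options for opening the pyarrow dataset
    'dataset': {
        'schema': pa.schema([('my_int_field', pa.int64())])
    },
    # Options for the backend read function
    'read': {'use_nullable_dtypes': True},
    # Options for the Arrow-to-pandas conversion
    'arrow_to_pandas': {'types_mapper': type_mapper}
}
my_df = dd.read_parquet(my_glob, kwargs=kwargs)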

The resulting dtype each time is still 'int64', unlike the csv method, which correctly shows 'Int64' (capital I).

Replies: 0 comments