
Add support for use_nullable_dtypes to dd.read_parquet #9617

Merged (14 commits) on Dec 1, 2022

Conversation

@ian-r-rose (Collaborator) commented Nov 2, 2022

As of pandas==1.2.0 there is a new keyword argument for pd.read_parquet, use_nullable_dtypes, which makes the parquet reader prefer nullable pandas extension dtypes where appropriate. This includes nullable integers, nullable booleans, and string dtypes (both python and pyarrow).

This implements use_nullable_dtypes for Dask. One consequence is that it makes it easier to read parquet files written by other systems with more native null support, like Spark or various databases. This does not attempt to read/parse Spark metadata (though a follow-up could), and the user still needs to pass use_nullable_dtypes=True to get the expected result in the presence of columns with nulls.
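
A minimal usage sketch (the file path is hypothetical; the pyarrow engine is assumed, since that is the engine the nullable mapping targets):

import dask.dataframe as dd

# With use_nullable_dtypes=True, integer columns containing nulls come back
# as pandas' nullable Int64 instead of being cast to float64, and boolean
# columns with nulls come back as boolean instead of object.
ddf = dd.read_parquet("data.parquet", engine="pyarrow", use_nullable_dtypes=True)
print(ddf.dtypes)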

A meta comment: pandas is getting more invested in these dtypes, and I wouldn't be surprised to see them become the defaults. use_nullable_dtypes will soon be an option in read_csv as well, and it will also be a global config option in pandas 2.0.

@ian-r-rose added the dataframe and feature (Something is missing) labels Nov 2, 2022
@github-actions bot added the io label Nov 2, 2022
@hayesgb (Contributor) left a comment

This is a very nice addition. I'm curious about the use of the string[python] extension dtype vs string[pyarrow]. Given that we're restricting nullables to the pyarrow engine, I would think it's consistent to use string[pyarrow], but I may be missing something.
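
For reference, both extension dtypes can be constructed directly in pandas; they share the same logical string dtype and differ only in backing storage (string[pyarrow] requires pyarrow to be installed):

import pandas as pd

s_py = pd.Series(["a", None, "b"], dtype="string[python]")   # numpy object-backed
s_pa = pd.Series(["a", None, "b"], dtype="string[pyarrow]")  # arrow-backed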

@ncclementi (Member) commented Nov 3, 2022

> This is a very nice addition. I'm curious about the use of the string[python] extension dtype vs string[pyarrow]. Given that we're restricting nullables to the pyarrow engine, I would think it's consistent to use string[pyarrow], but I may be missing something.

This might be because, to be able to use string[pyarrow], we need to wait for the next arrow release? @ian-r-rose, do you know which PR that was, or am I confusing things here?

@ian-r-rose (Collaborator, Author) commented Nov 3, 2022

> This is a very nice addition. I'm curious about the use of the string[python] extension dtype vs string[pyarrow]. Given that we're restricting nullables to the pyarrow engine, I would think it's consistent to use string[pyarrow], but I may be missing something.

This is a good point. I've generally been a bit defensive about defaulting to string[pyarrow] because

  1. the user might not have pyarrow installed, and
  2. not all operations are supported by string[pyarrow] (in pandas as well as in Dask).

The first point is not relevant here, as you point out, and the second is becoming less and less of a problem as pandas implements more operations and as we fix issues in Dask.

I would note that the user can still change the string storage backend via the pandas config system, but that may be a bit much to ask of most users.
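
A short sketch of that config knob (this is the standard pandas mode.string_storage option; with it set, the plain "string" dtype resolves to arrow-backed storage):

import pandas as pd

# Make the generic "string" dtype use arrow-backed storage globally.
pd.set_option("mode.string_storage", "pyarrow")

s = pd.Series(["a", None], dtype="string")
s.dtype  # StringDtype(storage="pyarrow"), i.e. string[pyarrow]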

@hayesgb (Contributor) commented Nov 3, 2022

@ncclementi -- are you referring to #9477, which requires apache/arrow#14080 to be released before p2p shuffle works reliably with extension dtypes?

@ian-r-rose (Collaborator, Author)
> This might be because, to be able to use string[pyarrow], we need to wait for the next arrow release? @ian-r-rose, do you know which PR that was, or am I confusing things here?

I think you are referring to #9477, which is fixed in arrow main (and might be in the most recent release from yesterday? We should check). That issue is probably also a problem with string[python], so I'm not sure if we need to worry about that here.

@mrocklin (Member) commented Nov 5, 2022

So I'm trying this naively on a dataset:

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet", 
    split_row_groups=True, 
    use_nullable_dtypes=True,
)
df.dtypes
hvfhs_license_num               object
dispatching_base_num            object
originating_base_num            object
request_datetime        datetime64[ns]
on_scene_datetime       datetime64[ns]
pickup_datetime         datetime64[ns]
dropoff_datetime        datetime64[ns]
PULocationID                     int64
DOLocationID                     int64
trip_miles                     float64
trip_time                        int64
base_passenger_fare            float64
tolls                          float64
bcf                            float64
sales_tax                      float64
congestion_surcharge           float64
airport_fee                    float64
tips                           float64
driver_pay                     float64
shared_request_flag             object
shared_match_flag               object
access_a_ride_flag              object
wav_request_flag                object
wav_match_flag                  object
dtype: object

I'm not getting anything like string. My guess is that this is because we don't have metadata telling us that this is a string column. Is that correct?

@mrocklin (Member) commented Nov 5, 2022

For context, that is a single-row-group dataset that comes in at either 10 GB if we're not clever about dtypes, or 2 GB if we are mildly clever about dtypes.

@mrocklin (Member) commented Nov 5, 2022

I was screwing around with this a little here:

https://github.com/mrocklin/nyc-taxi

https://www.youtube.com/watch?v=31MbjVpT2hM

Summary:

  1. I'm curious about how to force string[pyarrow] from the beginning, even if metadata isn't set (see #9631, "Read Parquet directly into string[pyarrow]")
  2. I'd love it if we could get to a point where string[pyarrow] was the default. Do we have a sense for what is broken? Or is this #9523 ("[DNM] Flush out extension dtype issues")?

@hayesgb mentioned this pull request Nov 6, 2022
@mroeschke (Contributor)
> A meta comment: pandas is getting more invested in these dtypes, and I wouldn't be surprised to see them become the defaults. use_nullable_dtypes will soon be an option in read_csv as well, and it will also be a global config option in pandas 2.0.

Just cross-linking #9631 (comment), noting that in pandas 2.0 there will also be another global option to make use_nullable_dtypes=True return pyarrow types (for everything, not just strings).

) -> pd.DataFrame:
    _kwargs = kwargs.get("arrow_to_pandas", {})
    _kwargs.update({"use_threads": False, "ignore_metadata": False})

    if use_nullable_dtypes:
        _kwargs["types_mapper"] = PYARROW_NULLABLE_DTYPE_MAPPING.get
A Contributor commented on this diff:

More of an FYI: if there is future appetite to get back a pandas DataFrame with any pyarrow type, I think to_pandas(..., types_mapper=...) would go from arrow -> numpy -> arrow.

To avoid this conversion, I essentially split the pa.Table into pa.ChunkedArrays and stuck them into each column as a pd.ArrowExtensionArray: https://github.com/pandas-dev/pandas/pull/49039/files#diff-868f7f48a0ed35429e240d9be0b98ad9303ceb2a7771b5bd21390eca332b0da4R267
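
For readers following along, a minimal sketch of the types_mapper pattern from the diff above (the mapping dict here is an illustrative stand-in for Dask's PYARROW_NULLABLE_DTYPE_MAPPING):

import pandas as pd
import pyarrow as pa

# Illustrative mapping from pyarrow types to pandas nullable extension dtypes.
NULLABLE_MAPPING = {
    pa.int64(): pd.Int64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.string(): pd.StringDtype(),
}

table = pa.table({"x": [1, None, 3], "flag": [True, None, False]})

# to_pandas calls types_mapper with each column's arrow type; returning a
# pandas extension dtype selects it, while returning None (dict.get's
# default) falls back to the usual numpy conversion.
df = table.to_pandas(types_mapper=NULLABLE_MAPPING.get)
print(df.dtypes)  # x: Int64, flag: boolean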

@mrocklin (Member)
@jrbourbeau I suspect that you're busy, but I wanted to ping on this in case it was going stale.

@mrocklin (Member)
(actually I'm just cleaning up my tabs, and this one seemed important)

@jrbourbeau (Member)
Yes, this is near the top of my list

@mrocklin (Member) commented Nov 17, 2022 via email

@jrbourbeau changed the title from "use nullable dtypes in dd.read_parquet" to "Add support for use_nullable_dtypes to dd.read_parquet" Nov 24, 2022
@jrbourbeau (Member) left a comment

Thanks @ian-r-rose!

@ian-r-rose (Collaborator, Author) commented Dec 1, 2022 via email
