
Add support for use_nullable_dtypes to dd.read_parquet #9617

Merged (14 commits) on Dec 1, 2022

Conversation

@ian-r-rose (Collaborator) commented Nov 2, 2022

As of pandas==1.2.0 there is a new keyword argument for pd.read_parquet, use_nullable_dtypes, which makes the parquet reader prefer nullable pandas extension dtypes where appropriate. This includes nullable integers, nullable booleans, and string dtypes (both python and pyarrow).

This implements use_nullable_dtypes for Dask. One consequence is that it makes it easier to read parquet files written by other systems with more native null support, like Spark or various databases. This does not attempt to read/parse Spark metadata (though a follow-up could), and the user still needs to pass use_nullable_dtypes=True to get the expected result in the presence of columns with nulls.
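
A minimal usage sketch (the file path is hypothetical; the pyarrow engine is assumed, since that is the engine the nullable mapping targets):

import dask.dataframe as dd

# With use_nullable_dtypes=True, integer columns containing nulls come back
# as pandas' nullable Int64 instead of being cast to float64, and boolean
# columns with nulls come back as boolean instead of object.
ddf = dd.read_parquet("data.parquet", engine="pyarrow", use_nullable_dtypes=True)
print(ddf.dtypes)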

A meta comment: pandas is getting more invested in these dtypes, and I wouldn't be surprised to see them become the defaults. use_nullable_dtypes will soon be an option in read_csv as well, and it will also be a global config option in pandas 2.0.

@ian-r-rose added the dataframe and feature (Something is missing) labels Nov 2, 2022
@github-actions bot added the io label Nov 2, 2022
@hayesgb (Contributor) left a comment

This is a very nice addition. I'm curious about the use of the string[python] extension dtype vs string[pyarrow]. Given that we're restricting nullables to the pyarrow engine, I would think it's consistent to use string[pyarrow], but I may be missing something.
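
For reference, both extension dtypes can be constructed directly in pandas; they share the same logical string dtype and differ only in backing storage (string[pyarrow] requires pyarrow to be installed):

import pandas as pd

s_py = pd.Series(["a", None, "b"], dtype="string[python]")   # numpy object-backed
s_pa = pd.Series(["a", None, "b"], dtype="string[pyarrow]")  # arrow-backed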

@ncclementi (Member) commented Nov 3, 2022

> This is a very nice addition. I'm curious about the use of the string[python] extension dtype vs string[pyarrow]. Given that we're restricting nullables to the pyarrow engine, I would think it's consistent to use string[pyarrow], but I may be missing something.

This might be because, to be able to use string[pyarrow], we need to wait for the next arrow release? @ian-r-rose, do you know which PR that was, or am I confusing things here?

@ian-r-rose (Collaborator, Author) commented Nov 3, 2022

> This is a very nice addition. I'm curious about the use of the string[python] extension dtype vs string[pyarrow]. Given that we're restricting nullables to the pyarrow engine, I would think it's consistent to use string[pyarrow], but I may be missing something.

This is a good point. I've generally been a bit defensive about defaulting to string[pyarrow] because

  1. the user might not have pyarrow installed, and
  2. not all operations are supported by string[pyarrow] (in pandas as well as in Dask).

The first point is not relevant here, as you point out, and the second is becoming less and less of a problem as pandas implements more operations and as we fix issues in Dask.

I would note that the user can still change the string storage backend via the pandas config system, but that may be a bit much to ask of most users.
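
A short sketch of that config knob (this is the standard pandas mode.string_storage option; with it set, the plain "string" dtype resolves to arrow-backed storage):

import pandas as pd

# Make the generic "string" dtype use arrow-backed storage globally.
pd.set_option("mode.string_storage", "pyarrow")

s = pd.Series(["a", None], dtype="string")
s.dtype  # StringDtype(storage="pyarrow"), i.e. string[pyarrow]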

@hayesgb (Contributor) commented Nov 3, 2022

@ncclementi -- are you referring to #9477, which requires apache/arrow#14080 to be released before p2p shuffle works reliably with extension dtypes?

@ian-r-rose (Collaborator, Author)
> This might be because, to be able to use string[pyarrow], we need to wait for the next arrow release? @ian-r-rose, do you know which PR that was, or am I confusing things here?

I think you are referring to #9477, which is fixed in arrow main (and might be in the most recent release from yesterday? We should check). That issue is probably also a problem with string[python], so I'm not sure if we need to worry about that here.

@mrocklin (Member) commented Nov 5, 2022

So I'm trying this naively on a dataset:

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet", 
    split_row_groups=True, 
    use_nullable_dtypes=True,
)
df.dtypes
hvfhs_license_num               object
dispatching_base_num            object
originating_base_num            object
request_datetime        datetime64[ns]
on_scene_datetime       datetime64[ns]
pickup_datetime         datetime64[ns]
dropoff_datetime        datetime64[ns]
PULocationID                     int64
DOLocationID                     int64
trip_miles                     float64
trip_time                        int64
base_passenger_fare            float64
tolls                          float64
bcf                            float64
sales_tax                      float64
congestion_surcharge           float64
airport_fee                    float64
tips                           float64
driver_pay                     float64
shared_request_flag             object
shared_match_flag               object
access_a_ride_flag              object
wav_request_flag                object
wav_match_flag                  object
dtype: object

I'm not getting anything like string. My guess is that this is because we don't have metadata telling us that this is a string column. Is that correct?

@mrocklin (Member) commented Nov 5, 2022

For context, that is a single-row-group dataset that comes in at either 10 GB if we're not clever about dtypes, or 2 GB if we are mildly clever about dtypes.

@mrocklin (Member) commented Nov 5, 2022

I was screwing around with this a little here:

https://github.com/mrocklin/nyc-taxi

https://www.youtube.com/watch?v=31MbjVpT2hM

Summary:

  1. I'm curious about how to force string[pyarrow] from the beginning, even if metadata isn't set (see #9631, "Read Parquet directly into string[pyarrow]")
  2. I'd love it if we could get to a point where string[pyarrow] was the default. Do we have a sense for what is broken? Or is this #9523 ("[DNM] Flush out extension dtype issues")?

@hayesgb mentioned this pull request Nov 6, 2022
@mroeschke (Contributor)
> A meta comment: pandas is getting more invested in these dtypes, and I wouldn't be surprised to see them become the defaults. use_nullable_dtypes will soon be an option in read_csv as well, and it will also be a global config option in pandas 2.0.

Just cross-linking #9631 (comment), noting that in pandas 2.0 there will also be another global option to make use_nullable_dtypes=True return pyarrow types (for everything, not just strings).

) -> pd.DataFrame:
    _kwargs = kwargs.get("arrow_to_pandas", {})
    _kwargs.update({"use_threads": False, "ignore_metadata": False})

    if use_nullable_dtypes:
        _kwargs["types_mapper"] = PYARROW_NULLABLE_DTYPE_MAPPING.get
A Contributor commented on this diff:

More of an FYI: if there is future appetite to get back a pandas DataFrame with any pyarrow type, I think to_pandas(..., types_mapper=...) would go from arrow -> numpy -> arrow.

To avoid this conversion, I essentially split the pa.Table into pa.ChunkedArrays and stuck them into each column as a pd.ArrowExtensionArray: https://github.com/pandas-dev/pandas/pull/49039/files#diff-868f7f48a0ed35429e240d9be0b98ad9303ceb2a7771b5bd21390eca332b0da4R267
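
For readers following along, a minimal sketch of the types_mapper pattern from the diff above (the mapping dict here is an illustrative stand-in for Dask's PYARROW_NULLABLE_DTYPE_MAPPING):

import pandas as pd
import pyarrow as pa

# Illustrative mapping from pyarrow types to pandas nullable extension dtypes.
NULLABLE_MAPPING = {
    pa.int64(): pd.Int64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.string(): pd.StringDtype(),
}

table = pa.table({"x": [1, None, 3], "flag": [True, None, False]})

# to_pandas calls types_mapper with each column's arrow type; returning a
# pandas extension dtype selects it, while returning None (dict.get's
# default) falls back to the usual numpy conversion.
df = table.to_pandas(types_mapper=NULLABLE_MAPPING.get)
print(df.dtypes)  # x: Int64, flag: boolean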

@mrocklin (Member)
@jrbourbeau I suspect that you're busy, but I wanted to ping on this in case it was going stale.

@mrocklin (Member)
(actually I'm just cleaning up my tabs, and this one seemed important)

@jrbourbeau (Member)
Yes, this is near the top of my list

@mrocklin (Member) commented Nov 17, 2022 via email

@jrbourbeau changed the title from "use nullable dtypes in dd.read_parquet" to "Add support for use_nullable_dtypes to dd.read_parquet" Nov 24, 2022
@jrbourbeau (Member) left a comment

Thanks @ian-r-rose!

@ian-r-rose (Collaborator, Author) commented Dec 1, 2022 via email
