Add support for `use_nullable_dtypes` to `dd.read_parquet` #9617
Conversation
This is a very nice addition. I'm curious about the use of the `string[python]` extension dtype vs `string[pyarrow]`. Given that we're restricting nullables to the pyarrow engine, I would think it's consistent to use `string[pyarrow]`, but I may be missing something.
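(For context, a minimal sketch of the two string storage backends being compared; this is illustrative, not code from the PR, and `string[pyarrow]` assumes pandas >= 1.3 with pyarrow installed:)

```python
import pandas as pd

# Same data, two storage backends for the nullable string dtype.
s_py = pd.Series(["a", None, "c"], dtype="string[python]")   # numpy object storage
s_pa = pd.Series(["a", None, "c"], dtype="string[pyarrow]")  # arrow-backed storage

print(s_py.dtype, s_pa.dtype)  # both report the nullable "string" dtype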
This might be because to be able to use …
This is a good point. I've generally been a bit defensive about defaulting to `string[pyarrow]`, both because pyarrow may not be installed and because operation support for arrow-backed strings has been incomplete. The first point is not relevant here, as you point out, and the second is becoming less and less of a problem as pandas implements more operations there, and as we fix issues in Dask. I would note that the user can still change the string storage backend via the pandas config system, but that may be a bit much to ask of most users.
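(A minimal sketch of that config knob, assuming pandas >= 1.4, where the `mode.string_storage` option exists, and pyarrow installed:)

```python
import pandas as pd

# Make plain "string" dtype requests resolve to arrow-backed storage.
pd.set_option("mode.string_storage", "pyarrow")

s = pd.Series(["a", None, "c"], dtype="string")
print(s.dtype.storage)  # "pyarrow"
```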
@ncclementi -- are you referring to #9477, which requires apache/arrow#14080 to be released before …
I think you are referring to #9477, which is fixed in arrow main (and might be in the most recent release from yesterday? We should check). That issue is probably also a problem with … available in pandas <= 1.2.0.
So I'm trying this naively on a dataset:

```python
import dask.dataframe as dd

df = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet",
    split_row_groups=True,
    use_nullable_dtypes=True,
)
df.dtypes
```

I'm not getting anything like nullable dtypes.
For context, that is a single-row-group dataset that comes in at either 10GB if we're not clever about dtypes, or 2GB if we are mildly clever about dtypes.
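(For anyone reproducing that comparison, one way to measure it; this is a sketch rather than code from the thread, and exact numbers depend on the columns and versions involved:)

```python
import dask.dataframe as dd

# Compare the in-memory footprint with and without nullable dtypes;
# dask's memory_usage mirrors the pandas API of the same name.
for nullable in (False, True):
    df = dd.read_parquet(
        "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet",
        split_row_groups=True,
        use_nullable_dtypes=nullable,
    )
    print(nullable, df.memory_usage(deep=True).sum().compute())
```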
I was screwing around with this a little here: https://github.com/mrocklin/nyc-taxi and https://www.youtube.com/watch?v=31MbjVpT2hM. Summary: …
Just cross-linking #9631 (comment), noting that in pandas 2.0 there will also be another global option to make nullable dtypes the default.
dask/dataframe/io/parquet/arrow.py (Outdated)

```python
) -> pd.DataFrame:
    _kwargs = kwargs.get("arrow_to_pandas", {})
    _kwargs.update({"use_threads": False, "ignore_metadata": False})

    if use_nullable_dtypes:
        _kwargs["types_mapper"] = PYARROW_NULLABLE_DTYPE_MAPPING.get
```
More of an FYI: if there is future appetite to get back a pandas DataFrame with any pyarrow type, I think `to_pandas(..., types_mapper=...)` would go from arrow -> numpy -> arrow. To avoid this conversion, I essentially split the `pa.Table` into `pa.ChunkedArray`s and stuck them into each column as a `pd.ArrowExtensionArray`: https://github.com/pandas-dev/pandas/pull/49039/files#diff-868f7f48a0ed35429e240d9be0b98ad9303ceb2a7771b5bd21390eca332b0da4R267
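(A hedged sketch of that approach, assuming pandas >= 1.5, where `pd.arrays.ArrowExtensionArray` was added; details differ from the linked pandas PR:)

```python
import pandas as pd
import pyarrow as pa

table = pa.table({"x": [1, None, 3], "y": ["a", "b", None]})

# Wrap each ChunkedArray directly instead of round-tripping through
# numpy; every column comes back with a pyarrow-backed ArrowDtype.
df = pd.DataFrame(
    {name: pd.arrays.ArrowExtensionArray(table[name]) for name in table.column_names}
)
print(df.dtypes)  # x: int64[pyarrow], y: string[pyarrow]
```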
@jrbourbeau I suspect that you're busy, but I wanted to ping on this in case it was going stale.
(actually I'm just cleaning up my tabs, and this one seemed important)
Yes, this is near the top of my list
Cool. Sorry to prod.
Changed the title: use nullable dtypes in dd.read_parquet → use_nullable_dtypes to dd.read_parquet
Thanks @ian-r-rose!
Woo! Thanks for bringing this over the line!
Merged #9617 into main.
As of pandas==1.2.0 there is a new keyword argument for `pd.read_parquet`, `use_nullable_dtypes`, which makes the parquet reader prefer nullable pandas extension dtypes where appropriate. This includes nullable integers, nullable booleans, and string dtypes (both python and pyarrow).

This implements `use_nullable_dtypes` for Dask. One consequence is that it makes it easier to read parquet files written by other systems with more native `null` support, like Spark or various databases. This does not attempt to read/parse Spark metadata (though a follow-up could), and the user still needs to include `use_nullable_dtypes=True` to get the expected result in the presence of columns with nulls.

A meta comment: pandas is getting more invested in these dtypes, and I wouldn't be surprised to see them become the defaults. `use_nullable_dtypes` will soon also be an option in `read_csv`, and it will also be a global config option in pandas 2.0.
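(To make the behavior concrete, a small pandas round-trip sketch; the filename is hypothetical, and the dask keyword added here behaves analogously:)

```python
import pandas as pd

# A column stored in parquet as int64 with a null.
df = pd.DataFrame({"x": pd.array([1, None, 3], dtype="Int64")})
df.to_parquet("nulls.parquet")

# Default read: the null forces a cast to float64 (null becomes NaN).
print(pd.read_parquet("nulls.parquet").dtypes)  # x: float64

# Nullable read: the integer dtype survives, null becomes pd.NA.
print(pd.read_parquet("nulls.parquet", use_nullable_dtypes=True).dtypes)  # x: Int64
```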