Support dtype_backend="pandas|pyarrow" configuration #9719
Conversation
dask/dataframe/io/parquet/core.py

```diff
@@ -185,7 +186,7 @@ def read_parquet(
     index=None,
     storage_options=None,
     engine="auto",
-    use_nullable_dtypes=False,
+    use_nullable_dtypes: bool | Literal["pandas", "pyarrow"] = False,
```
Btw, @mroeschke this is the PR I mentioned offline about extending `use_nullable_dtypes` to support `"pandas"` and `"pyarrow"`.
Ah yeah, this is pretty clean!

Our pandas issues pandas-dev/pandas#48957 (offline discussion happened here) and pandas-dev/pandas#49997 are examples where there was some discussion/preference for keeping `use_nullable_dtypes` a boolean.
```python
        [True, pd.NA, False, True, False], dtype=f"boolean{nullable_backend}"
    ),
    "c": pd.Series(
        [0.1, 0.2, 0.3, pd.NA, 0.4], dtype=f"Float64{nullable_backend}"
```
Oh nice, I actually didn't know this was case insensitive. (The `Float64` is parsed by pyarrow.)
Yeah, `pyarrow` converts everything to lowercase here. That makes it useful for writing these types of tests, where I want to easily switch between pandas- and pyarrow-backed extension dtypes. Though once pandas-dev/pandas#50094 lands and is released, I could see us using that too!
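To make the pattern concrete, here's a hedged sketch (not the exact test code from this PR): `nullable_backend` is an empty string for pandas-backed extension dtypes or `"[pyarrow]"` for pyarrow-backed ones, so `f"Float64{nullable_backend}"` yields either `"Float64"` or `"Float64[pyarrow]"` — the latter only parses because pyarrow treats type names case-insensitively (pandas' own spelling is `"float64[pyarrow]"`).

```python
import pandas as pd

# Sketch of the parametrized dtype-string pattern (function name illustrative).
# nullable_backend == ""          -> pandas-backed "Float64" extension dtype
# nullable_backend == "[pyarrow]" -> pyarrow-backed dtype (requires pandas>=1.5
#                                    and pyarrow installed)
def make_float_series(nullable_backend: str = "") -> pd.Series:
    return pd.Series(
        [0.1, 0.2, 0.3, pd.NA, 0.4], dtype=f"Float64{nullable_backend}"
    )

s = make_float_series()
print(s.dtype)  # Float64
```

The same series-building code then exercises both backends just by flipping the parametrized suffix.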
So after thinking about it more, I switched this PR to keep `use_nullable_dtypes` strictly a `bool` and instead add a `dataframe.nullable_backend="pandas"|"pyarrow"` config option to determine whether numpy-backed or pyarrow-backed extension dtypes should be used.

@mroeschke I'm curious to get your thoughts on the discussion here #9631 (comment) about whether we should go with `nullable_backend`, or some other name, for the config options in `pandas` / `dask`.
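For reference, a minimal sketch of how the proposed option would be set and read via the dask config system (option name as proposed at this point in the PR; it was later renamed to `dtype_backend`):

```python
import dask

# Hedged sketch: use_nullable_dtypes stays a plain bool, and the backend is
# selected via dask's config system. The option name below is the one
# proposed in this PR; it was later renamed to dataframe.dtype_backend.
with dask.config.set({"dataframe.nullable_backend": "pyarrow"}):
    backend = dask.config.get("dataframe.nullable_backend")

print(backend)  # pyarrow
```

`dask.config.set` works both as a context manager (as above) and as a global setter, so the backend can be chosen per-call-site or once per session.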
dask/dask.yaml
```diff
@@ -12,6 +12,7 @@ dataframe:
   parquet:
     metadata-task-size-local: 512   # Number of files per local metadata-processing task
     metadata-task-size-remote: 16   # Number of files per remote metadata-processing task
+  nullable_backend: "pandas"        # Nullable dtype implementation to use
```
How do you expect this option (and its default) to interact with the corresponding pandas 2.0 config option? When pandas-2 is released, should the default just correspond to whatever the pandas default is?

For example, it would be nice if we were able to use a test like this for pandas-2:

```python
with pd.option_context("io.nullable_backend", "pyarrow"):
    df = pd.read_parquet("tmp.parquet", engine="pyarrow", use_nullable_dtypes=True)
    ddf = dd.read_parquet("tmp.parquet", engine="pyarrow", use_nullable_dtypes=True)
    assert_eq(df, ddf)
```

Do client vs. worker config options make this a challenge?
> Do client vs. worker config options make this a challenge?

It's something we'll definitely need to account for. My guess is the most pleasant user experience will be if we pull the corresponding config value on the client and then embed it into the task graph (like we're doing in this PR). That way users won't need to worry about setting config options on the workers. Regardless, I suspect the implementation will be the same whether we pull the `pandas` or the `dask` config option (see #9711 for an example).

The downside to supporting `pandas` config options is that we wouldn't support all of them. We could explicitly document which ones we do support, and when, but it still might be a source of confusion.

Either way, I think this is a good question to ask. But I'm not too concerned, because there is a smooth path in either direction. If we don't support the `pandas` option, then no changes are needed. If we do, then we can either update the default for the `dask` config value to pull in the current `pandas` option, or deprecate the `dask` config value altogether.
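As a pure-Python sketch of that "resolve on the client, embed in the graph" idea (all names below are illustrative stand-ins, not Dask internals):

```python
from functools import partial

# Illustrative stand-in for config state on the client process.
CLIENT_CONFIG = {"dataframe.nullable_backend": "pyarrow"}

def read_partition(path, nullable_backend):
    # A real implementation would read the Parquet file here.
    return f"{path} read with {nullable_backend}-backed dtypes"

def build_graph(paths):
    # The config value is resolved once, on the client, at graph-construction
    # time and baked into every task via partial(). Workers executing these
    # tasks never consult their own config, so client/worker config drift
    # can't change the result.
    backend = CLIENT_CONFIG["dataframe.nullable_backend"]
    return {
        f"read-{i}": partial(read_partition, path, nullable_backend=backend)
        for i, path in enumerate(paths)
    }

graph = build_graph(["a.parquet", "b.parquet"])
print(graph["read-0"]())  # a.parquet read with pyarrow-backed dtypes
```

The alternative (workers reading their own config at task-execution time) would require users to propagate config to every worker, which is the less pleasant experience described above.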
Changed the title: Support use_nullable_dtypes="pandas|pyarrow" configuration → Support nullable_backend="pandas|pyarrow" configuration
There's a pandas community meeting happening tomorrow where we'll discuss the name of the config option used to specify the dtype backend (currently called `nullable_backend`). Barring any further comments, I'll plan on updating this PR to match whatever comes out of that community meeting (i.e. keep `nullable_backend` or change the name to whatever is decided on) and then merge this PR in.
```python
@pytest.mark.skipif(not PANDAS_GT_150, reason="Requires pyarrow-backed nullable dtypes")
def test_read_decimal_dtype_pyarrow(spark_session, tmpdir):
```
One additional benefit of adding support for pyarrow dtypes is that we actually end up getting better Spark interoperability. For example, I ran into a user group offline who were using Spark with decimal type data. When they tried to read the corresponding Spark-written Parquet dataset, Dask would end up converting the decimals to `object`. With this PR we can now use `dask.config.set({"dataframe.nullable_backend": "pyarrow"})` to read that data in backed by pyarrow's `decimal128` type.

Anyways, that's the context around this test.
I am initially proposing renaming …
Alright, it looks like …
Changed the title: Support nullable_backend="pandas|pyarrow" configuration → Support dtype_backend="pandas|pyarrow" configuration
Woo!
> On Fri, Dec 16, 2022 at 2:39 PM, James Bourbeau wrote: Merged #9719 into main.
This PR updates the `use_nullable_dtypes=` keyword in `read_parquet` to accept `"pandas"` and `"pyarrow"` as valid inputs. The equivalent in `pandas` would be `use_nullable_dtypes=True` plus the new `io.nullable_backend` `pandas` config option that's coming in `pandas=2.0`. I like `use_nullable_dtypes="pandas|pyarrow"` in Dask because it's available now (`pandas=2.0` is scheduled for release later -- I'm just guessing several `dask` releases will happen beforehand) and it avoids needing to use the `pandas` config system.

We might consider going with this over #9711 for having `read_parquet` support reading in pyarrow-backed dtypes.

EDIT: The primary downside to this PR I see is that it's a deviation away from the `pandas` API. The good news is that the logic is well isolated enough that it would be very easy to deprecate in the future, should we want to align on `io.nullable_backend` at some point.

cc @rjzamora @mroeschke for visibility

Closes #9631