Support dtype_backend="pandas|pyarrow" configuration #9719

Merged
merged 11 commits into dask:main from pyarrow-use-nullable-dtypes on Dec 16, 2022

Conversation

jrbourbeau
Member

@jrbourbeau jrbourbeau commented Dec 5, 2022

This PR updates the use_nullable_dtypes= keyword in read_parquet to accept "pandas" and "pyarrow" as valid inputs (see the sketch after the list below). The equivalent in pandas would be use_nullable_dtypes=True + the new io.nullable_backend pandas config option that's coming in pandas=2.0. I like use_nullable_dtypes="pandas|pyarrow" in Dask because

  1. I find it a bit more ergonomic (my subjective opinion)
  2. This functionality could be made available to Dask users quickly (actually, I'm not sure when pandas=2.0 is scheduled for release -- I'm just guessing several dask releases will happen beforehand)
  3. It lets us remain decoupled from the pandas config system
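
For concreteness, the proposed keyword would be used roughly like this (a sketch; the file path is made up):

import dask.dataframe as dd

# Plain numpy dtypes, same as today (the default)
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes=False)

# pandas' numpy-backed extension (nullable) dtypes
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes="pandas")

# pyarrow-backed extension dtypes
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes="pyarrow")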

We might consider going with this over #9711 as the way to have read_parquet support reading in pyarrow-backed dtypes.

EDIT: The primary downside I see to this PR is that it deviates from the pandas API. The good news is that the logic is isolated well enough that it would be very easy to deprecate in the future, should we want to align on io.nullable_backend at some point.

cc @rjzamora @mroeschke for visibility

Closes #9631

@@ -185,7 +186,7 @@ def read_parquet(
index=None,
storage_options=None,
engine="auto",
use_nullable_dtypes=False,
use_nullable_dtypes: bool | Literal["pandas", "pyarrow"] = False,
Member Author

Btw, @mroeschke this is the PR I mentioned offline about extending use_nullable_dtypes to support "pandas" and "pyarrow"

Contributor

Ah yeah this is pretty clean!

On the pandas side, pandas-dev/pandas#48957 (where the offline discussion happened) and pandas-dev/pandas#49997 are examples where there was some discussion of, and preference for, keeping use_nullable_dtypes boolean.

[True, pd.NA, False, True, False], dtype=f"boolean{nullable_backend}"
),
"c": pd.Series(
[0.1, 0.2, 0.3, pd.NA, 0.4], dtype=f"Float64{nullable_backend}"
Contributor

Oh nice, I actually didn't know this was case insensitive. (The Float64 is parsed by pyarrow)

Member Author

Yeah, pyarrow converts everything to lowercase here. That makes it useful for writing these types of tests, where I want to easily switch between pandas- and pyarrow-backed extension dtypes. Though once pandas-dev/pandas#50094 lands and is released, I could see us using that too!
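
To illustrate (a small sketch, assuming pandas>=1.5 so the "[pyarrow]" dtype strings are available):

import pandas as pd

# "Float64" is pandas' numpy-backed nullable dtype; adding the "[pyarrow]"
# suffix hands the type name to pyarrow, which lowercases it, so
# "Float64[pyarrow]" resolves to the same dtype as "float64[pyarrow]"
s_pandas = pd.Series([0.1, pd.NA], dtype="Float64")
s_arrow = pd.Series([0.1, pd.NA], dtype="Float64[pyarrow]")
print(s_pandas.dtype)  # Float64
print(s_arrow.dtype)   # float64[pyarrow]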

Member Author

@jrbourbeau jrbourbeau left a comment

So after thinking about it more, I switched this PR to keep use_nullable_dtypes strictly a bool and instead add a dataframe.nullable_backend="pandas"|"pyarrow" config option to determine whether numpy-backed or pyarrow-backed extension dtypes should be used.
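
Concretely, usage with this revision looks roughly like the following (a sketch; the file path is made up):

import dask
import dask.dataframe as dd

# use_nullable_dtypes stays a plain bool...
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes=True)

# ...and the config option selects which extension-dtype implementation
# backs the result ("pandas" is the default)
with dask.config.set({"dataframe.nullable_backend": "pyarrow"}):
    ddf = dd.read_parquet("data.parquet", use_nullable_dtypes=True)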

@mroeschke I'm curious to get your thoughts on the discussion here #9631 (comment) about whether we should go with nullable_backend, or some other name, for the config options in pandas / dask

dask/dask.yaml Outdated
@@ -12,6 +12,7 @@ dataframe:
parquet:
metadata-task-size-local: 512 # Number of files per local metadata-processing task
metadata-task-size-remote: 16 # Number of files per remote metadata-processing task
nullable_backend: "pandas" # Nullable dtype implementation to use
Member

@rjzamora rjzamora Dec 9, 2022

How do you expect this option (and its default) to interact with the corresponding pandas 2.0 config option? When pandas-2 is released, should the default just correspond to whatever the pandas default is?

For example, it would be nice if we were able to use a test like this for pandas-2:

with pd.option_context("io.nullable_backend", "pyarrow"):
    df = pd.read_parquet("tmp.parquet", engine="pyarrow", use_nullable_dtypes=True)
    ddf = dd.read_parquet("tmp.parquet", engine="pyarrow", use_nullable_dtypes=True)
assert_eq(df, ddf)

Do client vs worker config options make this a challenge?

Member Author

Do client vs worker config options make this a challenge?

It's something we'll definitely need to account for. My guess is the most pleasant user experience will be if we pull the corresponding config value on the client and then embed it into the task graph (like we're doing in this PR). That way users won't need to worry about setting config options on the workers. Regardless, I suspect the implementation will be the same whether we pull the pandas or dask config option (see #9711 for an example).

The downside to supporting pandas config options is that we wouldn't support all of them. We could explicitly document which ones we do support, and when, but that still might be a source of confusion.

Either way, I think this is a good question to ask. But I'm not too concerned, because there is a smooth path in either direction. If we don't support the pandas option, then no changes are needed. If we do, then we can either update the default for the dask config value to pull in the current pandas option, or deprecate the dask config value altogether.
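
As a toy sketch of the "resolve on the client, embed in the graph" pattern described above (a hypothetical read_partition function, not the actual read_parquet internals):

import dask
from dask import delayed

# Resolve the option once, on the client, at graph-construction time
backend = dask.config.get("dataframe.nullable_backend", default="pandas")

@delayed
def read_partition(path, backend):
    # `backend` arrives as an ordinary task argument, so workers never
    # need to have the dask config option set themselves
    return f"would read {path} with backend={backend}"

task = read_partition("part.0.parquet", backend)
print(task.compute())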

@jrbourbeau jrbourbeau changed the title from Support use_nullable_dtypes="pandas|pyarrow" to Support nullable_backend="pandas|pyarrow" configuration on Dec 9, 2022
Member Author

@jrbourbeau jrbourbeau left a comment

There's a pandas community meeting happening tomorrow where we'll discuss the name of the config option used to specify the dtype backend (currently called nullable_backend). Barring any further comments, I'll plan on updating this PR to match whatever comes out of that meeting (i.e. keep nullable_backend or change the name to whatever is decided on) and then merging it in.



@pytest.mark.skipif(not PANDAS_GT_150, reason="Requires pyarrow-backed nullable dtypes")
def test_read_decimal_dtype_pyarrow(spark_session, tmpdir):
Member Author

One additional benefit of adding support for pyarrow dtypes is that we actually end up getting better Spark interoperability. For example, I ran into a group of users offline who were using Spark with decimal-type data. When they tried to read the corresponding Spark-written Parquet dataset, Dask would convert those columns to object. With this PR, we can now use dask.config.set({"dataframe.nullable_backend": "pyarrow"}) to read that data in backed by pyarrow's decimal128 type.

Anyways, that's the context around this test
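
For reference, the workflow being described looks roughly like this (a sketch; the path and column name are made up, and it assumes this PR is in):

import dask
import dask.dataframe as dd

# A Spark-written Parquet dataset with a decimal column used to come back
# as object dtype; with the pyarrow backend it loads as pyarrow's
# decimal128 type instead
with dask.config.set({"dataframe.nullable_backend": "pyarrow"}):
    ddf = dd.read_parquet("spark-output/", engine="pyarrow", use_nullable_dtypes=True)

print(ddf.dtypes)  # e.g. price    decimal128(10, 2)[pyarrow]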

@mroeschke
Contributor

I am initially proposing renaming nullable_backend in pandas to dtype_backend in pandas-dev/pandas#50291

@jrbourbeau
Member Author

jrbourbeau commented Dec 16, 2022

Alright, it looks like dtype_backend is the front-runner for the new config option name in pandas-dev/pandas#50291. I'm going to update this PR accordingly. I'd like to get this PR included in today's release, as there are users who'd like to try this feature out and need it to be in a release. Luckily we have a nice setup for deprecating config options (we just need to add an entry here), so if a different name is ultimately decided on upstream in pandas, we can deprecate the option here to match. FWIW I mentioned this offline to @rjzamora yesterday and he was okay with it.
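
For context, dask's config-deprecation setup is essentially a table mapping old option names to new ones; here is a minimal sketch of the mechanism (the renamed target below is purely hypothetical):

import warnings

# Old config keys map to their replacements; looking up an old key warns
# and redirects to the new one
deprecations = {"dataframe.dtype_backend": "dataframe.some-future-name"}

def check_deprecations(key):
    if key in deprecations:
        new = deprecations[key]
        warnings.warn(
            f"Dask configuration key {key!r} has been deprecated; "
            f"please use {new!r} instead",
            FutureWarning,
        )
        return new
    return key

check_deprecations("dataframe.dtype_backend")  # warns and returns the new name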

@jrbourbeau jrbourbeau changed the title from Support nullable_backend="pandas|pyarrow" configuration to Support dtype_backend="pandas|pyarrow" configuration on Dec 16, 2022
@jrbourbeau jrbourbeau merged commit 1ac0b11 into dask:main Dec 16, 2022
@jrbourbeau jrbourbeau deleted the pyarrow-use-nullable-dtypes branch December 16, 2022 20:39
@mrocklin
Member

mrocklin commented Dec 16, 2022 via email

Successfully merging this pull request may close these issues:

Read Parquet directly into string[pyarrow]