Support dtype_backend="pandas|pyarrow" configuration #9719

Merged
merged 11 commits into dask:main from pyarrow-use-nullable-dtypes on Dec 16, 2022

Conversation

jrbourbeau
Member

@jrbourbeau jrbourbeau commented Dec 5, 2022

This PR updates the use_nullable_dtypes= keyword in read_parquet to accept "pandas" and "pyarrow" as valid inputs (see the sketch after the list below). The equivalent in pandas would be use_nullable_dtypes=True + the new io.nullable_backend pandas config option that's coming in pandas=2.0. I like use_nullable_dtypes="pandas|pyarrow" in Dask because

  1. I find it a bit more ergonomic (my subjective opinion)
  2. This functionality could be made available to Dask users quickly (actually, I'm not sure when pandas=2.0 is scheduled for release -- I'm just guessing several dask releases will happen beforehand)
  3. It lets us remain decoupled from the pandas config system
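
For concreteness, the proposed keyword would be used roughly like this (a sketch; the file path is made up):

import dask.dataframe as dd

# Plain numpy dtypes, same as today (the default)
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes=False)

# pandas' numpy-backed extension (nullable) dtypes
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes="pandas")

# pyarrow-backed extension dtypes
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes="pyarrow")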

We might consider going with this over #9711 as the way to have read_parquet support reading in pyarrow-backed dtypes.

EDIT: The primary downside I see to this PR is that it deviates from the pandas API. The good news is that the logic is isolated well enough that it would be very easy to deprecate in the future, should we want to align on io.nullable_backend at some point.

cc @rjzamora @mroeschke for visibility

Closes #9631

@@ -185,7 +186,7 @@ def read_parquet(
index=None,
storage_options=None,
engine="auto",
use_nullable_dtypes=False,
use_nullable_dtypes: bool | Literal["pandas", "pyarrow"] = False,
Member Author

Btw, @mroeschke this is the PR I mentioned offline about extending use_nullable_dtypes to support "pandas" and "pyarrow"

Contributor

Ah yeah this is pretty clean!

On the pandas side, pandas-dev/pandas#48957 (where the offline discussion happened) and pandas-dev/pandas#49997 are examples where there was some discussion of, and preference for, keeping use_nullable_dtypes boolean.

[True, pd.NA, False, True, False], dtype=f"boolean{nullable_backend}"
),
"c": pd.Series(
[0.1, 0.2, 0.3, pd.NA, 0.4], dtype=f"Float64{nullable_backend}"
Contributor

Oh nice, I actually didn't know this was case insensitive. (The Float64 is parsed by pyarrow)

Member Author

Yeah, pyarrow converts everything to lowercase here. That makes it useful for writing these types of tests, where I want to easily switch between pandas- and pyarrow-backed extension dtypes. Though once pandas-dev/pandas#50094 lands and is released, I could see us using that too!
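
To illustrate (a small sketch, assuming pandas>=1.5 so the "[pyarrow]" dtype strings are available):

import pandas as pd

# "Float64" is pandas' numpy-backed nullable dtype; adding the "[pyarrow]"
# suffix hands the type name to pyarrow, which lowercases it, so
# "Float64[pyarrow]" resolves to the same dtype as "float64[pyarrow]"
s_pandas = pd.Series([0.1, pd.NA], dtype="Float64")
s_arrow = pd.Series([0.1, pd.NA], dtype="Float64[pyarrow]")
print(s_pandas.dtype)  # Float64
print(s_arrow.dtype)   # float64[pyarrow]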

Member Author

@jrbourbeau jrbourbeau left a comment

So after thinking about it more, I switched this PR to keep use_nullable_dtypes strictly a bool and instead add a dataframe.nullable_backend="pandas"|"pyarrow" config option to determine whether numpy-backed or pyarrow-backed extension dtypes should be used.
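
Concretely, usage with this revision looks roughly like the following (a sketch; the file path is made up):

import dask
import dask.dataframe as dd

# use_nullable_dtypes stays a plain bool...
ddf = dd.read_parquet("data.parquet", use_nullable_dtypes=True)

# ...and the config option selects which extension-dtype implementation
# backs the result ("pandas" is the default)
with dask.config.set({"dataframe.nullable_backend": "pyarrow"}):
    ddf = dd.read_parquet("data.parquet", use_nullable_dtypes=True)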

@mroeschke I'm curious to get your thoughts on the discussion here #9631 (comment) about whether we should go with nullable_backend, or some other name, for the config options in pandas / dask

dask/dask.yaml Outdated
@@ -12,6 +12,7 @@ dataframe:
parquet:
metadata-task-size-local: 512 # Number of files per local metadata-processing task
metadata-task-size-remote: 16 # Number of files per remote metadata-processing task
nullable_backend: "pandas" # Nullable dtype implementation to use
Member

@rjzamora rjzamora Dec 9, 2022

How do you expect this option (and its default) to interact with the corresponding pandas 2.0 config option? When pandas-2 is released, should the default just correspond to whatever the pandas default is?

For example, it would be nice if we were able to use a test like this for pandas-2:

with pd.option_context("io.nullable_backend", "pyarrow"):
    df = pd.read_parquet("tmp.parquet", engine="pyarrow", use_nullable_dtypes=True)
    ddf = dd.read_parquet("tmp.parquet", engine="pyarrow", use_nullable_dtypes=True)
assert_eq(df, ddf)

Do client vs worker config options make this a challenge?

Member Author

Do client vs worker config options make this a challenge?

It's something we'll definitely need to account for. My guess is the most pleasant user experience will be if we pull the corresponding config value on the client and then embed it into the task graph (like we're doing in this PR). That way users won't need to worry about setting config options on the workers. Regardless, I suspect the implementation will be the same whether we pull the pandas or dask config option (see #9711 for an example).

The downside to supporting pandas config options is that we wouldn't support all of them. We could explicitly document which ones we do support, and when, but that still might be a source of confusion.

Either way, I think this is a good question to ask. But I'm not too concerned, because there is a smooth path in either direction. If we don't support the pandas option, then no changes are needed. If we do, then we can either update the default for the dask config value to pull in the current pandas option, or deprecate the dask config value altogether.
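
As a toy sketch of the "resolve on the client, embed in the graph" pattern described above (a hypothetical read_partition function, not the actual read_parquet internals):

import dask
from dask import delayed

# Resolve the option once, on the client, at graph-construction time
backend = dask.config.get("dataframe.nullable_backend", default="pandas")

@delayed
def read_partition(path, backend):
    # `backend` arrives as an ordinary task argument, so workers never
    # need to have the dask config option set themselves
    return f"would read {path} with backend={backend}"

task = read_partition("part.0.parquet", backend)
print(task.compute())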

@jrbourbeau jrbourbeau changed the title from Support use_nullable_dtypes="pandas|pyarrow" to Support nullable_backend="pandas|pyarrow" configuration on Dec 9, 2022
Member Author

@jrbourbeau jrbourbeau left a comment

There's a pandas community meeting happening tomorrow where we'll discuss the name of the config option used to specify the dtype backend (currently called nullable_backend). Barring any further comments, I'll plan on updating this PR to match whatever comes out of that meeting (i.e. keep nullable_backend or change the name to whatever is decided on) and then merging it in.



@pytest.mark.skipif(not PANDAS_GT_150, reason="Requires pyarrow-backed nullable dtypes")
def test_read_decimal_dtype_pyarrow(spark_session, tmpdir):
Member Author

One additional benefit of adding support for pyarrow dtypes is that we actually end up getting better Spark interoperability. For example, I ran into a group of users offline who were using Spark with decimal-type data. When they tried to read the corresponding Spark-written Parquet dataset, Dask would convert those columns to object. With this PR, we can now use dask.config.set({"dataframe.nullable_backend": "pyarrow"}) to read that data in backed by pyarrow's decimal128 type.

Anyways, that's the context around this test
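
For reference, the workflow being described looks roughly like this (a sketch; the path and column name are made up, and it assumes this PR is in):

import dask
import dask.dataframe as dd

# A Spark-written Parquet dataset with a decimal column used to come back
# as object dtype; with the pyarrow backend it loads as pyarrow's
# decimal128 type instead
with dask.config.set({"dataframe.nullable_backend": "pyarrow"}):
    ddf = dd.read_parquet("spark-output/", engine="pyarrow", use_nullable_dtypes=True)

print(ddf.dtypes)  # e.g. price    decimal128(10, 2)[pyarrow]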

@mroeschke
Contributor

I am initially proposing renaming nullable_backend in pandas to dtype_backend in pandas-dev/pandas#50291

@jrbourbeau
Member Author

jrbourbeau commented Dec 16, 2022

Alright, it looks like dtype_backend is the front-runner for the new config option name in pandas-dev/pandas#50291. I'm going to update this PR accordingly. I'd like to get this PR included in today's release, as there are users who'd like to try this feature out and need it to be in a release. Luckily we have a nice setup for deprecating config options (we just need to add an entry here), so if a different name is ultimately decided on upstream in pandas, we can deprecate the option here to match. FWIW I mentioned this offline to @rjzamora yesterday and he was okay with it.
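
For context, dask's config-deprecation setup is essentially a table mapping old option names to new ones; here is a minimal sketch of the mechanism (the renamed target below is purely hypothetical):

import warnings

# Old config keys map to their replacements; looking up an old key warns
# and redirects to the new one
deprecations = {"dataframe.dtype_backend": "dataframe.some-future-name"}

def check_deprecations(key):
    if key in deprecations:
        new = deprecations[key]
        warnings.warn(
            f"Dask configuration key {key!r} has been deprecated; "
            f"please use {new!r} instead",
            FutureWarning,
        )
        return new
    return key

check_deprecations("dataframe.dtype_backend")  # warns and returns the new name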

@jrbourbeau jrbourbeau changed the title from Support nullable_backend="pandas|pyarrow" configuration to Support dtype_backend="pandas|pyarrow" configuration on Dec 16, 2022
@jrbourbeau jrbourbeau merged commit 1ac0b11 into dask:main Dec 16, 2022
@jrbourbeau jrbourbeau deleted the pyarrow-use-nullable-dtypes branch December 16, 2022 20:39
@mrocklin
Member

mrocklin commented Dec 16, 2022 via email

Successfully merging this pull request may close these issues:

Read Parquet directly into string[pyarrow]