Support dtype_backend="pandas|pyarrow" configuration #9719
Changes from 9 commits
```diff
@@ -1,19 +1,22 @@
+import decimal
 import signal
 import sys
 import threading

 import pytest

 import dask
 from dask.datasets import timeseries

 dd = pytest.importorskip("dask.dataframe")
 pyspark = pytest.importorskip("pyspark")
-pytest.importorskip("pyarrow")
+pa = pytest.importorskip("pyarrow")
 pytest.importorskip("fastparquet")

 import numpy as np
 import pandas as pd

 from dask.dataframe._compat import PANDAS_GT_150
 from dask.dataframe.utils import assert_eq

 pytestmark = pytest.mark.skipif(
```
```diff
@@ -149,3 +152,43 @@ def test_roundtrip_parquet_spark_to_dask_extension_dtypes(spark_session, tmpdir):
         [pd.api.types.is_extension_array_dtype(dtype) for dtype in ddf.dtypes]
     ), ddf.dtypes
     assert_eq(ddf, pdf, check_index=False)
+
+
+@pytest.mark.skipif(not PANDAS_GT_150, reason="Requires pyarrow-backed nullable dtypes")
+def test_read_decimal_dtype_pyarrow(spark_session, tmpdir):
+    tmpdir = str(tmpdir)
+    npartitions = 3
+    size = 6
+
+    decimal_data = [
+        decimal.Decimal("8093.234"),
+        decimal.Decimal("8094.234"),
+        decimal.Decimal("8095.234"),
+        decimal.Decimal("8096.234"),
+        decimal.Decimal("8097.234"),
+        decimal.Decimal("8098.234"),
+    ]
+    pdf = pd.DataFrame(
+        {
+            "a": range(size),
+            "b": decimal_data,
+        }
+    )
+    sdf = spark_session.createDataFrame(pdf)
+    sdf = sdf.withColumn("b", sdf["b"].cast(pyspark.sql.types.DecimalType(7, 3)))
+    # We are not overwriting any data, but spark complains if the directory
+    # already exists (as tmpdir does) and we don't set overwrite
+    sdf.repartition(npartitions).write.parquet(tmpdir, mode="overwrite")
+
+    with dask.config.set({"dataframe.nullable_backend": "pyarrow"}):
+        ddf = dd.read_parquet(tmpdir, engine="pyarrow", use_nullable_dtypes=True)
+        assert ddf.b.dtype.pyarrow_dtype == pa.decimal128(7, 3)
+        assert ddf.b.compute().dtype.pyarrow_dtype == pa.decimal128(7, 3)
+        expected = pdf.astype(
+            {
+                "a": "int64[pyarrow]",
+                "b": pd.ArrowDtype(pa.decimal128(7, 3)),
+            }
+        )
+
+        assert_eq(ddf, expected, check_index=False)
```

Review comment on `test_read_decimal_dtype_pyarrow`:

One additional benefit of adding support for pyarrow dtypes is that we actually end up getting better Spark interoperability. For example, I ran into a user group offline who were using Spark with decimal-type data. When they tried to read in the corresponding Spark-written Parquet dataset, Dask would end up converting the decimals to `object` dtype. Anyways, that's the context around this test.
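For concreteness, a sketch of the behavior difference that comment describes; the path and column name here are hypothetical, assuming a Spark-written Parquet dataset with a `decimal(7, 3)` column `b`:

```python
import dask
import dask.dataframe as dd

path = "spark-output.parquet"  # hypothetical Spark-written dataset

# Without the config option, decimal columns come back as object dtype
ddf = dd.read_parquet(path, engine="pyarrow")
print(ddf.b.dtype)  # object

# With the pyarrow backend enabled, the decimal type is preserved
with dask.config.set({"dataframe.nullable_backend": "pyarrow"}):
    ddf = dd.read_parquet(path, engine="pyarrow", use_nullable_dtypes=True)
print(ddf.b.dtype)  # decimal128(7, 3)[pyarrow]
```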
How do you expect this option (and its default) to interact with the corresponding pandas 2.0 config option? When pandas 2.0 is released, should the default just correspond to whatever the pandas default is?
For example, it would be nice if we were able to use a test like this for pandas 2 (sketched below):
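A hypothetical sketch of such a test, assuming pandas 2 exposes an `io.nullable_backend` option; the option name mirrors the dask config key and is an assumption here, as is the test itself:

```python
# Hypothetical: dask follows the pandas option instead of (or as the
# default for) its own config. The "io.nullable_backend" option name
# is an assumption.
import pandas as pd
import pytest

import dask.dataframe as dd

pytest.importorskip("pyarrow")


def test_read_parquet_follows_pandas_option(tmpdir):
    path = str(tmpdir)
    pd.DataFrame({"a": range(6)}).to_parquet(path + "/part.0.parquet")

    with pd.option_context("io.nullable_backend", "pyarrow"):
        ddf = dd.read_parquet(path, use_nullable_dtypes=True)
    assert ddf.a.dtype == "int64[pyarrow]"
```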
Do client vs. worker config options make this a challenge?
It's something we'll definitely need to account for. My guess is the most pleasant user experience will be if we pull the corresponding config value on the client and then embed it into the task graph (like we're doing in this PR). That way users won't need to worry about setting config options on the workers. Regardless, I suspect the implementation will be the same whether we pull the `pandas` or `dask` config option (see #9711 for an example).

The downside to supporting `pandas` config options is that we wouldn't support all the config options. We could explicitly document which ones we do support, and when, but it still might be a source of confusion.

Either way, I think this is a good question to ask. But I'm not too concerned because there is a smooth path in either direction. If we don't support the `pandas` option, then no changes are needed. If we do, then we can either update the default for the `dask` config value to pull in the current `pandas` option, or we deprecate the `dask` config value altogether.
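A minimal sketch of the "resolve on the client, embed in the graph" idea described above; the function names are illustrative rather than the PR's actual implementation, and `convert_dtypes(dtype_backend=...)` assumes a pandas 2.0 keyword:

```python
import dask
import pandas as pd
from dask import delayed


@delayed
def read_partition(path, dtype_backend):
    # Worker-side task: the backend arrives as a plain string that was
    # baked into the task graph on the client, so workers never need to
    # consult their own dask config.
    df = pd.read_parquet(path)
    if dtype_backend == "pyarrow":
        df = df.convert_dtypes(dtype_backend="pyarrow")  # pandas 2.0 keyword
    return df


def read_many(paths):
    # Resolve the config once, on the client, at graph-construction time.
    backend = dask.config.get("dataframe.nullable_backend", default="pandas")
    return [read_partition(p, backend) for p in paths]
```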