Add dataframe.nullable_dtypes configuration option #9874

Open
wants to merge 2 commits into
base: main

jrbourbeau
Member

This PR adds a new dataframe.nullable_dtypes configuration option, similar to what pandas is doing in pandas-dev/pandas#50748. This should both improve alignment between pandas and dask.dataframe and make it easier for folks to opt into pyarrow-backed dtypes.

cc @phofl for visibility

Comment on lines +101 to +106
class _NoDefault(Enum):
    no_default = ...


no_default: Final = _NoDefault.no_default
NoDefault = Literal[_NoDefault.no_default]
Member Author

This change was needed to make mypy happy with the new default value of use_nullable_dtypes= in read_parquet. cc @crusaderky in case you think there's a better way to do this

Collaborator

This is indeed the canonical way of doing it (and yes it's ugly)

@jrbourbeau
Member Author

Revisiting the example in #9631, we can now use just configuration options to get all pyarrow-backed dtypes:

In [1]: import dask

In [2]: import dask.dataframe as dd

In [3]: dask.config.set({"dataframe.dtype_backend": "pyarrow", "dataframe.nullable_dtypes": True})
Out[3]: <dask.config.set at 0x102dadb70>

In [4]: df = dd.read_parquet("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet", split_row_groups=True)

In [5]: df.dtypes
Out[5]:
hvfhs_license_num              string[pyarrow]
dispatching_base_num           string[pyarrow]
originating_base_num           string[pyarrow]
request_datetime        timestamp[us][pyarrow]
on_scene_datetime       timestamp[us][pyarrow]
pickup_datetime         timestamp[us][pyarrow]
dropoff_datetime        timestamp[us][pyarrow]
PULocationID                    int64[pyarrow]
DOLocationID                    int64[pyarrow]
trip_miles                     double[pyarrow]
trip_time                       int64[pyarrow]
base_passenger_fare            double[pyarrow]
tolls                          double[pyarrow]
bcf                            double[pyarrow]
sales_tax                      double[pyarrow]
congestion_surcharge           double[pyarrow]
airport_fee                    double[pyarrow]
tips                           double[pyarrow]
driver_pay                     double[pyarrow]
shared_request_flag            string[pyarrow]
shared_match_flag              string[pyarrow]
access_a_ride_flag             string[pyarrow]
wav_request_flag               string[pyarrow]
wav_match_flag                 string[pyarrow]
dtype: object

@phofl
Collaborator

phofl commented Jan 26, 2023

> Revisiting the example in #9631, we can now use just configuration options to get all pyarrow-backed dtypes:

Awesome! Should also make switching the default easier at some point in the future.

@jrbourbeau jrbourbeau mentioned this pull request Jan 26, 2023
4 tasks
@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Mar 25, 2024
Labels
dataframe io needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer.