Add `dataframe.nullable_dtypes` configuration option #9874
class _NoDefault(Enum):
    no_default = ...


no_default: Final = _NoDefault.no_default
NoDefault = Literal[_NoDefault.no_default]
This change was needed to make `mypy` happy with the new default value of `use_nullable_dtypes=` in `read_parquet`. cc @crusaderky in case you think there's a better way to do this
This is indeed the canonical way of doing it (and yes it's ugly)
Revisiting the example in #9631, we can now use just configuration options to get all `pyarrow`-backed dtypes:

In [1]: import dask
In [2]: import dask.dataframe as dd
In [3]: dask.config.set({"dataframe.dtype_backend": "pyarrow", "dataframe.nullable_dtypes": True})
Out[3]: <dask.config.set at 0x102dadb70>
In [4]: df = dd.read_parquet("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet", split_row_groups=True)
In [5]: df.dtypes
Out[5]:
hvfhs_license_num              string[pyarrow]
dispatching_base_num           string[pyarrow]
originating_base_num           string[pyarrow]
request_datetime        timestamp[us][pyarrow]
on_scene_datetime       timestamp[us][pyarrow]
pickup_datetime         timestamp[us][pyarrow]
dropoff_datetime        timestamp[us][pyarrow]
PULocationID                    int64[pyarrow]
DOLocationID                    int64[pyarrow]
trip_miles                     double[pyarrow]
trip_time                       int64[pyarrow]
base_passenger_fare            double[pyarrow]
tolls                          double[pyarrow]
bcf                            double[pyarrow]
sales_tax                      double[pyarrow]
congestion_surcharge           double[pyarrow]
airport_fee                    double[pyarrow]
tips                           double[pyarrow]
driver_pay                     double[pyarrow]
shared_request_flag            string[pyarrow]
shared_match_flag              string[pyarrow]
access_a_ride_flag             string[pyarrow]
wav_request_flag               string[pyarrow]
wav_match_flag                 string[pyarrow]
dtype: object
Awesome! Should also make switching the default easier at some point in the future.
This PR adds a new `dataframe.nullable_dtypes` configuration option, similar to what `pandas` is doing in pandas-dev/pandas#50748. This should both improve alignment between `pandas` and `dask.dataframe` and make it easier for folks to opt into using `pyarrow`-backed dtypes.

cc @phofl for visibility