Make pyarrow strings easy to use #9946
Labels: dataframe, feature, io, needs attention
This is similar to #9879, but smaller in scope.
Motivation
We've seen several cases where using `pyarrow` strings for text data yields significant memory usage and computation performance improvements (xref #9631, dask/community#301). We should make it easy for users to utilize this performant data type.

Proposal
I propose we add a config option users can set to automatically convert `object` and `string[python]` data to `string[pyarrow]` wherever it's encountered. We'll want this to work with all methods for creating dask `DataFrame`s. That is, all of these creation methods should return dask `DataFrame`s that use `string[pyarrow]` appropriately.

For some methods, like `read_parquet`, we'll want a specialized implementation, as they'll be able to efficiently read data directly into `string[pyarrow]`. However, in cases where a specialized method isn't implemented, we should still automatically cast the dask `DataFrame` to use `string[pyarrow]` when the config option is set, for example through a `map_partitions` call after our existing `DataFrame` creation logic.

Steps
Steps that I think make sense here are:

- … `string[pyarrow]` dtype where appropriate (see "Add option for converting string data to use `pyarrow` strings", #9926)
- … `dd.read_parquet` (see "Efficient `dataframe.convert_string` support for `read_parquet`", #9979)
- … `pyarrow` strings turned on (#10017)
- … `string[pyarrow]` (e.g. emit a performance warning when using text data without `string[pyarrow]`, turn the config option on by default, etc.)

Notes
See #9926 where I'm taking an initial pass at adding the config option.
cc @rjzamora @quasiben @j-bennet @phofl for visibility