
Make pyarrow strings easy to use #9946

Open
3 of 6 tasks
jrbourbeau opened this issue Feb 13, 2023 · 2 comments
Labels
dataframe, feature (Something is missing), io, needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)

Comments

@jrbourbeau
Member

jrbourbeau commented Feb 13, 2023

This is similar to #9879, but smaller in scope.

Motivation

We've seen several cases where using pyarrow strings for text data yields significant memory usage and computational performance improvements (xref #9631, dask/community#301). We should make it easy for users to utilize this performant data type.

Proposal

I propose we add a config option users can set to automatically convert any object and string[python] data encountered to string[pyarrow]. We'll want this to work with all methods for creating dask DataFrames. That is, things like the following

import dask
import dask.dataframe as dd

# Tell dask to use `pyarrow`-strings for object dtypes
dask.config.set({"dataframe.object_as_pyarrow_string": True})  # Suggestions for a better name are welcome! 

df = dd.read_parquet(...)
df = dd.read_csv(...)
df = dd.from_pandas(...)
df = dd.from_delayed(...)
...

should all return dask DataFrames that use string[pyarrow] appropriately.

For some methods, like read_parquet, we'll want specialized implementations, since they can efficiently read data directly into string[pyarrow]. In cases where a specialized method isn't implemented, we should still automatically cast the dask DataFrame to string[pyarrow] when the config option is set, for example through a map_partitions call after our existing DataFrame creation logic.
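To make the fallback concrete, here is one possible shape for the per-partition cast (a sketch, not dask's actual implementation; the helper name is hypothetical, and the default dtype here is the python-backed "string" so the example runs even without pyarrow installed — the real fallback would pass "string[pyarrow]"):

```python
import pandas as pd


def cast_object_to_string(df: pd.DataFrame, dtype: str = "string") -> pd.DataFrame:
    """Cast object and string[python] columns to the target string dtype.

    Dask could apply a function like this to every partition, e.g. via
    ``ddf.map_partitions(cast_object_to_string)``, whenever the config
    option is set and no specialized reader path exists.
    """
    # Collect only the columns that currently hold text-like data.
    target = {
        col: dtype
        for col, col_dtype in df.dtypes.items()
        if col_dtype == object or isinstance(col_dtype, pd.StringDtype)
    }
    # Leave the DataFrame untouched if nothing needs casting.
    return df.astype(target) if target else df
```

Because the function only touches object and string columns, numeric and datetime columns pass through unchanged, which keeps the cast cheap on mostly-numeric frames.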

Steps

Steps that I think make sense here are:

Notes

See #9926 where I'm taking an initial pass at adding the config option.

cc @rjzamora @quasiben @j-bennet @phofl for visibility

@martindurant
Member

Some thoughts on this:

  • do we want dask[dataframe] (or just dask in conda) to depend on pyarrow? If not, having string[pyarrow] as default would require annoying code to work around the possibility of it not being installed
  • I recommend against changing dtypes that were in a dataframe supplied by the user, e.g., from_pandas, from_delayed. Maybe they have a good reason for their choice and would end up transforming back
  • we should carefully check the API coverage of string[pyarrow]; my impression is that most things are vectorized, but are there some that still coerce back to python strings for operations? What about things that don't map to simple types, e.g., split()?
  • fastparquet does not produce string[pyarrow], since one of its main selling points is its smaller install footprint
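The optional-dependency concern in the first bullet could be handled with a small import guard; a sketch with hypothetical names (`default_string_dtype` is not an actual dask function):

```python
import importlib.util

# Detect pyarrow without importing it, so dask[dataframe] would not need
# a hard pyarrow dependency.
PYARROW_AVAILABLE = importlib.util.find_spec("pyarrow") is not None


def default_string_dtype() -> str:
    """Return the dtype name to use for text columns.

    Prefer the pyarrow-backed string dtype when pyarrow is importable,
    and fall back to plain object dtype otherwise.
    """
    return "string[pyarrow]" if PYARROW_AVAILABLE else "object"
```

A guard like this keeps the config option safe to enable unconditionally, at the cost of the "annoying code" the bullet mentions spreading to every place that picks a string dtype.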

@j-bennet
Contributor

Follow-up issue to fix CI with arrow strings: #10029.

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Mar 11, 2024