
Make pyarrow strings easy to use #9946

Open
3 of 6 tasks
jrbourbeau opened this issue Feb 13, 2023 · 2 comments
Labels
dataframe, feature (Something is missing), io, needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)

Comments

@jrbourbeau
Member

jrbourbeau commented Feb 13, 2023

This is similar to #9879, but smaller in scope.

Motivation

We've seen several cases where using pyarrow strings for text data yields significant memory usage and computational performance improvements (xref #9631, dask/community#301). We should make it easy for users to utilize this performant data type.

Proposal

I propose we add a config option users can set to automatically convert any object and string[python] data encountered to string[pyarrow]. We'll want this to work with all methods for creating dask DataFrames. That is, things like the following

import dask
import dask.dataframe as dd

# Tell dask to use `pyarrow`-strings for object dtypes
dask.config.set({"dataframe.object_as_pyarrow_string": True})  # Suggestions for a better name are welcome! 

df = dd.read_parquet(...)
df = dd.read_csv(...)
df = dd.from_pandas(...)
df = dd.from_delayed(...)
...

should all return dask DataFrames that use string[pyarrow] appropriately.

For some methods, like read_parquet, we'll want specialized implementations, since they can efficiently read data directly into string[pyarrow]. In cases where a specialized method isn't implemented, we should still automatically cast the dask DataFrame to string[pyarrow] when the config option is set, for example through a map_partitions call after our existing DataFrame creation logic.
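To make the fallback concrete, here is one possible shape for the per-partition cast (a sketch, not dask's actual implementation; the helper name is hypothetical, and the default dtype here is the python-backed "string" so the example runs even without pyarrow installed — the real fallback would pass "string[pyarrow]"):

```python
import pandas as pd


def cast_object_to_string(df: pd.DataFrame, dtype: str = "string") -> pd.DataFrame:
    """Cast object and string[python] columns to the target string dtype.

    Dask could apply a function like this to every partition, e.g. via
    ``ddf.map_partitions(cast_object_to_string)``, whenever the config
    option is set and no specialized reader path exists.
    """
    # Collect only the columns that currently hold text-like data.
    target = {
        col: dtype
        for col, col_dtype in df.dtypes.items()
        if col_dtype == object or isinstance(col_dtype, pd.StringDtype)
    }
    # Leave the DataFrame untouched if nothing needs casting.
    return df.astype(target) if target else df
```

Because the function only touches object and string columns, numeric and datetime columns pass through unchanged, which keeps the cast cheap on mostly-numeric frames.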

Steps

Steps that I think make sense here are:

Notes

See #9926 where I'm taking an initial pass at adding the config option.

cc @rjzamora @quasiben @j-bennet @phofl for visibility

@martindurant
Member

Some thoughts on this:

  • do we want dask[dataframe] (or just dask in conda) to depend on pyarrow? If not, having string[pyarrow] as default would require annoying code to work around the possibility of it not being installed
  • I recommend against changing dtypes that were in a dataframe supplied by the user, e.g., from_pandas, from_delayed. Maybe they have a good reason for their choice and would end up transforming back
  • we should carefully check the API coverage of string[pyarrow]; my impression is that most things are vectorized, but are there some that still coerce back to python strings for operations? What about things that don't map to simple types, e.g., split()?
  • fastparquet does not produce string[pyarrow], since one of its main selling points is its smaller install footprint
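The optional-dependency concern in the first bullet could be handled with a small import guard; a sketch with hypothetical names (`default_string_dtype` is not an actual dask function):

```python
import importlib.util

# Detect pyarrow without importing it, so dask[dataframe] would not need
# a hard pyarrow dependency.
PYARROW_AVAILABLE = importlib.util.find_spec("pyarrow") is not None


def default_string_dtype() -> str:
    """Return the dtype name to use for text columns.

    Prefer the pyarrow-backed string dtype when pyarrow is importable,
    and fall back to plain object dtype otherwise.
    """
    return "string[pyarrow]" if PYARROW_AVAILABLE else "object"
```

A guard like this keeps the config option safe to enable unconditionally, at the cost of the "annoying code" the bullet mentions spreading to every place that picks a string dtype.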

@j-bennet
Contributor

Follow-up issue to fix CI with arrow strings: #10029.

@github-actions github-actions bot added the needs attention It's been a while since this was pushed on. Needs attention from the owner or a maintainer. label Mar 11, 2024