Make filesystem-backend configurable in read_parquet
#9699
Conversation
Thanks @rjzamora! Grokking the changes here still, but thought I'd leave a few initial comments
Just a note that dask_cudf will need to add an
UPDATE: This PR now makes it easy to opt in to using a pyarrow-based filesystem (with
Thanks @rjzamora. I tried the example from #9631 (comment) with this PR and filesystem="pyarrow", but it looks like we're having trouble with the filepath parsing when creating the arrow filesystem.
import dask.dataframe as dd
df = dd.read_parquet(
"s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet",
split_row_groups=True,
use_nullable_dtypes=True,
filesystem="pyarrow",
)
gives
Traceback (most recent call last):
File "/Users/james/projects/dask/dask/dask/backends.py", line 125, in wrapper
return func(*args, **kwargs)
File "/Users/james/projects/dask/dask/dask/dataframe/io/parquet/core.py", line 494, in read_parquet
fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
File "/Users/james/projects/dask/dask/dask/dataframe/io/parquet/arrow.py", line 369, in extract_filesystem
fs = type(pa_fs.FileSystem.from_uri(urlpath[0])[0])(
File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
df = dd.read_parquet(
File "/Users/james/projects/dask/dask/dask/backends.py", line 127, in wrapper
raise type(e)(
pyarrow.lib.ArrowInvalid: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'
Any idea what might be going on here? Should note that filesystem="fsspec" (the default) works as expected.
Looks like pyarrow is having trouble with the space in the path name. This works fine: pa_fs.FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") but this does not: pa_fs.FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
Okay, looked into it a bit more and it looks like this is a parsing issue in
EDIT: Jinx
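For illustration, a minimal stdlib-only sketch of one possible workaround: percent-encoding the space before handing the URI to pyarrow. Whether the encoded form is accepted by FileSystem.from_uri is an assumption here; this only shows the encoding step itself.

```python
from urllib.parse import quote

# Percent-encode the space (" " -> "%20") while leaving the scheme and
# path separators intact, so the string is a parseable URI.
uri = "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet"
encoded = quote(uri, safe=":/")
print(encoded)  # s3://nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet
```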
Thanks for all your work here @rjzamora! Overall this looks good to me. I've left several small comments -- looking forward to getting this merged
dask/dataframe/io/parquet/arrow.py
Outdated
if urlpath[0].startswith("C:") and isinstance(
    fs, pa_fs.LocalFileSystem
):
    # ArrowFSWrapper._strip_protocol not reliable on windows
    from fsspec.implementations.local import LocalFileSystem

    fs_strip = LocalFileSystem()
Is there an upstream issue for this one?
I submitted this issue and linked in a comment: fsspec/filesystem_spec#1137
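To make the guard above concrete, here is a small sketch (not dask's actual code; the helper name and example paths are hypothetical) of the condition the diff checks before falling back to fsspec's local filesystem:

```python
# Hypothetical helper mirroring the guard in the diff above: Windows
# drive-letter paths ("C:...") are where ArrowFSWrapper._strip_protocol
# is unreliable, so those fall back to fsspec's LocalFileSystem.
def needs_local_strip(urlpath: str) -> bool:
    return urlpath.startswith("C:")

print(needs_local_strip("C:\\Users\\me\\data.parquet"))  # True
print(needs_local_strip("s3://bucket/key.parquet"))      # False
```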
Thanks @rjzamora! Looking forward to folks taking this for a spin
Basic implementation of the plan discussed in this comment.
This PR adds better documentation for read_parquet(..., dataset=), and allows the user to configure the filesystem backend using a dataset option (as one would do with pyarrow.dataset.Dataset or fastparquet.ParquetFile). In order to address #9619, the default "filesystem" argument is set to "arrow" for s3 storage (and "fsspec" otherwise).
pre-commit run --all-files