Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make filesystem-backend configurable in read_parquet #9699

Merged
merged 10 commits into from Dec 15, 2022

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Nov 28, 2022

Basic implementation of the plan discussed in this comment.

This PR adds better documentation for read_parquet(..., dataset=), and allows the user to configure the filesystem-backend using a dataset option (as one would do with pyarrow.dataset.Dataset or fastparquet.ParquetFile). In order to address #9619, the default "filesystem" argument is set to "arrow" for s3 storage (and "fsspec" otherwise).

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

@rjzamora rjzamora marked this pull request as ready for review December 5, 2022 23:22
@rjzamora rjzamora changed the title [WIP] Expose filesystem option in read_parquet Make filesystem-backend configurable in read_parquet Dec 5, 2022
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora! Grokking the changes here still, but thought I'd leave a few initial comments

dask/dataframe/io/parquet/utils.py Outdated Show resolved Hide resolved
dask/dataframe/io/parquet/utils.py Outdated Show resolved Hide resolved
dask/dataframe/io/parquet/arrow.py Outdated Show resolved Hide resolved
dask/dataframe/io/parquet/arrow.py Outdated Show resolved Hide resolved
@rjzamora
Copy link
Member Author

rjzamora commented Dec 7, 2022

Just a note that dask_cudf will need to add an extract_filesystem method to override the "arrow" default, because cudf's optimized s3fs usage should typically make the "fsspec" route a bit more performant (and my local experiments seem to confirm this).

@rjzamora
Copy link
Member Author

rjzamora commented Dec 14, 2022

UPDATE: This PR now makes it easy to opt in to using a pyarrow-based filesystem (with filesystem="arrow"), but it does not change the default behavior yet. I suggest we start with this partial fix to #9619, and address the question of defaults after we're sure filesystem="arrow" is behaving as expected/desired in the real world.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora. I tried the example from #9631 (comment) with this PR and filesystem="pyarrow" but it looks like we're having trouble with the filepath parsing when creating the arrow filesystem.

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet",
    split_row_groups=True,
    use_nullable_dtypes=True,
    filesystem="pyarrow",
)

gives

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/dask/backends.py", line 125, in wrapper
    return func(*args, **kwargs)
  File "/Users/james/projects/dask/dask/dask/dataframe/io/parquet/core.py", line 494, in read_parquet
    fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
  File "/Users/james/projects/dask/dask/dask/dataframe/io/parquet/arrow.py", line 369, in extract_filesystem
    fs = type(pa_fs.FileSystem.from_uri(urlpath[0])[0])(
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    df = dd.read_parquet(
  File "/Users/james/projects/dask/dask/dask/backends.py", line 127, in wrapper
    raise type(e)(
pyarrow.lib.ArrowInvalid: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'

Any idea what might be going on here?

Should note that filesystem="fsspec" (the default) works as expected

@rjzamora
Copy link
Member Author

Any idea what might be going on here?

Looks like pyarrow is having trouble with the space in the path name. This works fine:

pa_fs.FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet")

but this does not:

pa_fs.FileSystem.from_uri("s3//nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")

@jrbourbeau
Copy link
Member

jrbourbeau commented Dec 14, 2022

Okay, looked into it a bit more and it looks like this is a parsing issue in pyarrow when there's a space in the URI (which seems unusual, but valid). I've opened https://issues.apache.org/jira/browse/ARROW-18436 upstream

EDIT: Jinx

@jrbourbeau jrbourbeau mentioned this pull request Dec 15, 2022
8 tasks
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work here @rjzamora! Overall this looks good to me. I've left several small comments -- looking forward to getting this merged

dask/dataframe/io/parquet/utils.py Outdated Show resolved Hide resolved
dask/dataframe/io/parquet/utils.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/io/tests/test_parquet.py Outdated Show resolved Hide resolved
dask/dataframe/io/parquet/utils.py Show resolved Hide resolved
dask/dataframe/io/parquet/arrow.py Outdated Show resolved Hide resolved
Comment on lines 375 to 381
if urlpath[0].startswith("C:") and isinstance(
fs, pa_fs.LocalFileSystem
):
# ArrowFSWrapper._strip_protocol not reliable on windows
from fsspec.implementations.local import LocalFileSystem

fs_strip = LocalFileSystem()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an upstream issue for this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted this issue and linked in a comment: fsspec/filesystem_spec#1137

@jrbourbeau jrbourbeau changed the title Make filesystem-backend configurable in read_parquet Make filesystem-backend configurable in read_parquet Dec 15, 2022
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rjzamora! Looking forward to folks taking this for a spin

@jrbourbeau jrbourbeau merged commit d943293 into dask:main Dec 15, 2022
@rjzamora rjzamora deleted the filesystem-option branch December 15, 2022 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants