Make filesystem-backend configurable in `read_parquet` #9699

rjzamora · 2022-11-28T20:12:21Z

Basic implementation of the plan discussed in this comment.

This PR adds better documentation for read_parquet(..., dataset=), and allows the user to configure the filesystem-backend using a dataset option (as one would do with pyarrow.dataset.Dataset or fastparquet.ParquetFile). ~~In order to address #9619, the default "filesystem" argument is set to "arrow" for s3 storage (and "fsspec" otherwise).~~

Closes #xxxx
Tests added / passed
Passes pre-commit run --all-files

jrbourbeau

Thanks @rjzamora! Grokking the changes here still, but thought I'd leave a few initial comments

rjzamora · 2022-12-07T17:14:21Z

Just a note that dask_cudf will need to add an extract_filesystem method to override the "arrow" default, because cudf's optimized s3fs usage should typically make the "fsspec" route a bit more performant (and my local experiments seem to confirm this).
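For illustration, the override pattern described in this note could look roughly like the sketch below. Both classes are stand-ins written for this example: `BaseArrowEngine` is not dask's engine, `CudfLikeEngine` is not dask_cudf's, and the two-argument `extract_filesystem` signature is a simplification (the real method takes more arguments and also builds the filesystem object). The only idea taken from the note is that a downstream engine can override `extract_filesystem` to prefer "fsspec" when the user has not chosen a backend.

```python
class BaseArrowEngine:
    """Stand-in for an upstream engine class; NOT dask's implementation."""

    # Assumed upstream default, as proposed at this point in the discussion.
    default_filesystem = "arrow"

    @classmethod
    def extract_filesystem(cls, urlpath, filesystem=None):
        # Simplified: only resolve the backend name. The real method would
        # also construct the filesystem object and normalize the paths.
        return filesystem if filesystem is not None else cls.default_filesystem


class CudfLikeEngine(BaseArrowEngine):
    """Hypothetical downstream engine that prefers the fsspec route."""

    @classmethod
    def extract_filesystem(cls, urlpath, filesystem=None):
        # Only change the default; an explicit user choice is respected.
        if filesystem is None:
            filesystem = "fsspec"
        return super().extract_filesystem(urlpath, filesystem)
```

With this shape, a user who passes `filesystem="arrow"` explicitly still gets the arrow route; only the unset case differs between the two classes.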

rjzamora · 2022-12-14T17:00:31Z

UPDATE: This PR now makes it easy to opt in to using a pyarrow-based filesystem (with filesystem="arrow"), but it does not change the default behavior yet. I suggest we start with this partial fix to #9619, and address the question of defaults after we're sure filesystem="arrow" is behaving as expected/desired in the real world.
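A minimal sketch of the opt-in described above. The helper name `read_parquet_with_backend` and its validation step are assumptions made for this example and are not part of the PR; only the `filesystem="arrow"` and `filesystem="fsspec"` strings come from this thread.

```python
# Hypothetical helper (not part of dask). It only handles the two string
# backends discussed in this thread and forwards the choice to read_parquet.
VALID_BACKENDS = ("fsspec", "arrow")


def read_parquet_with_backend(path, backend="fsspec", **kwargs):
    """Read a parquet dataset, choosing the filesystem backend explicitly."""
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"backend must be one of {VALID_BACKENDS}, got {backend!r}"
        )
    # Imported lazily so the validation above works without dask installed.
    import dask.dataframe as dd

    return dd.read_parquet(path, filesystem=backend, **kwargs)
```

A call such as `read_parquet_with_backend("s3://ursa-labs-taxi-data/2009/01/data.parquet", backend="arrow")` would then take the pyarrow-based route, while omitting `backend` keeps the existing fsspec behavior, matching the decision not to change the default yet.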

jrbourbeau

Thanks @rjzamora. I tried the example from #9631 (comment) with this PR and filesystem="pyarrow" but it looks like we're having trouble with the filepath parsing when creating the arrow filesystem.

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet",
    split_row_groups=True,
    use_nullable_dtypes=True,
    filesystem="pyarrow",
)

gives

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/dask/backends.py", line 125, in wrapper
    return func(*args, **kwargs)
  File "/Users/james/projects/dask/dask/dask/dataframe/io/parquet/core.py", line 494, in read_parquet
    fs, paths, dataset_options, open_file_options = engine.extract_filesystem(
  File "/Users/james/projects/dask/dask/dask/dataframe/io/parquet/arrow.py", line 369, in extract_filesystem
    fs = type(pa_fs.FileSystem.from_uri(urlpath[0])[0])(
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    df = dd.read_parquet(
  File "/Users/james/projects/dask/dask/dask/backends.py", line 127, in wrapper
    raise type(e)(
pyarrow.lib.ArrowInvalid: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'

Any idea what might be going on here?

Should note that filesystem="fsspec" (the default) works as expected

rjzamora · 2022-12-14T21:45:48Z

> Any idea what might be going on here?

Looks like pyarrow is having trouble with the space in the path name. This works fine:

pa_fs.FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet")

but this does not:

pa_fs.FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")

jrbourbeau · 2022-12-14T21:46:48Z

Okay, looked into it a bit more and it looks like this is a parsing issue in pyarrow when there's a space in the URI (which seems unusual, but valid). I've opened https://issues.apache.org/jira/browse/ARROW-18436 upstream

EDIT: Jinx
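To make the difference between the two URIs concrete, here is a small standard-library sketch that detects a literal space in the path and produces the percent-encoded form. The helper names are made up for this example. Whether a given pyarrow version accepts the encoded URI is an assumption that is not verified here, so treat this as a way to inspect the input, not as a confirmed workaround.

```python
from urllib.parse import quote, urlsplit


def has_literal_space(uri):
    """Return True if the path component contains an unencoded space."""
    return " " in urlsplit(uri).path


def percent_encode_uri(uri):
    """Percent-encode characters (such as spaces) in a URI string."""
    # Keep ':' and '/' so the scheme separator and path slashes survive.
    return quote(uri, safe=":/")


ok = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
bad = "s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet"

print(has_literal_space(ok))    # False
print(has_literal_space(bad))   # True
print(percent_encode_uri(bad))  # s3://nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet
```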

jrbourbeau

Thanks for all your work here @rjzamora! Overall this looks good to me. I've left several small comments -- looking forward to getting this merged

jrbourbeau · 2022-12-15T18:23:39Z

dask/dataframe/io/parquet/arrow.py

+                if urlpath[0].startswith("C:") and isinstance(
+                    fs, pa_fs.LocalFileSystem
+                ):
+                    # ArrowFSWrapper._strip_protocol not reliable on windows
+                    from fsspec.implementations.local import LocalFileSystem
+
+                    fs_strip = LocalFileSystem()


Is there an upstream issue for this one?

rjzamora

I submitted this issue and linked it in a code comment: fsspec/filesystem_spec#1137

jrbourbeau

Thanks @rjzamora! Looking forward to folks taking this for a spin

expose filesystem option

9c07780

rjzamora added dataframe io parquet labels Nov 28, 2022

rjzamora mentioned this pull request Nov 28, 2022

Use pyarrow S3 file system at read time for arrow parquet engine #9669

Closed

3 tasks

rjzamora added 2 commits December 5, 2022 12:12

move argument under dataset and add exception handling

61d0aab

docstring update

6733668

rjzamora marked this pull request as ready for review December 5, 2022 23:22

rjzamora changed the title ~~[WIP] Expose filesystem option in read_parquet~~ Make filesystem-backend configurable in read_parquet Dec 5, 2022

remove problematic import

ce32dca

rjzamora commented Dec 6, 2022

jrbourbeau reviewed Dec 6, 2022

address code review

bb3a9c7

rjzamora added 2 commits December 14, 2022 08:50

roll back default filesystem change to arrow for now

e8fd1ae

Merge remote-tracking branch 'upstream/main' into filesystem-option

995ff05

address windows issue

45b5493

jrbourbeau reviewed Dec 14, 2022

jrbourbeau mentioned this pull request Dec 15, 2022

Release 2022.12.1 dask/community#297

Closed

8 tasks

jrbourbeau reviewed Dec 15, 2022

address code review

1304900

rjzamora mentioned this pull request Dec 15, 2022

ArrowFSWrapper._strip_protocol differs from pure fsspec implementation fsspec/filesystem_spec#1137

Open

add comment on windows issue

0b2d037

jrbourbeau changed the title ~~Make filesystem-backend configurable in read_parquet~~ Make filesystem-backend configurable in `read_parquet` Dec 15, 2022

jrbourbeau approved these changes Dec 15, 2022

jrbourbeau merged commit d943293 into dask:main Dec 15, 2022

rjzamora deleted the filesystem-option branch December 15, 2022 20:16

jorisvandenbossche mentioned this pull request Mar 20, 2023

GeoArrowEngine error when reading Parquet files geopandas/dask-geopandas#241

Open

jrbourbeau mentioned this pull request Apr 17, 2023

read parquet from s3 failing with 'GeoArrowEngine' has no attribute 'extract_filesystem' geopandas/dask-geopandas#250

Open
