-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use pyarrow
S3 file system at read time for arrow parquet engine
#9669
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -335,9 +335,13 @@ def read_metadata( | |
ignore_metadata_file=False, | ||
metadata_task_size=0, | ||
parquet_file_extension=None, | ||
storage_options=None, | ||
**kwargs, | ||
): | ||
|
||
# Optimize open_file_func for remote filesystems | ||
cls._update_open_file_func(fs, storage_options, kwargs) | ||
|
||
# Stage 1: Collect general dataset information | ||
dataset_info = cls._collect_dataset_info( | ||
paths, | ||
|
@@ -762,6 +766,28 @@ def write_metadata(cls, parts, meta, fs, path, append=False, **kwargs): | |
# Private Class Methods | ||
# | ||
|
||
@classmethod | ||
def _update_open_file_func(cls, fs, storage_options, kwargs): | ||
"""Update kwargs if we can use a native pyarrow filesystem | ||
|
||
Currently supports ``s3fs`` -> ``pyarrow.fs.S3FileSystem``. | ||
""" | ||
is_s3fs = type(fs).__module__.split(".")[0] == "s3fs" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As is, I think we could use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, this line is pretty gross. If we cant boil this down to a direct instance check, then I like the idea of using a well-defined utility. |
||
if is_s3fs and not kwargs.get("open_file_options", {}): | ||
pa_option_map = {"anon": "anonymous"} | ||
try: | ||
from pyarrow import fs as pa_fs | ||
|
||
pa_options = { | ||
pa_option_map.get(k): v for k, v in storage_options.items() | ||
} | ||
_fs = pa_fs.S3FileSystem(**pa_options) | ||
kwargs["open_file_options"] = {"open_file_func": _fs.open_input_file} | ||
except KeyError: | ||
# Could not map one or more ``storage_options`` | ||
# keys to ``S3FileSystem`` options | ||
pass | ||
|
||
@classmethod | ||
def _collect_dataset_info( | ||
cls, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this something more like:
or (psuedocode below):
Where possible I'd like to avoid methods that mutate things inplace as it can make it more difficult to reason about the logic at play. Being more explicit about where things are set is, I think, a good thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree that code should be functional whenever possible. Sorry, you were so quick to review that I didn't get a chance to clean this up