Use pyarrow S3 file system at read time for arrow parquet engine #9669
Conversation
dask/dataframe/io/parquet/arrow.py (Outdated)

```python
    **kwargs,
):
    ...
    # Optimize open_file_func for remote filesystems
    cls._update_open_file_func(fs, storage_options, kwargs)
```
Could we make this something more like:

```python
kwargs = cls._update_open_file_func(fs, storage_options, kwargs)
```

or (pseudocode below):

```python
if "open_file_options" not in kwargs:
    open_file_options = cls._get_open_file_options(fs, storage_options, kwargs)
    if open_file_options:
        kwargs["open_file_options"] = open_file_options
```

Where possible I'd like to avoid methods that mutate things in place, as it can make it more difficult to reason about the logic at play. Being more explicit about where things are set is, I think, a good thing.
> Where possible I'd like to avoid methods that mutate things in place, as it can make it more difficult to reason about the logic at play. Being more explicit about where things are set is, I think, a good thing.

I completely agree that code should be functional whenever possible. Sorry, you were so quick to review that I didn't get a chance to clean this up.
dask/dataframe/io/parquet/arrow.py (Outdated)

```python
    Currently supports ``s3fs`` -> ``pyarrow.fs.S3FileSystem``.
    """
    is_s3fs = type(fs).__module__.split(".")[0] == "s3fs"
```
As is, I think we could use `dask.utils.typename(...)` here to make things more succinct. That said, I wonder if we could use a more direct check using `isinstance(fs, fsspec.AbstractFileSystem)` and `fs.protocol` (or something along these lines).
Yeah, this line is pretty gross. If we can't boil this down to a direct instance check, then I like the idea of using a well-defined utility.
> It also appears that …

Unfortunately not :/ The first link is already what we are doing to create pyarrow Dataset objects (the backend filesystem is still python-based fsspec). The second link (…)
```python
    **kwargs,
):
    ...
    # Set default open_file_options for remote filesystems
    kwargs["open_file_options"] = cls._default_open_file_options(
```
`_get_open_file_options`

Maybe this is dumb, but rather than build this convention, could we do something like:

```python
if engine == "pyarrow" and filename.startswith("s3://") and not storage_options:
    open_file_options[...] = ...
```

Building a convention like `_get_open_file_options` for this seems a bit heavyweight to me.

(Please feel free to entirely ignore this comment. I also may not ever respond. Please do not block on me.)
Thanks for taking this on @rjzamora
Checking in here -- @rjzamora the code changes here generally look good and seem pretty straightforward. What additional changes / benchmarking do you think need to be done (if any)?

Also, could we add a test for the functionality added here? It'd be good to confirm that we swap between `s3fs` / `pyarrow` S3 filesystems under the expected conditions.
```python
@@ -762,6 +768,30 @@ def write_metadata(cls, parts, meta, fs, path, append=False, **kwargs):
    # Private Class Methods
    #

    @classmethod
    def _default_open_file_options(cls, fs, storage_options, input_options):
```
Sidenote: I agree with Matt that it'd be nice to inline this logic, since it's relatively straightforward and only used in one place. However, my guess is inlining this logic will make it more difficult to test. If that's the case, then I think testability should take priority.

EDIT: moving to a small, private utility function seems lighter weight than a `@classmethod` and should make it easier to test.
@martindurant - Do you think we should be able to use …
fsspec/filesystem_spec#1113 is what I was referring to. It ought to be fine to pass various connection strings to arrow, but I don't really know what their API expects. Assuming this is fine, the rest of the interface should work as normal. Note the options in `_open` (https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/arrow.py#L149), where presumably we always need …
@martindurant I didn't quite follow #9669 (comment) -- do you know if it's possible to use `ArrowFSWrapper` to get an `fsspec`-compatible object that still gets the performance benefits of using `pyarrow`'s S3 filesystem? @rjzamora mentioned offline that he tried this but ran into some issues (I don't have any details here -- perhaps Rick could fill them in).
Regardless, I think we all view this as a temporary optimization. Long-term, it'd be great to have `s3fs` be just as performant and rip out the changes in this PR. However, given that it looks like users will see a sizeable performance improvement, and the code changes here are relatively straightforward / isolated, I think it makes sense to include these changes until we can make upstream changes to `s3fs`.
I wish to raise the following caveats: …
Thanks @martindurant

> things that appear to work on the client will maybe not work on workers

Hmm, I don't follow this point. Could you give an example fail case?

> the change does not appear in any higher documentation

That's a fair point. @rjzamora thoughts on adding a note about this in the docstring?

> the configuration of s3fs and arrow's fs are not the same, they require different kwargs. Here we have dealt with exactly one: anonymous access.

This seems okay to me. `anon`/`anonymous` seems somehow special as it's very commonly used -- I think special-casing that one keyword is reasonable. Other than that, `s3fs.S3FileSystem` and `pyarrow.S3FileSystem` share no arguments, and since we handle falling back to `s3fs` gracefully, any existing code that specifies `s3fs`-specific storage options will continue to use `s3fs.S3FileSystem`. I don't anticipate any change in behavior/performance when a non-trivial `storage_options` is provided.

We could make this more explicit by saying "if `storage_options` has anything other than `anon`/`anonymous`, use `s3fs`". @rjzamora mentioned the current approach would make it easier for folks using `pyarrow.S3FileSystem` outside of Dask to start using Dask. This seems fine, but not super motivated. I'd be happy adding a more explicit `storage_options` check if that's what you would prefer.
> all other config that might be passed in storage_options are lost, so any data requiring, for example, profile= will not be usable at all, which is a big regression

I think this also falls under the point above.

> there is no way not to do this

Users could specify an `open_file_options` value that uses `s3fs.S3FileSystem`. I fully admit that's not a very obvious workaround, but I wanted to highlight that there is a way to override the new default behavior proposed here.
> Things might magically work if auth can be automatically determined without any kwargs, but boto and arrow probably don't have the same try order for the various config options either (e.g., .boto3 file will only be used by boto/s3fs).

That's a fair point that hadn't occurred to me. I wonder (1) how common this is, and (2) what we can do in those cases. Maybe gracefully fall back to `s3fs` if the initial metadata can't be read? @rjzamora I'd be curious about your thoughts here.
> The user does …

There are many ways to configure botocore sessions, not just kwargs (same as the last point above). It is super common in some circumstances, e.g., each user in a jhub, and probably any situation where the user does not own the compute instance they are on. Note that for non-local reads, e.g., from my laptop, I already showed that s3fs is just as fast as pyarrow, saturating the bandwidth, so there is no need for the change in that case. This is probably the biggest user base (but not in enterprise).
Here's a data point. I did this after seeing mixed results in some of my tests. The following is 10 runs of Query 1 on a 50GB dataset from the H2O benchmarks, with this branch vs …

Details: These are 10 runs of Query 1 for each environment. We compare 10 runs on the same cluster, calling …
@hayesgb, you are showing wall time, correct?
@hayesgb can I ask you to dig into this and see what's actually happening? I would look at profiling, task streams, etc. I'm also curious about the difference from the benchmarks we've seen above, which seem inconsistent with what you're presenting here.
I am having qualms about merging this PR for two reasons:
One possible "middle-ground" solution is to make it easier for the user to define/pass in the desired filesystem themselves:

```python
import dask.dataframe as dd
import pyarrow.fs as pa_fs

fs = pa_fs.S3FileSystem(...)

# OR MAYBE allow {"arrow", "fsspec", <fs-object>}
ddf = dd.read_parquet("s3://...", filesystem=fs)
```

The "arrow" engine could simply add non-default logic to wrap a Pyarrow-based filesystem with `ArrowFSWrapper` (and the default logic would raise an error for anything other than an fsspec filesystem implementation). The obvious drawback to this approach is that it doesn't address the original goal of automatically using a Pyarrow-based filesystem for s3. With that said, it does provide a more intuitive mechanism for overriding the default behavior. Also, if we do find that …
@martindurant -- Yes. Sorry that wasn't clear. @mrocklin -- Of course. I've done some digging already, and planned to dig deeper.
I would like for us to find a solution where the default behavior of reading Parquet from S3 is close to hardware speeds. I don't think that this should require special configuration.
Yeah, I agree that we need good default behavior here. However, I'm just pointing out that this PR feels like a footgun to me. I think it makes sense to enable the use of an optimized filesystem object, but my gut tells me that we should be very transparent about it. For example, #9699 is still very rough, but it proposes a mechanism to use a Pyarrow-based filesystem for metadata parsing and IO (in a way that may be easier to document).
Footguns sound bad. Do you have suggestions for how to make the default behavior fast? Does #9699 bring us towards that outcome?
Additional work on this topic makes a pretty solid case for some version of this PR being implemented. Shoutout to @gjoseph92 for his help with profiling, and some digging to find that …

Summary

All of these runs are with "Query1" from a 50GB H2O-benchmark dataset. We run on 3 different types of EC2 instances, as follows: …

All of these machines have a maximum network bandwidth of "up to 12.5Gbps". Networking performance of these instances is burstable. We have between 10 and 30 repeats, depending on the run. We're comparing s3fs + dask==2022.11.1 vs this PR.

pyspy profile from a single worker of the 10-worker cluster: …

A few interesting things about this: …
For comparison, below is the `pyarrow.S3FileSystem` profile.
This plot is from the run with …

Run-to-run variance -- when we run multiple runs on the same cluster with restarts in between, wall-clock times get less jittery. Presumably S3 is doing some caching of data, but the first 1-2 runs can be pretty variable in their run times.
@martindurant, @jrbourbeau and I had an offline discussion about this work. Rough summary of this discussion: …

Remaining Uncertainty: …

Tentative Plan: …

(James and Martin: feel free to make corrections.)
Thanks for writing up the summary …
Closing this in favor of #9699 (but will re-open if necessary).
As discussed in #9619, there seem to be some performance benefits of using `pyarrow.fs.S3FileSystem` to open s3-based Parquet files for data ingest. This PR proposes a simple change to the default `open_file_options` contents for the "arrow" `read_parquet` engine.

- `pre-commit run --all-files`