
Opening lots of files can be slow #816

Open
jrbourbeau opened this issue Oct 26, 2023 · 2 comments

Comments

@jrbourbeau
Contributor

When I open a file on S3 like this:

import fsspec

fs = fsspec.filesystem('s3', anon=True)
path = "coiled-datasets/uber-lyft-tlc/part.93.parquet"
fs.open(path, mode="rb")

The fs.open call often takes ~0.5-1.5 seconds to run. Here's a snakeviz profile (again, just of the fs.open call) where it looks like most time is spent in a details call that hits S3:

[snakeviz profile screenshot: fs.open without a known size]

I think this is mostly to get the file size (though I'm not sure why the size is needed at file-object creation time), because if I pass the file size to fs.open, things are much faster:

[snakeviz profile screenshot: fs.open with the size passed in]
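For reference, the faster variant looks roughly like this (assuming the file size can be forwarded to fs.open as a size= keyword):

import fsspec

fs = fsspec.filesystem("s3", anon=True)
path = "coiled-datasets/uber-lyft-tlc/part.93.parquet"

# The size would ideally come from an earlier listing; fs.info is just a
# stand-in here to show where the number comes from
size = fs.info(path)["size"]

# With the size supplied up front, open() can skip its own details call
f = fs.open(path, mode="rb", size=size)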

@martindurant do you have a sense for what's possible here to speed up opening files?

The actual use case I'm interested in is passing a bunch (~100k) of NetCDF files to Xarray, whose h5netcdf engine requires open file objects.
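For context, the pattern looks roughly like this (the bucket, prefix, and exact open_mfdataset arguments are illustrative):

import fsspec
import xarray as xr

fs = fsspec.filesystem("s3", anon=True)

# Hypothetical prefix; in practice this expands to ~100k NetCDF files
paths = fs.glob("some-bucket/some-prefix/*.nc")

# h5netcdf needs file-like objects, so each file is opened up front;
# each fs.open() currently pays the details-call latency shown above
files = [fs.open(p, mode="rb") for p in paths]

ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords")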

@martindurant
Member

s3fs caches file listings, so the simplest workaround is to ls()/find() the right locations beforehand, so that the lengths of all the files you need are already cached. We could also enable passing the size (+ etag, ...) explicitly to open() if you have that information from elsewhere; I think we talked about this.
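For example, something like this warms the directory cache so the subsequent opens can reuse the cached sizes (a sketch; the bucket/prefix is illustrative):

import fsspec

fs = fsspec.filesystem("s3", anon=True)

# One listing call; s3fs keeps the returned entries (including sizes)
# in its directory cache
info = fs.find("some-bucket/some-prefix", detail=True)

# These opens can now pick up the size from the cache instead of
# issuing a per-file details call
files = [fs.open(path, mode="rb") for path in info]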

Where you need real file-like objects supporting seek() random access, knowing the size is necessary so that the readahead buffer doesn't attempt to read bytes that don't exist in the target. On the other hand, the best caching strategy I have found for kerchunking HDF5 files is "first", since that's where the majority of the metadata lives. In that case, knowing the size shouldn't be required, and maybe we can do some work to make it a lazy attribute.
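For that access pattern, the open call would look something like this (a sketch; the path and block size are illustrative, cache_type="first" selects fsspec's first-block cache):

import fsspec

fs = fsspec.filesystem("s3", anon=True)

# "first" keeps only the initial block, where most HDF5 metadata lives;
# without readahead past that block, the total file size matters less
f = fs.open("some-bucket/some-file.nc", mode="rb",
            cache_type="first", block_size=4 * 2**20)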

(Including the etag is optional, but all open() calls currently do use it, to make sure the file hasn't changed during reading.)

@martindurant
Member

The fs.open call often takes ~0.5-1.5 seconds to run

Worth mentioning that this value will be higher on the first call, due to the time needed to set up the HTTP session (SSL, etc.) and query the bucket location; you would need to pay this latency at some point regardless.
