
Opening lots of files can be slow #816

Open
jrbourbeau opened this issue Oct 26, 2023 · 2 comments

Comments

@jrbourbeau
Contributor

When I open a file on S3 like this:

import fsspec

fs = fsspec.filesystem('s3', anon=True)
path = "coiled-datasets/uber-lyft-tlc/part.93.parquet"
fs.open(path, mode="rb")

The fs.open call often takes ~0.5-1.5 seconds to run. Here's a snakeviz profile (again, just of the fs.open call) where it looks like most time is spent in a details call that hits S3:

[snakeviz profile screenshot: fs.open without a known size]

I think this is mostly to get the file size (though I'm not sure why the size is needed at file-object creation time), because if I pass the file size to fs.open, things are much faster:

[snakeviz profile screenshot: fs.open with the size passed in]
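For reference, the faster variant looks roughly like this (assuming the file size can be forwarded to fs.open as a size= keyword):

import fsspec

fs = fsspec.filesystem("s3", anon=True)
path = "coiled-datasets/uber-lyft-tlc/part.93.parquet"

# The size would ideally come from an earlier listing; fs.info is just a
# stand-in here to show where the number comes from
size = fs.info(path)["size"]

# With the size supplied up front, open() can skip its own details call
f = fs.open(path, mode="rb", size=size)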

@martindurant do you have a sense for what's possible here to speed up opening files?

The actual use case I'm interested in is passing a bunch (~100k) of NetCDF files to Xarray, whose h5netcdf engine requires open file objects.
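For context, the pattern looks roughly like this (the bucket, prefix, and exact open_mfdataset arguments are illustrative):

import fsspec
import xarray as xr

fs = fsspec.filesystem("s3", anon=True)

# Hypothetical prefix; in practice this expands to ~100k NetCDF files
paths = fs.glob("some-bucket/some-prefix/*.nc")

# h5netcdf needs file-like objects, so each file is opened up front;
# each fs.open() currently pays the details-call latency shown above
files = [fs.open(p, mode="rb") for p in paths]

ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords")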

@martindurant
Member

s3fs caches file listings, so the simplest workaround is to ls()/find() the right locations beforehand, so that the lengths of all the files you need are already cached. We could also enable passing the size (+ etag, ...) explicitly to open() if you have that information from elsewhere; I think we talked about this.
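For example, something like this warms the directory cache so the subsequent opens can reuse the cached sizes (a sketch; the bucket/prefix is illustrative):

import fsspec

fs = fsspec.filesystem("s3", anon=True)

# One listing call; s3fs keeps the returned entries (including sizes)
# in its directory cache
info = fs.find("some-bucket/some-prefix", detail=True)

# These opens can now pick up the size from the cache instead of
# issuing a per-file details call
files = [fs.open(path, mode="rb") for path in info]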

Where you need real file-like objects supporting seek() random access, knowing the size is necessary so that the readahead buffer doesn't attempt to read bytes that don't exist in the target. On the other hand, the best caching strategy I have found for kerchunking HDF5 files is "first", since that's where the majority of the metadata lives. In that case, knowing the size shouldn't be required, and maybe we can do some work to make it a lazy attribute.
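For that access pattern, the open call would look something like this (a sketch; the path and block size are illustrative, cache_type="first" selects fsspec's first-block cache):

import fsspec

fs = fsspec.filesystem("s3", anon=True)

# "first" keeps only the initial block, where most HDF5 metadata lives;
# without readahead past that block, the total file size matters less
f = fs.open("some-bucket/some-file.nc", mode="rb",
            cache_type="first", block_size=4 * 2**20)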

(Including the etag is optional, but all open() calls currently do use it, to make sure the file hasn't changed during reading.)

@martindurant
Member

The fs.open call often takes ~0.5-1.5 seconds to run

Worth mentioning that this value will be higher on the first call, due to the time needed to set up the HTTP session (SSL, etc.) and query the bucket location; you would need to pay this latency at some point regardless.
