
opening a zarr dataset taking so much time #8902

Closed
DarshanSP19 opened this issue Apr 2, 2024 · 10 comments

Comments

@DarshanSP19

What is your issue?

I have an ERA5 dataset stored in a GCS bucket as Zarr. It contains 273 weather-related variables and 4 dimensions, with hourly data from 1940 to 2023.
When I try to open it with ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"), the open call takes 90 seconds to finish.
The chunk scheme is { 'time': 1 }.
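For reference, a minimal sketch to reproduce the timing (assuming gcsfs is installed for the gs:// protocol; the 90 s figure above comes from a call like this):

import time
import xarray as xr

start = time.perf_counter()
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3")
print(f"open_zarr took {time.perf_counter() - start:.1f}s")  # ~90s with default dask chunking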

@DarshanSP19 added the needs triage label Apr 2, 2024

welcome bot commented Apr 2, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas added the topic-performance and topic-zarr labels and removed the needs triage label Apr 2, 2024
@slevang (Contributor) commented Apr 2, 2024

I've felt the pain on this particular store as well; it's a nice test case for the whole stack: 3 PB total, and more than 1,000,000 chunks in each of the pressure-level variables!

Looks like this is a dask problem though. All the time is spent in single-threaded code creating the array chunks.

[screenshot: profile of the open_zarr call, dominated by single-threaded dask chunk creation]

If we skip dask with xr.open_zarr(..., chunks=None), it takes 1.5 s.

We currently have a drop_variables arg; when you have a dataset with 273 variables and only want a couple, the inverse (a keep_variables arg) would be a lot easier. drop_variables gets applied before we create the dask chunks for the arrays, so by reading the store once and then, on the second read, passing drop_variables=[v for v in ds.data_vars if v != "geopotential"], I recover a ~1.5 s read time.
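A minimal sketch of that two-read workaround (keeping only geopotential, as in the example above):

import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# First read: skip dask entirely, just to discover the variable names (~1.5s).
ds = xr.open_zarr(url, chunks=None)

# Second read: drop everything else before the dask chunks get created.
ds = xr.open_zarr(url, drop_variables=[v for v in ds.data_vars if v != "geopotential"])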

@dcherian (Contributor) commented Apr 2, 2024

Nice profile!

@jrbourbeau dask/dask#10648 probably improves this by a lot. Can that be reviewed/merged please?

edit: xref dask/dask#10269

@slevang (Contributor) commented Apr 2, 2024

Interestingly, things are almost 4x worse on both of those PRs, but both are out of date, so they may be missing other recent improvements.

[screenshot: profile of the same open_zarr call on the PR branches]

@riley-brady

@slevang, do you mind sharing how you are generating these profiles and associated graphs? I've struggled in the past to do this effectively with a dask cluster. This looks great!

@dcherian (Contributor) commented Apr 3, 2024

This is snakeviz: https://jiffyclub.github.io/snakeviz/

@slevang (Contributor) commented Apr 3, 2024

^yep. Probably not useful for distributed profiling, but I haven't really tried; it's just a visualization layer for cProfile.

In this case, the creation of this monster task graph happens serially in the main process even if your goal is eventually to run the processing on a distributed client. The graph (describing only the array chunks) is close to a billion objects here, so you'd run into issues even trying to serialize it out to workers.
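For anyone wanting to reproduce this kind of profile, a minimal sketch (snakeviz is a separate pip install; the .prof filename is arbitrary):

import cProfile
import xarray as xr

# Profile the open call and dump the stats to a file...
cProfile.run(
    'xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3")',
    "open_zarr.prof",
)
# ...then visualize it in the browser from a shell:
#   snakeviz open_zarr.prof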

@max-sixty (Collaborator) commented Apr 3, 2024

FYI something like:

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto"),
)

...may give a better balance between dask task-graph size and chunk size.

(I do this "make the dask chunks bigger than the zarr chunks" trick a lot, because even simple dask graphs can become huge with a 50 TB dataset, since the default zarr encoding has a maximum chunk size of 2 GB. Not sure it's necessarily the best way; very open to ideas...)
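One quick way to sanity-check such a rechunk-on-open is to count the dask chunks per variable, which is a proxy for the task-graph size; a sketch, assuming the temperature variable mentioned later in the thread:

import math

# numblocks gives the number of dask chunks along each dimension;
# their product is the number of tasks needed just to represent the array.
print(math.prod(ds["temperature"].data.numblocks))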

@slevang (Contributor) commented Apr 3, 2024

Edit: never mind, I actually have no idea where this profile came from. Disregard.

@DarshanSP19 (Author) commented

Here's how I got my work done:

  • I opened the dataset with chunks=None.
  • Then I filtered it as required, e.g. selecting only a few data variables, a fixed time range, and a fixed lat/lon range.
  • Then I chunked only that small dataset (see the code below).
import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=None,  # skip dask chunk creation on open
)
data_vars = ['surface_pressure', 'temperature']
ds = ds[data_vars]
ds = ds.sel(time=slice('2013-01-01', '2013-12-31'))
# ... and a few more filters ...
ds = ds.chunk()

This only generates chunks for the filtered dataset.
These steps worked for me, so I'm closing the issue.
