
opening a zarr dataset taking so much time #8902

Closed
DarshanSP19 opened this issue Apr 2, 2024 · 10 comments

Comments

@DarshanSP19

What is your issue?

I have an ERA5 dataset stored in a GCS bucket as Zarr. It contains 273 weather-related variables and 4 dimensions, with hourly data from 1940 to 2023.
When I try to open it with ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"), the open call takes 90 seconds to finish.
The chunk scheme is { 'time': 1 }.
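For reference, a minimal sketch to reproduce the timing (assuming gcsfs is installed for the gs:// protocol; the 90 s figure above comes from a call like this):

import time
import xarray as xr

start = time.perf_counter()
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3")
print(f"open_zarr took {time.perf_counter() - start:.1f}s")  # ~90s with default dask chunking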

@DarshanSP19 added the needs triage label Apr 2, 2024

welcome bot commented Apr 2, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas added the topic-performance and topic-zarr labels and removed the needs triage label Apr 2, 2024
@slevang (Contributor) commented Apr 2, 2024

I've felt the pain on this particular store as well; it's a nice test case for the whole stack: 3 PB total, and more than 1,000,000 chunks in each of the pressure-level variables!

Looks like this is a dask problem though. All the time is spent in single-threaded code creating the array chunks.

[screenshot: profile of the open_zarr call, dominated by single-threaded dask chunk creation]

If we skip dask with xr.open_zarr(..., chunks=None), it takes 1.5 s.

We currently have a drop_variables arg; when you have a dataset with 273 variables and only want a couple, the inverse (a keep_variables arg) would be a lot easier. drop_variables gets applied before we create the dask chunks for the arrays, so by reading the store once and then, on the second read, passing drop_variables=[v for v in ds.data_vars if v != "geopotential"], I recover a ~1.5 s read time.
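A minimal sketch of that two-read workaround (keeping only geopotential, as in the example above):

import xarray as xr

url = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"

# First read: skip dask entirely, just to discover the variable names (~1.5s).
ds = xr.open_zarr(url, chunks=None)

# Second read: drop everything else before the dask chunks get created.
ds = xr.open_zarr(url, drop_variables=[v for v in ds.data_vars if v != "geopotential"])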

@dcherian (Contributor) commented Apr 2, 2024

Nice profile!

@jrbourbeau dask/dask#10648 probably improves this by a lot. Can that be reviewed/merged please?

edit: xref dask/dask#10269

@slevang (Contributor) commented Apr 2, 2024

Interestingly, things are almost 4x worse on both of those PRs, but both are out of date, so they may be missing other recent improvements.

[screenshot: profile of the same open_zarr call on the PR branches]

@riley-brady

@slevang, do you mind sharing how you are generating these profiles and associated graphs? I've struggled in the past to do this effectively with a dask cluster. This looks great!

@dcherian (Contributor) commented Apr 3, 2024

This is snakeviz: https://jiffyclub.github.io/snakeviz/

@slevang (Contributor) commented Apr 3, 2024

^yep. Probably not useful for distributed profiling, but I haven't really tried; it's just a visualization layer for cProfile.

In this case, the creation of this monster task graph happens serially in the main process even if your goal is eventually to run the processing on a distributed client. The graph (describing only the array chunks) is close to a billion objects here, so you'd run into issues even trying to serialize it out to workers.
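For anyone wanting to reproduce this kind of profile, a minimal sketch (snakeviz is a separate pip install; the .prof filename is arbitrary):

import cProfile
import xarray as xr

# Profile the open call and dump the stats to a file...
cProfile.run(
    'xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3")',
    "open_zarr.prof",
)
# ...then visualize it in the browser from a shell:
#   snakeviz open_zarr.prof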

@max-sixty (Collaborator) commented Apr 3, 2024

FYI something like:

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=dict(time=-1, level=-1, latitude="auto", longitude="auto"),
)

...may give a better balance between dask task-graph size and chunk size.

(I do this "make the dask chunks bigger than the zarr chunks" trick a lot, because even simple dask graphs can become huge with a 50 TB dataset, since the default zarr encoding has a maximum chunk size of 2 GB. Not sure it's necessarily the best way; very open to ideas...)
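One quick way to sanity-check such a rechunk-on-open is to count the dask chunks per variable, which is a proxy for the task-graph size; a sketch, assuming the temperature variable mentioned later in the thread:

import math

# numblocks gives the number of dask chunks along each dimension;
# their product is the number of tasks needed just to represent the array.
print(math.prod(ds["temperature"].data.numblocks))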

@slevang (Contributor) commented Apr 3, 2024

Edit: never mind, I actually have no idea where this profile came from. Disregard.

@DarshanSP19 (Author) commented

Here's how I got my work done:

  • I opened the dataset with chunks=None.
  • Then I filtered it as required, e.g. selecting only a few data variables, a fixed time range, and a fixed lat/lon range.
  • Then I chunked only that small dataset (see the code below).
import xarray as xr

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=None,  # skip dask chunk creation on open
)
data_vars = ['surface_pressure', 'temperature']
ds = ds[data_vars]
ds = ds.sel(time=slice('2013-01-01', '2013-12-31'))
# ... and a few more filters ...
ds = ds.chunk()

This only generates chunks for the filtered dataset.
These steps worked for me, so I'm closing the issue.
