open_datatree performance improvement on NetCDF, H5, and Zarr files #9014

aladinor · 2024-05-07T19:24:11Z

open_datatree performance improvement on NetCDF files

Closes Improving performance of open_datatree #8994 (NetCDF + Zarr datatree)
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
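The description does not say where the time goes. A common cause of slow group-by-group reads is reopening the container once per group instead of opening it once and reusing the handle. The sketch below only illustrates that difference, on the assumption that this is the kind of overhead being targeted; `CountingContainer`, `read_reopening`, and `read_single_handle` are hypothetical stand-ins and are not taken from this PR or from xarray.

```python
class CountingContainer:
    """Hypothetical stand-in for a hierarchical file that counts open() calls."""

    def __init__(self, groups):
        self._groups = groups  # maps group path -> payload
        self.open_calls = 0

    def open(self):
        # Each call models the cost of opening the file on disk.
        self.open_calls += 1
        return self._groups


def read_reopening(container, paths):
    # Opens the container again for every group.
    return {path: container.open()[path] for path in paths}


def read_single_handle(container, paths):
    # Opens the container once and reuses the handle for every group.
    handle = container.open()
    return {path: handle[path] for path in paths}


groups = {"/": "root", "/a": "child a", "/a/b": "grandchild b"}

slow = CountingContainer(groups)
read_reopening(slow, list(groups))

fast = CountingContainer(groups)
read_single_handle(fast, list(groups))

print(slow.open_calls, fast.open_calls)  # prints: 3 1
```

With three groups the first reader opens the container three times and the second opens it once, so the gap grows with the number of groups in the file.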

welcome · 2024-05-07T19:24:14Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

xarray/backends/zarr.py

flamingbear

I had thoughts about the legacyhdf5 api and how it might be incorporated.

xarray/backends/netCDF4_.py

xarray/backends/zarr.py

xarray/backends/h5netcdf_.py

xarray/backends/h5netcdf_.py

flamingbear

This looks great. Thanks for working through the dual library stuff with me.

TomNicholas · 2024-06-04T22:40:06Z

Yes very excited by this! Two final things:

This deserves a whats-new.rst entry!
Would you be willing to add a benchmark test? You can see here how we benchmark opening and loading a single netCDF file:

xarray/asv_bench/benchmarks/dataset_io.py

Line 125 in 447e5a3

def time_load_dataset_netcdf4(self):

Alternatively we could leave adding that benchmark to a separate PR?

kmuehlbauer · 2024-06-05T06:10:20Z

Sorry for being late to the party, but do we really want to have mode in open_datatree? The open_* functions are for read access only (imho).

Update: It looks like there are already several keyword arguments in the open_*-functions at least for netcdf4/h5netcdf backends which are only needed for write access (mode, format, invalid_netcdf, clobber, diskless, persist and maybe even more).

kmuehlbauer

Besides my other comment on the open_*-function kwargs, there is not much to add to @TomNicholas's comment.

We might want to further deduplicate code in the backends (by moving this into backend/store.py). Quite some work has already been done by @jthielen in #7437.

Should open_datatree be added to the API with this PR too?

Those issues can be handled in a subsequent PR.

Great work @aladinor 🚀.

keewis · 2024-06-05T09:16:18Z

In the meeting yesterday we decided to not bother with deduplication for now (and anyways it is nice to have backends somewhat self-contained).

The API will be extended in #9033, the idea is to have one single PR that marks the entire DataTree API as public.

Edit: and I agree, since open_dataset in particular already has these parameters, it's fine if open_datatree also has them (before marking BackendEntrypoint.open_datatree as public and documenting it we should probably clean it up a bit more)

open_datatree performance improvement on NetCDF files

14aaf56

aladinor added 5 commits May 7, 2024 14:44

fixing issue with forward slashes

3a5edb4

Merge branch 'main' into datatree-zarr

72d7660

fixing issue with pytest

d9dde29

fixing issue with pytest

2bc5e73

Merge branch 'main' into datatree-zarr

89fb4fb

TomNicholas added the topic-DataTree label (Related to the implementation of a DataTree class) May 8, 2024

TomNicholas added this to In progress in DataTree integration via automation May 8, 2024

Illviljan added the run-benchmark label (Run the ASV benchmark workflow) May 10, 2024

aladinor added 4 commits May 10, 2024 07:59

open datatree in zarr format improvement

0343f10

Merge branch 'main' into datatree-zarr

93e1d59

fixing incompatibility in returned object

ac11b3e

Merge branch 'datatree-zarr' of https://github.com/aladinor/xarray into datatree-zarr merging into same branch

6d0ee13

aladinor changed the title ~~open_datatree performance improvement on NetCDF files~~ open_datatree performance improvement on NetCDF and Zarr files May 10, 2024

Merge branch 'main' into datatree-zarr

91c5f0a

shoyer reviewed May 14, 2024

xarray/backends/zarr.py (resolved)

aladinor added 5 commits May 18, 2024 17:36

Merge branch 'main' into datatree-zarr

3363e91

passing group parameter to opendatatree method and reducing duplicated code

7bba52c

Merge branch 'datatree-zarr' of https://github.com/aladinor/xarray into datatree-zarr merging branches

725aed7

passing group parameter to opendatatree method - NetCDF

903effd

Merge branch 'main' into datatree-zarr

d468478

flamingbear reviewed May 20, 2024

xarray/backends/netCDF4_.py (resolved)

TomNicholas added the topic-backends, io, and topic-performance labels May 28, 2024

TomNicholas reviewed May 28, 2024

xarray/backends/netCDF4_.py (outdated, resolved)

aladinor and others added 3 commits May 28, 2024 17:07

Update xarray/backends/netCDF4_.py

51da175

renaming variables Co-authored-by: Tom Nicholas <tom@cworthy.org>

Merge branch 'main' into datatree-zarr

24881bd

renaming variables

5f4bff1

renaming variables

41ceb4f

aladinor requested a review from flamingbear May 29, 2024 01:34

aladinor added 3 commits May 29, 2024 11:03

renaming group_store variable

f18ead6

removing _open_datatree_netcdf function not used anymore in open_datatree implementations

33d9769

improving performance of open_datatree method

3345b92

aladinor changed the title ~~open_datatree performance improvement on NetCDF and Zarr files~~ open_datatree performance improvement on NetCDF, H5, and Zarr files May 29, 2024

keewis reviewed May 29, 2024

xarray/backends/zarr.py (outdated, resolved)

xarray/backends/zarr.py (outdated, resolved)

aladinor added 2 commits May 29, 2024 14:07

renaming 'i' variable within list comprehension in open_store method for zarr datatree

3cb131c

using the default generator instead of loading zarr groups in memory

6a759c0
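The commit message above contrasts a generator with loading every group into memory first. As a rough illustration of that distinction (not the code in xarray/backends/zarr.py), the sketch below walks a nested dict that stands in for a store hierarchy. The function name `iter_group_paths` and the dict layout are assumptions made for the example.

```python
from collections.abc import Iterator


def iter_group_paths(tree: dict, parent: str = "/") -> Iterator[str]:
    """Lazily yield the path of every group, depth first, parent before children."""
    yield parent
    for name, children in tree.items():
        # Avoid a double slash when the parent is the root.
        child = parent.rstrip("/") + "/" + name
        yield from iter_group_paths(children, child)


# Hypothetical hierarchy: keys are group names, values are their child groups.
hierarchy = {"a": {"b": {}}, "c": {}}

paths = iter_group_paths(hierarchy)
print(next(paths))   # only the root has been visited so far
print(list(paths))   # the remaining groups, produced on demand
```

Nothing is traversed until the caller asks for the next path, so a consumer that stops early, or that handles one group at a time, never holds the full list.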

aladinor force-pushed the datatree-zarr branch from a055344 to 6a759c0 May 29, 2024 20:30

flamingbear reviewed May 29, 2024

xarray/backends/h5netcdf_.py (resolved)

aladinor added 3 commits May 29, 2024 17:09

fixing issue with group path to avoid using group[1:] notation. Adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args

6c00641

fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files

189b497

fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files

a9c306d
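These three commits replace the `group[1:]` slicing. Slicing drops the first character whether or not it is a slash, so it only works when every path has exactly one leading `/`. A minimal sketch of a slice-free alternative follows, assuming `pathlib.PurePosixPath` for the path handling. The helper names `normalize_group` and `relative_to_parent` are hypothetical, and the PR itself may use a different path type.

```python
from pathlib import PurePosixPath


def normalize_group(path: str) -> str:
    """Return an absolute group path with single slashes and no trailing slash."""
    # Strip slashes at both ends, then re-root the path at "/".
    return str(PurePosixPath("/") / path.strip("/"))


def relative_to_parent(path: str, parent: str = "/") -> str:
    """Return `path` relative to `parent`; raises ValueError if it is not below it."""
    rel = PurePosixPath(normalize_group(path)).relative_to(normalize_group(parent))
    return str(rel)


# Slicing silently damages a path that has no leading slash.
print("a/b"[1:])                         # /b
print(normalize_group("a/b"))            # /a/b
print(normalize_group("/a/b/"))          # /a/b
print(relative_to_parent("/a/b"))        # a/b
print(relative_to_parent("/a/b", "/a"))  # b
```

Normalizing first means the same helper accepts `a/b`, `/a/b`, and `/a/b/`, and a path outside the parent fails loudly instead of producing a wrong name.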

flamingbear reviewed May 29, 2024

xarray/backends/h5netcdf_.py (resolved)

flamingbear reviewed May 29, 2024

xarray/backends/h5netcdf_.py (resolved)

aladinor added 2 commits June 3, 2024 08:26

Merge branch 'main' into datatree-zarr

fad0e76

Merge branch 'main' into datatree-zarr

792f9c7

dcherian requested a review from kmuehlbauer June 4, 2024 16:11

aladinor added 2 commits June 4, 2024 13:08

adding 'mode' parameter to open_datatree method

8c5796f

adding 'mode' parameter to H5NetCDFStore.open method

728b374

flamingbear approved these changes Jun 4, 2024

Merge branch 'main' into datatree-zarr

74b9a7c

kmuehlbauer approved these changes Jun 5, 2024

TomNicholas mentioned this pull request Jun 11, 2024

DataTree: Segmentation fault with open_datatree(engine='netCDF4') with data variables that are a string type #9093

Open
