open_datatree performance improvement on NetCDF, H5, and Zarr files #9014

Open
wants to merge 33 commits into
base: main

@aladinor commented May 7, 2024

open_datatree performance improvement on NetCDF files

welcome bot commented May 7, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@TomNicholas added the topic-DataTree label (Related to the implementation of a DataTree class) May 8, 2024
@TomNicholas added this to In progress in DataTree integration via automation May 8, 2024
@Illviljan added the run-benchmark label (Run the ASV benchmark workflow) May 10, 2024
@aladinor changed the title from "open_datatree performance improvement on NetCDF files" to "open_datatree performance improvement on NetCDF and Zarr files" May 10, 2024
Contributor

@flamingbear left a comment

I had thoughts about the legacyhdf5 api and how it might be incorporated.

xarray/backends/netCDF4_.py (resolved)
aladinor and others added 3 commits May 28, 2024 17:07
@aladinor requested a review from flamingbear May 29, 2024 01:34
aladinor

This comment was marked as outdated.

@aladinor changed the title from "open_datatree performance improvement on NetCDF and Zarr files" to "open_datatree performance improvement on NetCDF, H5, and Zarr files" May 29, 2024
xarray/backends/zarr.py (outdated, resolved)
xarray/backends/zarr.py (outdated, resolved)
…g group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files
…ding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files
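
The three commits above mention a `group` argument typed as `str | Iterable[str] | callable`. As a rough illustration only, the sketch below shows one way a single parameter could accept all three forms. `GroupSpec` and `select_groups` are hypothetical names, not part of xarray, and the matching rules used here (exact path match for strings, a predicate for callables) are assumptions; the thread does not state how the backends interpret each form.

```python
from collections.abc import Callable, Iterable
from typing import Union

# Hypothetical alias mirroring the hint quoted in the commit messages.
GroupSpec = Union[str, Iterable[str], Callable[[str], bool], None]


def select_groups(all_groups: Iterable[str], group: GroupSpec = None) -> list:
    """Return the group paths selected by `group`, keeping the input order.

    Assumed semantics (not taken from xarray):
      - None: keep every group
      - str: keep the group whose path equals the string
      - callable: keep groups for which the predicate returns True
      - any other iterable of str: keep groups whose path is in it
    """
    paths = list(all_groups)
    if group is None:
        return paths
    # Check str before the generic iterable case, since str is itself iterable.
    if isinstance(group, str):
        return [p for p in paths if p == group]
    if callable(group):
        return [p for p in paths if group(p)]
    wanted = set(group)
    return [p for p in paths if p in wanted]


paths = ["/", "/a", "/a/b", "/c"]
only_a_subtree = select_groups(paths, lambda p: p.startswith("/a"))
```

Checking `str` first matters because a bare string would otherwise be treated as an iterable of single characters.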
@dcherian requested a review from kmuehlbauer June 4, 2024 16:11
Contributor

@flamingbear left a comment

This looks great. Thanks for working through the dual library stuff with me.

@TomNicholas
Contributor

Yes, very excited by this! Two final things:

  • This deserves a whats-new.rst entry!
  • Would you be willing to add a benchmark test? You can see here how we benchmark opening and loading a single netCDF file:

def time_load_dataset_netcdf4(self):

Alternatively we could leave adding that benchmark to a separate PR?

@kmuehlbauer
Contributor

kmuehlbauer commented Jun 5, 2024

Sorry for being late to the party, but do we really want to have mode in open_datatree? The open_* functions are for read access only (imho).

Update: It looks like there are already several keyword arguments in the open_*-functions at least for netcdf4/h5netcdf backends which are only needed for write access (mode, format, invalid_netcdf, clobber, diskless, persist and maybe even more).

Contributor

@kmuehlbauer left a comment

Besides my other comment on the open_* function kwargs, there is not much to add to @TomNicholas's comment.

We might want to further deduplicate code in the backends (by moving this into backends/store.py). Quite some work has already been done by @jthielen in #7437.

Should open_datatree be added to the API with this PR too?

Those issues can be handled in a subsequent PR.

Great work @aladinor 🚀.

@keewis
Collaborator

keewis commented Jun 5, 2024

In the meeting yesterday we decided not to bother with deduplication for now (and anyway, it is nice to have backends somewhat self-contained).

The API will be extended in #9033; the idea is to have one single PR that marks the entire DataTree API as public.

Edit: and I agree. Since open_dataset in particular already has these parameters, it's fine if open_datatree also has them (before marking BackendEntrypoint.open_datatree as public and documenting it, we should probably clean it up a bit more).

Labels
io, run-benchmark (Run the ASV benchmark workflow), topic-backends, topic-DataTree (Related to the implementation of a DataTree class), topic-performance
Projects
Development

Successfully merging this pull request may close these issues.

Improving performance of open_datatree
7 participants