NetCDFDataset support for engine="h5netcdf" #620

Closed
charlesbmi opened this issue Mar 19, 2024 · 0 comments · Fixed by #631

@charlesbmi (Contributor)

Description

I am excited to use the new NetCDFDataset class with xarray. My (ultrasound) data isn't an exact fit for NetCDF because it can be complex (I/Q data). In xarray, I use this workaround:

ds.to_netcdf(
    "data/intermediate_data_iq.h5",
    # Needed when saving complex values, which are not supported in netCDF4 subset of HDF5
    invalid_netcdf=True,
    engine="h5netcdf",
)

This works pretty well for me for saving HDF5 that is close-but-not-exactly NetCDF.

However, the current Kedro implementation of saving to a bytes_buffer doesn't work with the h5netcdf or netcdf4 engines.

Context

Why is this change important to you? How would you use it? How can it benefit other users?

Ultrasound data has a lot of associated coordinates/metadata (e.g. image physical location) that I find helpful to organize with xarray. This change would enable me to fully use Kedro for a data processing pipeline.

This would also benefit other users who want to use netCDF version 4, because the current bytes-buffer approach only supports the scipy engine and is therefore limited to NETCDF3_64BIT. As the xarray to_netcdf documentation puts it:

File-like objects are only supported by the scipy engine. If no path is provided, this function returns the resulting netCDF file as bytes; in this case, we need to use scipy, which does not support netCDF version 4 (the default format becomes NETCDF3_64BIT).

Possible Implementation

@astrojuanlu suggested:

the way we usually compensate for this is by copying from the fsspec location to a temporary file.
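A minimal sketch of that temporary-file approach, using only the standard library. The names save_via_temp_file, write_to_path and open_remote are hypothetical and chosen for illustration; the implementation merged in #631 may differ.

```python
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable


def save_via_temp_file(
    write_to_path: Callable[[Path], object],
    open_remote: Callable[[], BinaryIO],
    filename: str = "data.nc",
) -> None:
    """Write to a local temporary file, then copy its bytes to the target.

    write_to_path: hypothetical callback that writes the dataset to a real
        file path, so any engine that requires a path can be used.
    open_remote: hypothetical callback returning a writable binary file
        object for the final (possibly remote) location.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = Path(tmp_dir) / filename
        # Step 1: let the writer produce a regular file on local disk.
        write_to_path(local_path)
        # Step 2: stream the finished file to its final destination.
        with open(local_path, "rb") as src, open_remote() as dst:
            shutil.copyfileobj(src, dst)


# Assumed usage with xarray and an fsspec filesystem object `fs`
# (both `ds` and `fs` are placeholders, not defined here):
#
# save_via_temp_file(
#     lambda path: ds.to_netcdf(path, engine="h5netcdf", invalid_netcdf=True),
#     lambda: fs.open("bucket/intermediate_data_iq.h5", "wb"),
# )
```

Because the engine only ever sees a local path, the choice of engine no longer depends on what the remote filesystem supports; the cost is one extra copy of the file.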

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

  1. Complex numbers can be represented in NetCDF as an extra real/imaginary dimension. This is a workaround, but it would be nice to work with complex data natively.
  2. Use the HDF5Dataset. However, the Dataset/coordinate/metadata management of xarray is nice, and it would be great to use that in our pipelines.
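For reference, the first alternative (an extra real/imaginary dimension) could look roughly like this with NumPy. The function names and the choice of a trailing axis are assumptions made for illustration, not an established convention.

```python
import numpy as np


def complex_to_real_imag(values: np.ndarray) -> np.ndarray:
    """Append a trailing axis of length 2 holding (real, imaginary) parts."""
    return np.stack([values.real, values.imag], axis=-1)


def real_imag_to_complex(parts: np.ndarray) -> np.ndarray:
    """Invert complex_to_real_imag by recombining the trailing axis."""
    return parts[..., 0] + 1j * parts[..., 1]


# Small made-up I/Q array, purely for demonstration.
iq = np.array([[1 + 2j, 3 - 4j], [0.5j, -1.0 + 0j]])
encoded = complex_to_real_imag(iq)   # real-valued, one extra dimension
decoded = real_imag_to_complex(encoded)
```

The encoded array is plain floating point, so it fits standard NetCDF, but every reader and writer has to remember to convert back and forth.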

Thanks for the help!

@merelcht linked a pull request on May 20, 2024 that will close this issue
merelcht pushed a commit that referenced this issue May 22, 2024
#631)

* Change NetCDFDataset to use a temporary file for remote filesystems, to allow other to_netcdf engines
* Update unit test to include save engine for NetCDFDataset
* Fix unit-test error where folder was accessed before being created

Signed-off-by: Charles Guan <3221512+charlesincharge@users.noreply.github.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>