
Dataset to numpy conversion replaces part of the data with zeros #2394
Open
cbassa opened this issue Mar 19, 2024 · 13 comments

An occasional problem I run into is that when reading an HDF5 dataset with h5py and converting it to a numpy array, a large fraction of the output array is replaced with zeros. This appears to happen only when reading the entire dataset; reading a slice covering a subset of the dataset returns the correct data. What may be going wrong here?

The example code below yields these results:

Fraction of zeros with np.fromfile:  0.0007078439597315436
Fraction of zeros with cast to np.array:  0.633608081655481
Fraction of zeros with read_direct:  0.633608081655481
Fraction of zeros with horizontal slice:  0.633608081655481
Fraction of zeros with horizontal and vertical slice:  0.633608081655481
Fraction of zeros with almost complete vertical slice:  0.0007078439597315436
Fraction of zeros with complete vertical slice:  0.633608081655481

Because the data and header are split into separate files, the data can also be read directly with np.fromfile, which returns the expected values, of which only 0.07% are zeros. All other approaches, except the partial vertical slice, return arrays in which 63% of the values have been replaced with zeros.

The input array has a shape of 228864x6400 and consists of float32 values.

#!/usr/bin/env python3
import h5py
import numpy as np

if __name__ == "__main__":
    # File names
    h5fname = "L2037874_SAP000_B000_S0_P000_bf.h5"
    rawfname = "L2037874_SAP000_B000_S0_P000_bf.raw"
    group_name = "SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0"

    # Read with numpy directly
    data = np.fromfile(rawfname, dtype="float32").reshape(-1, 6400)
    print("Fraction of zeros with np.fromfile: ", np.sum(data == 0) / np.prod(data.shape))

    # Cast to numpy array
    h5 = h5py.File(h5fname, "r")
    data = np.array(h5[group_name])
    h5.close()
    print("Fraction of zeros with cast to np.array: ", np.sum(data == 0) / np.prod(data.shape))

    # Use read_direct
    h5 = h5py.File(h5fname, "r")
    data = np.zeros((228864, 6400), dtype="float32")
    h5[group_name].read_direct(data)
    h5.close()
    print("Fraction of zeros with read_direct: ", np.sum(data == 0) / np.prod(data.shape))    

    # Using horizontal slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:]
    h5.close()
    print("Fraction of zeros with horizontal slice: ", np.sum(data == 0) / np.prod(data.shape))    

    # Using horizontal and vertical slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:, :]
    h5.close()
    print("Fraction of zeros with horizontal and vertical slice: ", np.sum(data == 0) / np.prod(data.shape))    
    
    # Using almost complete vertical slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:, :6399]
    h5.close()
    print("Fraction of zeros with almost complete vertical slice: ", np.sum(data == 0) / np.prod(data.shape))    

    # Using complete vertical slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:, :6400]
    h5.close()
    print("Fraction of zeros with complete vertical slice: ", np.sum(data == 0) / np.prod(data.shape))    

Particulars of the software used:
Summary of the h5py configuration

h5py 3.9.0
HDF5 1.12.2
Python 3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.25.2
cython (built with) 0.29.35
numpy (built against) 1.23.2
HDF5 (built against) 1.12.2

@tacaswell (Member)

Can you include a script to generate the bad files? I doubt that the issue depends on the exact values in the array, so random data (with a fixed seed) or a known sequence should be good enough, but I do expect it to depend on the details of how the dataset is made. Without the details of the file it is very hard to debug.

cbassa (Author) commented Mar 19, 2024

This issue also exists if the raw file is created by numpy with np.ones((228864, 6400)).astype("float32").tofile("L2037874_SAP000_B000_S0_P000_bf.raw"). The corresponding HDF5 header file can be downloaded from https://filesender.surf.nl/?s=download&token=4ef7f8df-d51c-47e3-9abf-762bee623f92

I confirmed that this issue does not occur when the dataset is generated with, and included in, a single HDF5 file, with

import numpy as np
import h5py

h5 = h5py.File("test.h5", "w")
h5.create_dataset("data", data=np.ones((228864, 6400)).astype("float32"))
h5.close()

So it appears that this issue is related to having the HDF5 header separated from the raw binary file that contains the data.
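
For what it's worth, a quick sanity check is to compare the size of the raw file on disk with the number of bytes implied by the declared 228864x6400 float32 dataspace (a small sketch, assuming the file name from the report):

import os
import numpy as np

rawfname = "L2037874_SAP000_B000_S0_P000_bf.raw"
# Bytes implied by the declared dataspace: 228864 rows x 6400 columns x 4 bytes
expected_bytes = 228864 * 6400 * np.dtype("float32").itemsize
actual_bytes = os.path.getsize(rawfname)
print("expected:", expected_bytes, "found:", actual_bytes)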

@tacaswell (Member)

Do you have a script to make that header file?

cbassa (Author) commented Mar 19, 2024

No, I do not.

@tacaswell (Member)

@ajelenak I suspect that this may be an issue with libhdf5.

@ajelenak (Contributor)

For now, here's what h5dump says about this dataset:

HDF5 "L2037874_SAP000_B000_S0_P000_bf.h5" {
DATASET "/SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 228864, 6400 ) / ( H5S_UNLIMITED, 6400 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS
      EXTERNAL {
         FILENAME L2037874_SAP000_B000_S0_P000_bf.raw SIZE 18446744073709551615 OFFSET 0
      }
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ...
   ...
   ...
}

I think the SIZE 18446744073709551615 above is libhdf5's H5F_UNLIMITED constant, which is one of the allowed values in the H5Pset_external() function.
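
For reference, h5py exposes that constant as h5py.h5f.UNLIMITED, and, if I recall the h5py documentation correctly, passing only a file name to the external keyword is equivalent to [(name, 0, h5py.h5f.UNLIMITED)], which would produce exactly this SIZE in the header. A minimal sketch with placeholder file names:

import h5py
import numpy as np

print(h5py.h5f.UNLIMITED)  # 18446744073709551615, i.e. 2**64 - 1

# Write the external binary first, then create a header whose external
# size is left unlimited, as in the reported file.
np.zeros((128, 64), dtype=np.float32).tofile("unlimited_demo.bin")
with h5py.File("unlimited_demo.h5", "w") as f:
    f.create_dataset("demo", shape=(128, 64), dtype=np.float32,
                     external=[("unlimited_demo.bin", 0, h5py.h5f.UNLIMITED)])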

@ajelenak (Contributor)

   DATASPACE  SIMPLE { ( 228864, 6400 ) / ( H5S_UNLIMITED, 6400 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS

This is strange... An extendable dataset (when at least one dimension is unlimited) must be chunked and yet the storage is declared as contiguous. Below is a small Python program to test reading from a dataset with data in an external file:

from pathlib import Path
import numpy as np
import h5py


cwd = Path(__file__).parent
h5path = cwd.joinpath('dset_in_ext_file.h5')
h5path.unlink(missing_ok=True)
raw_path = h5path.with_suffix('.bin')
raw_path.unlink(missing_ok=True)

data = np.ones((128, 64), dtype=np.float32)
data.tofile(raw_path)

dset_path = 'SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0'
with h5py.File(h5path, mode='w') as f:
    f.create_dataset(dset_path, dtype=data.dtype, shape=data.shape, external=raw_path)

with h5py.File(h5path, mode='r') as f:
    assert np.array_equal(f[dset_path][...], data)

It works for me with libhdf5-1.14.3.

I tried to create an extendable dataset by setting the first dimension in maxshape to None and got an error: ValueError: Unable to synchronously create dataset (external storage not supported with chunked layout).
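
The attempt looked roughly like this (a sketch with placeholder shapes and file names):

import h5py
import numpy as np

data = np.ones((128, 64), dtype=np.float32)
data.tofile("dset_in_ext_file.bin")

with h5py.File("dset_in_ext_file.h5", "w") as f:
    # An unlimited first dimension forces a chunked layout, and libhdf5
    # rejects chunked layout in combination with external storage, hence:
    # ValueError: Unable to synchronously create dataset
    # (external storage not supported with chunked layout)
    f.create_dataset("STOKES_0", dtype=data.dtype, shape=data.shape,
                     maxshape=(None, 64), external="dset_in_ext_file.bin")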

epourmal commented Mar 20, 2024 via email

@ajelenak (Contributor)

I tried to get the same dataspace definition as the external dataset in the user's HDF5 file: 2D, extendable in the first dimension. As far as I know, this is only possible with a chunked layout, and the library then correctly raised an error. The question is: how was the user's HDF5 file created, then?

epourmal commented Mar 20, 2024 via email

@ajelenak (Contributor)

Thanks @epourmal! It seems the maxshape and external keywords do not currently work well together when creating a dataset.

I tried my Python example with the user's dataset shape of 228864x6400 and got an error when reading back the entire dataset:

HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /Users/ajelenak/Documents/h5py/hdf5/src/H5D.c line 1061 in H5Dread(): can't synchronously read data
    major: Dataset
    minor: Read failed
  #001: /Users/ajelenak/Documents/h5py/hdf5/src/H5D.c line 1008 in H5D__read_api_common(): can't read data
    major: Dataset
    minor: Read failed
  #002: /Users/ajelenak/Documents/h5py/hdf5/src/H5VLcallback.c line 2092 in H5VL_dataset_read_direct(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #003: /Users/ajelenak/Documents/h5py/hdf5/src/H5VLcallback.c line 2048 in H5VL__dataset_read(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #004: /Users/ajelenak/Documents/h5py/hdf5/src/H5VLnative_dataset.c line 373 in H5VL__native_dataset_read(): can't read data
    major: Dataset
    minor: Read failed
  #005: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dio.c line 401 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #006: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dcontig.c line 842 in H5D__contig_read(): contiguous read failed
    major: Dataset
    minor: Read failed
  #007: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dselect.c line 459 in H5D__select_read(): read error
    major: Dataspace
    minor: Read failed
  #008: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dselect.c line 219 in H5D__select_io(): read error
    major: Dataspace
    minor: Read failed
  #009: /Users/ajelenak/Documents/h5py/hdf5/src/H5Defl.c line 453 in H5D__efl_readvv(): can't perform vectorized EFL read
    major: Dataset
    minor: Can't operate on object
  #010: /Users/ajelenak/Documents/h5py/hdf5/src/H5VM.c line 1263 in H5VM_opvv(): can't perform operation
    major: Internal error (too specific to document in detail)
    minor: Can't operate on object
  #011: /Users/ajelenak/Documents/h5py/hdf5/src/H5Defl.c line 403 in H5D__efl_readvv_cb(): EFL read failed
    major: Dataset
    minor: Read failed
  #012: /Users/ajelenak/Documents/h5py/hdf5/src/H5Defl.c line 276 in H5D__efl_read(): read error in external raw data file
    major: External file list
    minor: Read failed

My favorite error message in the stack: Internal error (too specific to document in detail). 😄

cbassa (Author) commented Mar 21, 2024

The question is how was the user's HDF5 file then created?

The header and raw file were created by an application that requires high throughput. That is why the HDF5 header is created first (I assume using the HDF5 tools, but I would have to check), while the data in the raw file is written using the standard fwrite functionality of C/C++.

@epourmal

The question is how was the user's HDF5 file then created?

The header and raw file were created by an application that requires high throughput. That is why the HDF5 header is created first (I assume using the HDF5 tools, but I would have to check), while the data in the raw file is written using the standard fwrite functionality of C/C++.

One can create the external binary file first and then create an HDF5 file with a dataset whose external storage points to the binary file. The feature was created exactly for this use case, to facilitate access to external binary data through the HDF5 library.
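
In h5py terms, that workflow would look roughly like the sketch below, with the binary file written first and the dataset header pointing at it via an explicit (name, offset, size) entry; the file names and shape are placeholders, and whether giving an explicit size rather than H5F_UNLIMITED changes the behaviour reported above has not been tested here.

import h5py
import numpy as np

# 1. The application writes the raw binary data first (fwrite in C/C++,
#    tofile here for illustration).
data = np.ones((1024, 6400), dtype=np.float32)
data.tofile("external_data.raw")

# 2. The HDF5 "header" file is created afterwards; its dataset carries no
#    data of its own, only a pointer to the raw file.
with h5py.File("external_header.h5", "w") as f:
    f.create_dataset("STOKES_0", shape=data.shape, dtype=data.dtype,
                     external=[("external_data.raw", 0, data.nbytes)])

# 3. Reads go through the HDF5 header but pull the bytes from the raw file.
with h5py.File("external_header.h5", "r") as f:
    print(f["STOKES_0"][:2, :4])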
