
Dataset to numpy conversion replaces part of the data with zeros #2394
Open
cbassa opened this issue Mar 19, 2024 · 13 comments

An occasional problem I run into is that when reading an HDF5 dataset with h5py and converting it to a numpy array, a large fraction of the output array is replaced with zeros. This appears to happen only when reading the entire dataset; reading a slice covering a subset of the dataset returns the correct data. What may be going wrong here?

The example code below yields these results:

Fraction of zeros with np.fromfile:  0.0007078439597315436
Fraction of zeros with cast to np.array:  0.633608081655481
Fraction of zeros with read_direct:  0.633608081655481
Fraction of zeros with horizontal slice:  0.633608081655481
Fraction of zeros with horizontal and vertical slice:  0.633608081655481
Fraction of zeros with almost complete vertical slice:  0.0007078439597315436
Fraction of zeros with complete vertical slice:  0.633608081655481

Because the data and header are split into separate files, the data can also be read directly with np.fromfile, which returns the expected values, of which only 0.07% are zeros. All other approaches, except the partial vertical slice, return arrays in which 63% of the values have been replaced with zeros.

The input array has a shape of 228864x6400 and consists of float32 values.

#!/usr/bin/env python3
import h5py
import numpy as np

if __name__ == "__main__":
    # File names
    h5fname = "L2037874_SAP000_B000_S0_P000_bf.h5"
    rawfname = "L2037874_SAP000_B000_S0_P000_bf.raw"
    group_name = "SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0"

    # Read with numpy directly
    data = np.fromfile(rawfname, dtype="float32").reshape(-1, 6400)
    print("Fraction of zeros with np.fromfile: ", np.sum(data == 0) / np.prod(data.shape))

    # Cast to numpy array
    h5 = h5py.File(h5fname, "r")
    data = np.array(h5[group_name])
    h5.close()
    print("Fraction of zeros with cast to np.array: ", np.sum(data == 0) / np.prod(data.shape))

    # Use read_direct
    h5 = h5py.File(h5fname, "r")
    data = np.zeros((228864, 6400), dtype="float32")
    h5[group_name].read_direct(data)
    h5.close()
    print("Fraction of zeros with read_direct: ", np.sum(data == 0) / np.prod(data.shape))    

    # Using horizontal slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:]
    h5.close()
    print("Fraction of zeros with horizontal slice: ", np.sum(data == 0) / np.prod(data.shape))    

    # Using horizontal and vertical slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:, :]
    h5.close()
    print("Fraction of zeros with horizontal and vertical slice: ", np.sum(data == 0) / np.prod(data.shape))    
    
    # Using almost complete vertical slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:, :6399]
    h5.close()
    print("Fraction of zeros with almost complete vertical slice: ", np.sum(data == 0) / np.prod(data.shape))    

    # Using complete vertical slice
    h5 = h5py.File(h5fname, "r")
    data = h5[group_name][:, :6400]
    h5.close()
    print("Fraction of zeros with complete vertical slice: ", np.sum(data == 0) / np.prod(data.shape))    

Particulars of the software used:
Summary of the h5py configuration

h5py 3.9.0
HDF5 1.12.2
Python 3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.25.2
cython (built with) 0.29.35
numpy (built against) 1.23.2
HDF5 (built against) 1.12.2

@tacaswell (Member)

Can you include a script to generate the bad files? I doubt that the issue depends on the exact values in the array, so random data (with a fixed seed) or a known sequence should be good enough, but I do expect it to depend on the details of how the dataset is made. Without the details of the file it is very hard to debug.

cbassa (Author) commented Mar 19, 2024

This issue also exists if the raw file is created by numpy with np.ones((228864, 6400)).astype("float32").tofile("L2037874_SAP000_B000_S0_P000_bf.raw"). The corresponding HDF5 header file can be downloaded from https://filesender.surf.nl/?s=download&token=4ef7f8df-d51c-47e3-9abf-762bee623f92

I confirmed that this issue does not occur when the dataset is generated with, and included in, a single HDF5 file, with

import numpy as np
import h5py

h5 = h5py.File("test.h5", "w")
h5.create_dataset("data", data=np.ones((228864, 6400)).astype("float32"))
h5.close()

So it appears that this issue is related to having the HDF5 header separated from the raw binary file that contains the data.
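
For what it's worth, a quick sanity check is to compare the size of the raw file on disk with the number of bytes implied by the declared 228864x6400 float32 dataspace (a small sketch, assuming the file name from the report):

import os
import numpy as np

rawfname = "L2037874_SAP000_B000_S0_P000_bf.raw"
# Bytes implied by the declared dataspace: 228864 rows x 6400 columns x 4 bytes
expected_bytes = 228864 * 6400 * np.dtype("float32").itemsize
actual_bytes = os.path.getsize(rawfname)
print("expected:", expected_bytes, "found:", actual_bytes)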

@tacaswell (Member)

Do you have a script to make that header file?

cbassa (Author) commented Mar 19, 2024

No, I do not.

@tacaswell (Member)

@ajelenak I suspect that this may be an issue with libhdf5.

@ajelenak (Contributor)

For now, here's what h5dump says about this dataset:

HDF5 "L2037874_SAP000_B000_S0_P000_bf.h5" {
DATASET "/SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 228864, 6400 ) / ( H5S_UNLIMITED, 6400 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS
      EXTERNAL {
         FILENAME L2037874_SAP000_B000_S0_P000_bf.raw SIZE 18446744073709551615 OFFSET 0
      }
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ...
   ...
   ...
}

I think the SIZE 18446744073709551615 above is libhdf5's H5F_UNLIMITED constant, which is one of the allowed values in the H5Pset_external() function.
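
For reference, h5py exposes that constant as h5py.h5f.UNLIMITED, and, if I recall the h5py documentation correctly, passing only a file name to the external keyword is equivalent to [(name, 0, h5py.h5f.UNLIMITED)], which would produce exactly this SIZE in the header. A minimal sketch with placeholder file names:

import h5py
import numpy as np

print(h5py.h5f.UNLIMITED)  # 18446744073709551615, i.e. 2**64 - 1

# Write the external binary first, then create a header whose external
# size is left unlimited, as in the reported file.
np.zeros((128, 64), dtype=np.float32).tofile("unlimited_demo.bin")
with h5py.File("unlimited_demo.h5", "w") as f:
    f.create_dataset("demo", shape=(128, 64), dtype=np.float32,
                     external=[("unlimited_demo.bin", 0, h5py.h5f.UNLIMITED)])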

@ajelenak (Contributor)

   DATASPACE  SIMPLE { ( 228864, 6400 ) / ( H5S_UNLIMITED, 6400 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS

This is strange... An extendable dataset (when at least one dimension is unlimited) must be chunked and yet the storage is declared as contiguous. Below is a small Python program to test reading from a dataset with data in an external file:

from pathlib import Path
import numpy as np
import h5py


cwd = Path(__file__).parent
h5path = cwd.joinpath('dset_in_ext_file.h5')
h5path.unlink(missing_ok=True)
raw_path = h5path.with_suffix('.bin')
raw_path.unlink(missing_ok=True)

data = np.ones((128, 64), dtype=np.float32)
data.tofile(raw_path)

dset_path = 'SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0'
with h5py.File(h5path, mode='w') as f:
    f.create_dataset(dset_path, dtype=data.dtype, shape=data.shape, external=raw_path)

with h5py.File(h5path, mode='r') as f:
    assert np.array_equal(f[dset_path][...], data)

It works for me with libhdf5-1.14.3.

I tried to create an extendable dataset by setting the first dimension in maxshape to None and got an error: ValueError: Unable to synchronously create dataset (external storage not supported with chunked layout).
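
The attempt looked roughly like this (a sketch with placeholder shapes and file names):

import h5py
import numpy as np

data = np.ones((128, 64), dtype=np.float32)
data.tofile("dset_in_ext_file.bin")

with h5py.File("dset_in_ext_file.h5", "w") as f:
    # An unlimited first dimension forces a chunked layout, and libhdf5
    # rejects chunked layout in combination with external storage, hence:
    # ValueError: Unable to synchronously create dataset
    # (external storage not supported with chunked layout)
    f.create_dataset("STOKES_0", dtype=data.dtype, shape=data.shape,
                     maxshape=(None, 64), external="dset_in_ext_file.bin")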

epourmal commented Mar 20, 2024 via email

@ajelenak (Contributor)

I tried to get the same dataspace definition as the external dataset in the user's HDF5 file: 2D, extendable in the first dimension. As far as I know, this is only possible with a chunked layout, and the library then correctly raised an error. The question is: how was the user's HDF5 file created, then?

epourmal commented Mar 20, 2024 via email

@ajelenak (Contributor)

Thanks @epourmal! It seems the maxshape and external keywords do not currently work well together when creating a dataset.

I tried my Python example with the user's dataset shape of 228864x6400 and got an error when reading back the entire dataset:

HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 0:
  #000: /Users/ajelenak/Documents/h5py/hdf5/src/H5D.c line 1061 in H5Dread(): can't synchronously read data
    major: Dataset
    minor: Read failed
  #001: /Users/ajelenak/Documents/h5py/hdf5/src/H5D.c line 1008 in H5D__read_api_common(): can't read data
    major: Dataset
    minor: Read failed
  #002: /Users/ajelenak/Documents/h5py/hdf5/src/H5VLcallback.c line 2092 in H5VL_dataset_read_direct(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #003: /Users/ajelenak/Documents/h5py/hdf5/src/H5VLcallback.c line 2048 in H5VL__dataset_read(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #004: /Users/ajelenak/Documents/h5py/hdf5/src/H5VLnative_dataset.c line 373 in H5VL__native_dataset_read(): can't read data
    major: Dataset
    minor: Read failed
  #005: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dio.c line 401 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #006: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dcontig.c line 842 in H5D__contig_read(): contiguous read failed
    major: Dataset
    minor: Read failed
  #007: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dselect.c line 459 in H5D__select_read(): read error
    major: Dataspace
    minor: Read failed
  #008: /Users/ajelenak/Documents/h5py/hdf5/src/H5Dselect.c line 219 in H5D__select_io(): read error
    major: Dataspace
    minor: Read failed
  #009: /Users/ajelenak/Documents/h5py/hdf5/src/H5Defl.c line 453 in H5D__efl_readvv(): can't perform vectorized EFL read
    major: Dataset
    minor: Can't operate on object
  #010: /Users/ajelenak/Documents/h5py/hdf5/src/H5VM.c line 1263 in H5VM_opvv(): can't perform operation
    major: Internal error (too specific to document in detail)
    minor: Can't operate on object
  #011: /Users/ajelenak/Documents/h5py/hdf5/src/H5Defl.c line 403 in H5D__efl_readvv_cb(): EFL read failed
    major: Dataset
    minor: Read failed
  #012: /Users/ajelenak/Documents/h5py/hdf5/src/H5Defl.c line 276 in H5D__efl_read(): read error in external raw data file
    major: External file list
    minor: Read failed

My favorite error message in the stack: Internal error (too specific to document in detail). 😄

cbassa (Author) commented Mar 21, 2024

The question is how was the user's HDF5 file then created?

The header and raw file were created by an application that requires high throughput. That is why the HDF5 header is created first (I assume using the HDF5 tools, but I would have to check), while the data in the raw file is written using the standard fwrite functionality of C/C++.

@epourmal

The question is how was the user's HDF5 file then created?

The header and raw file were created by an application that requires high throughput. That is why the HDF5 header is created first (I assume using the HDF5 tools, but I would have to check), while the data in the raw file is written using the standard fwrite functionality of C/C++.

One can create the external binary file first and then create an HDF5 file with a dataset whose external storage points to the binary file. The feature was created exactly for this use case, to facilitate access to external binary data through the HDF5 library.
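
In h5py terms, that workflow would look roughly like the sketch below, with the binary file written first and the dataset header pointing at it via an explicit (name, offset, size) entry; the file names and shape are placeholders, and whether giving an explicit size rather than H5F_UNLIMITED changes the behaviour reported above has not been tested here.

import h5py
import numpy as np

# 1. The application writes the raw binary data first (fwrite in C/C++,
#    tofile here for illustration).
data = np.ones((1024, 6400), dtype=np.float32)
data.tofile("external_data.raw")

# 2. The HDF5 "header" file is created afterwards; its dataset carries no
#    data of its own, only a pointer to the raw file.
with h5py.File("external_header.h5", "w") as f:
    f.create_dataset("STOKES_0", shape=data.shape, dtype=data.dtype,
                     external=[("external_data.raw", 0, data.nbytes)])

# 3. Reads go through the HDF5 header but pull the bytes from the raw file.
with h5py.File("external_header.h5", "r") as f:
    print(f["STOKES_0"][:2, :4])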
