Dataset to numpy conversion replaces part of the data with zeros #2394
Can you include a script to generate the bad files? I doubt that the issue depends on the exact values in the array, so random data (with a fixed seed) or a known sequence should be good enough, but I do expect it to depend on the details of how the dataset is made. Without the details of the file it is very hard to debug.
This issue also exists if the raw file is created by numpy. I confirmed that this issue does not occur when the dataset is generated with, and included in, a single HDF5 file.
So it appears that this issue is related to having the HDF5 header separated from the raw binary file that contains the data.
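For reference, a minimal sketch of that single-file control case (hypothetical file and dataset names; random data with a fixed seed, as suggested above):

```python
import numpy as np
import h5py

# Hypothetical control case: the same kind of data stored inside a single,
# self-contained HDF5 file instead of a separate external raw file.
data = np.random.default_rng(seed=0).random((1024, 64)).astype(np.float32)

with h5py.File('single_file.h5', 'w') as f:
    f.create_dataset('STOKES_0', data=data)

with h5py.File('single_file.h5', 'r') as f:
    assert np.array_equal(f['STOKES_0'][...], data)  # reads back intact
```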
Do you have a script to make that header file?
No, I do not.
@ajelenak I suspect that this may be an issue with libhdf5.
For now, here's what h5dump says about this dataset:

```
HDF5 "L2037874_SAP000_B000_S0_P000_bf.h5" {
DATASET "/SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 228864, 6400 ) / ( H5S_UNLIMITED, 6400 ) }
   STORAGE_LAYOUT {
      CONTIGUOUS
      EXTERNAL {
         FILENAME L2037874_SAP000_B000_S0_P000_bf.raw SIZE 18446744073709551615 OFFSET 0
      }
   }
   FILTERS {
      NONE
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_IFSET
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ...
}
```

I think the …
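If it helps, that external-storage declaration can also be inspected programmatically through h5py's low-level API; the sketch below assumes the user's file is available locally. Note that 18446744073709551615 is 2**64 - 1, the "unlimited size" sentinel that h5py exposes as h5py.h5f.UNLIMITED:

```python
import h5py

# Sketch: dump the external-file list from the dataset's creation
# property list and check whether the declared size is the unlimited
# sentinel (2**64 - 1 == h5py.h5f.UNLIMITED).
with h5py.File('L2037874_SAP000_B000_S0_P000_bf.h5', 'r') as f:
    dset = f['SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0']
    dcpl = dset.id.get_create_plist()
    for i in range(dcpl.get_external_count()):
        name, offset, size = dcpl.get_external(i)
        print(name, offset, size, size == h5py.h5f.UNLIMITED)
```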
This is strange... An extendable dataset (when at least one dimension is unlimited) must be chunked, and yet the storage is declared as contiguous. Below is a small Python program to test reading from a dataset with data in an external file:

```python
from pathlib import Path

import numpy as np
import h5py

cwd = Path(__file__).parent
h5path = cwd.joinpath('dset_in_ext_file.h5')
h5path.unlink(missing_ok=True)
raw_path = h5path.with_suffix('.bin')
raw_path.unlink(missing_ok=True)

data = np.ones((128, 64), dtype=np.float32)
data.tofile(raw_path)

dset_path = 'SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0'
with h5py.File(h5path, mode='w') as f:
    f.create_dataset(dset_path, dtype=data.dtype, shape=data.shape, external=raw_path)

with h5py.File(h5path, mode='r') as f:
    assert np.array_equal(f[dset_path][...], data)
```

It works for me with libhdf5-1.14.3. I tried to create an extendable dataset by setting the first dimension in `maxshape` to `None` and got an error: `ValueError: Unable to synchronously create dataset (external storage not supported with chunked layout)`.
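For completeness, that failing attempt looks roughly like this (a sketch reusing `h5path`, `data`, and `raw_path` from the program above):

```python
# Sketch: requesting an unlimited first dimension through the high-level
# API forces chunked layout, which external storage does not allow, so
# dataset creation fails.
with h5py.File(h5path, mode='a') as f:
    try:
        f.create_dataset('extendable', dtype=data.dtype, shape=data.shape,
                         maxshape=(None, data.shape[1]), external=raw_path)
    except ValueError as exc:
        print(exc)  # ... external storage not supported with chunked layout
```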
HDF5 doesn’t support chunked storage for external datasets. The storage is always contiguous; data can be added only along the slowest changing dimension.
I tried to get the same dataspace definition as the external dataset in the user's HDF5 file: 2D, extendable in the first dimension. As far as I know, this is only possible with chunked layout, and the library then correctly raised an error. The question, then, is how the user's HDF5 file was created?
See the external.c file in the HDF5 test directory and the test_unlimited test within it.

Elena
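Translated to h5py's low-level API, that test_unlimited scenario would look roughly like the sketch below (untested, with made-up file names): a contiguous dataset with external storage of unlimited size and an unlimited first dimension, which is presumably how a file like the user's could be produced.

```python
import h5py

# Sketch: dataspace with an unlimited first dimension, plus a dataset
# creation property list declaring external storage of unlimited size.
# Mirrors the test_unlimited case in HDF5's test/external.c.
space = h5py.h5s.create_simple((128, 64), (h5py.h5s.UNLIMITED, 64))
dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
dcpl.set_external(b'ext_unlimited.bin', 0, h5py.h5f.UNLIMITED)

with h5py.File('ext_unlimited.h5', 'w') as f:
    # Contiguous layout is the default; no chunking is requested.
    h5py.h5d.create(f.id, b'STOKES_0', h5py.h5t.IEEE_F32LE, space, dcpl)
```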
Thanks @epourmal! Seems like that is how such a file can be created. I tried my Python example with the user's dataset shape 228864x6400 and got an error when reading back the entire dataset:

My favorite error message in the stack:
The header and raw file were created by an application that requires high throughput. That is why the HDF5 header is created first (I assume using the HDF5 tools, but I would have to check), while the data in the raw file is written using the standard `fwrite` functionality of C/C++.
One can create the external binary file first and then create an HDF5 file and a dataset with external storage pointing to the binary file. The feature was created exactly for this use case: to facilitate access to external binary data using the HDF5 library.
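In outline, that workflow looks like the sketch below (hypothetical names; numpy's `tofile` stands in for the C/C++ `fwrite`-based writer mentioned above):

```python
import numpy as np
import h5py

# Step 1: some external, high-throughput process writes the raw binary data.
data = np.ones((128, 64), dtype=np.float32)
data.tofile('payload.raw')

# Step 2: create the HDF5 "header" file, whose dataset merely points at the
# pre-existing binary file via an external storage entry (name, offset, size).
with h5py.File('payload.h5', 'w') as f:
    f.create_dataset('payload', shape=data.shape, dtype=data.dtype,
                     external=[('payload.raw', 0, data.nbytes)])
```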
An occasional problem I run into is that when reading an HDF5 dataset with h5py and converting it to a numpy array, a large fraction of the output array is replaced with zeros. This appears to happen only when reading the entire dataset; slicing a subset of the dataset returns the correct data. What may be going wrong here?
The example code below yields these results:
As the data and header are split into separate files, reading the data with `np.fromfile` returns the expected input, of which only 0.07% consists of zeros. All other approaches, except the partial slicing, return arrays in which 63% of the values have been replaced with zeros. The input array has a shape of 228864x6400 values as float32.
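A sketch of that comparison (a hypothetical reproduction using the file and dataset names from the h5dump output above; the full array is about 5.9 GB in memory):

```python
import numpy as np
import h5py

def zero_fraction(a):
    """Fraction of elements in `a` that are exactly zero."""
    return np.count_nonzero(a == 0) / a.size

# Reading the raw file directly: ~0.07% zeros (the expected input).
raw = np.fromfile('L2037874_SAP000_B000_S0_P000_bf.raw',
                  dtype=np.float32).reshape(228864, 6400)
print(zero_fraction(raw))

with h5py.File('L2037874_SAP000_B000_S0_P000_bf.h5', 'r') as f:
    dset = f['SUB_ARRAY_POINTING_000/BEAM_000/STOKES_0']
    # Full read through h5py: ~63% of the array comes back as zeros.
    print(zero_fraction(dset[...]))
    # Partial slice: returns the correct data.
    print(zero_fraction(dset[:1000]))
```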
Particulars of the used software:

```
Summary of the h5py configuration
---------------------------------
h5py                    3.9.0
HDF5                    1.12.2
Python                  3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801]
sys.platform            linux
sys.maxsize             9223372036854775807
numpy                   1.25.2
cython (built with)     0.29.35
numpy (built against)   1.23.2
HDF5 (built against)    1.12.2
```