Failed to write independently in parallel hdf5 #2330

Open
ekdldkqkr opened this issue Oct 17, 2023 · 6 comments

Comments


ekdldkqkr commented Oct 17, 2023

Hi, I'm having trouble running the parallel HDF5 tutorial from the h5py documentation:
https://docs.h5py.org/en/latest/mpi.html

I have tried the following:

from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank  # The process ID (integer 0-3 for 4-process run)

f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)

dset = f.create_dataset('test', (4,), dtype='i')
dset[rank] = rank

f.close()

and the following error occurs:

dset[rank] = rank
    ~~~~^^^^^^
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/user_name/.conda/envs/pmp/lib/python3.11/site-packages/h5py/_hl/dataset.py", line 999, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 282, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 115, in h5py._proxy.dset_rw
OSError: Can't synchronously write data (Can't perform independent write when MPI_File_sync is required by ROMIO driver.)

It does create the HDF5 file, but the dataset seems to have been written by only one process.

I checked the created HDF5 file with h5dump:

h5dump parallel_test.hdf5
HDF5 "parallel_test.hdf5" {
GROUP "/" {
   DATASET "test" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 0, 0, 0, 0
      }
   }
}
}

The result differs from the following, which is what the documentation shows:

HDF5 "parallel_test.hdf5" {
GROUP "/" {
   DATASET "test" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 0, 1, 2, 3
      }
   }
}
}
  • Operating System : Ubuntu 20.04.2 LTS
  • Python version : 3.11.6
  • Where Python was acquired : Anaconda on linux
  • h5py version : 3.10.0
  • HDF5 version : 1.14.2
  • mpi4py version: 3.1.4
python -c 'import h5py; print(h5py.version.info)'

gives the following:

h5py    3.10.0
HDF5    1.14.2
Python  3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.26.0
cython (built with) 0.29.36
numpy (built against) 1.23.5
HDF5 (built against) 1.14.2
@ajelenak
Contributor

Have you installed the parallel build of libhdf5? What does the conda list hdf5 command show?

There are different builds of libhdf5:

% conda search hdf5 | grep "1\.14\.2"
hdf5                          1.14.2 mpi_mpich_h3618df7_0  conda-forge
hdf5                          1.14.2 mpi_openmpi_h01be5f8_0  conda-forge
hdf5                          1.14.2 nompi_hedada53_100  conda-forge

You must use one of the "mpi" builds.

ekdldkqkr (Author) commented Oct 18, 2023

@ajelenak Thanks for the great comment.

The builds of my hdf5 and other related packages, from conda list:

# hdf related
hdf5                      1.14.2          mpi_mpich_ha2c2bf8_0    conda-forge
h5py                      3.10.0          mpi_mpich_py311hef8708c_0    conda-forge

# mpi related
mpi                       1.0                       mpich    conda-forge
mpi4py                    3.1.4           py311he01e52e_1    conda-forge
mpich                     4.1.2              h846660c_100    conda-forge

@ajelenak
Contributor

You seem to be using the right build of libhdf5. The error comes from the library, not h5py. Someone who also runs h5py with MPI may be able to help more.

@jhendersonHDF

Hi @ekdldkqkr,

Note that this check was added in HDFGroup/hdf5@6633210 for initial support of UnifyFS. I'd advise against it, but you may be able to disable this check by setting the environment variable HDF5_DO_MPI_FILE_SYNC=FALSE or HDF5_DO_MPI_FILE_SYNC=0.
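
For reference, a minimal sketch of that environment-variable workaround (untested here; it assumes libhdf5 reads the variable when the file is opened, so the safest place to set it is before h5py is even imported):

import os

# Assumption: HDF5 reads HDF5_DO_MPI_FILE_SYNC at file-access time, so set it
# before any file is opened. "FALSE" or "0" both disable the sync requirement.
os.environ["HDF5_DO_MPI_FILE_SYNC"] = "FALSE"

from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank

f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
dset = f.create_dataset('test', (4,), dtype='i')
dset[rank] = rank  # independent write, as in the original example
f.close()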

However, I believe what's happening is that collective MPI I/O was requested, but HDF5 had to break collective I/O for some reason (for example, a type conversion may have been needed under conditions that collective I/O doesn't support). I don't know if h5py exposes a routine for it, but HDF5 has the C API H5Pget_mpio_no_collective_cause, which is called on a Dataset Transfer Property List (DXPL) and returns a uint32_t with all the reasons collective I/O was broken bitwise-ORed together. If it's possible to call that from h5py and determine why collective I/O was broken, we could figure out what's going on. The h5py folks will know more than I do about anything in h5py that may cause this.
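
On the h5py side, the documented way to request collective rather than independent I/O is the dataset's collective context manager. A minimal sketch of the original example rewritten to use a collective write, in case that sidesteps the independent-write restriction (not verified in this particular environment):

from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank

f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
dset = f.create_dataset('test', (4,), dtype='i')

# Collective write: every rank must enter this block and perform its write.
with dset.collective:
    dset[rank] = rank

f.close()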

@ajelenak
Contributor

Thanks @jhendersonHDF for the helpful suggestion on how to resolve the problem. I can confirm that the H5Pget_mpio_no_collective_cause function is not available in h5py.

Since the h5py code used here is from an example in its documentation, it must have run successfully with an older libhdf5 version. This suggests that recent libhdf5 features for MPI-based computing may require revisiting the current h5py support for them.

@jhendersonHDF

Based on some discussion with other folks, I'm wondering if this isn't due to changes made in ROMIO that are picked up in the install of MPICH from conda-forge. The issue is that HDF5 is picking up an MPI_Info hint from ROMIO, romio_visibility_immediate, that is used to determine whether MPI_File_sync calls need to be made after MPI I/O writes. If that hint comes back as false, you will get the error in this thread when trying to do independent I/O. Otherwise, if it comes back as true, there should be no problem. I believe this hint is only supposed to be set to false when using the UnifyFS backend in ROMIO, which I'm assuming isn't the case here. There are details about this on pages 13 and 15 of https://unifyfs.readthedocs.io/_/downloads/en/latest/pdf/.

If the UnifyFS backend is not being used, then setting the environment variable HDF5_DO_MPI_FILE_SYNC=FALSE or HDF5_DO_MPI_FILE_SYNC=0 should be fine for disabling this error, though this shouldn't be needed. I may need to ask the MPICH folks about this to see if there is unexpected behavior here.
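
One rough way to see which hints ROMIO reports on this system is to open an existing file through plain mpi4py and dump its MPI_Info. A sketch (the filename is just the one from this thread; hint names and values depend on the MPICH/ROMIO build, so romio_visibility_immediate may or may not appear):

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Open an existing file on the target filesystem via MPI-IO and list the hints
# ROMIO attaches to it; run this under mpiexec with a few ranks, like the
# original script.
fh = MPI.File.Open(comm, 'parallel_test.hdf5', MPI.MODE_RDONLY)
info = fh.Get_info()

if comm.rank == 0:
    for i in range(info.Get_nkeys()):
        key = info.Get_nthkey(i)
        print(key, '=', info.Get(key))

info.Free()
fh.Close()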
