MPI hang during collective writes due to erroneous elision of ranks with nothing to write #965
@chissg If you can create a PR which adds the necessary calls (and tests), then I'd be happy to merge it.
jrs65 added a commit to jrs65/h5py that referenced this issue on Apr 12, 2019:

Collective IO where one rank attempts to read or write a zero-length slice causes a hang in an underlying collective call. This fix ensures that all ranks participate in any collective IO operation even if some operations are on zero-length slices. It also includes some basic unit testing for collective IO that tests for this problem.
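For context, a collective-IO check along the lines the commit message describes might look roughly like the sketch below. This is illustrative only (it is not the test added in that commit); it assumes an MPI-enabled build of h5py and the file name and shapes are made up.

```python
# Illustrative sketch (not the commit's actual test): exercise a collective read
# where one rank's selection is empty. Run under MPI, e.g.
#   mpiexec -n 4 python collective_read_sketch.py
from mpi4py import MPI
import numpy as np
import h5py

comm = MPI.COMM_WORLD
rank, nranks = comm.rank, comm.size

# Write one row per rank, collectively.
with h5py.File("collective_read_demo.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("x", (nranks, 4), dtype="f8")
    with dset.collective:
        dset[rank, :] = np.full(4, rank, dtype="f8")

# Read back, giving the last rank a zero-length slice. All ranks must still
# enter the collective read, or the others will hang.
with h5py.File("collective_read_demo.h5", "r", driver="mpio", comm=comm) as f:
    dset = f["x"]
    stop = rank + 1 if rank < nranks - 1 else rank
    with dset.collective:
        data = dset[rank:stop, :]
    expected_rows = 1 if rank < nranks - 1 else 0
    assert data.shape == (expected_rows, 4)
```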
aragilar pushed a commit to aragilar/h5py that referenced this issue on Aug 9, 2020, with the same commit message as above.
* Operating System: CentOS 7-ish (Scientific Linux 7.3)
* Python version: 2.7.14
* Where Python was acquired: Institutional build (https://scisoft.fnal.gov/packages/python/)
* h5py version: 2.7.1
* numpy version: 1.13.3
* HDF5 version: develop@4d4eb4cd5c (verified with 1.10.1)
* Full traceback/stack trace: N/A
In current versions of HDF5, collective writes are available, but there is rarely a compelling reason to use them: they are faster than the equivalent set of independent writes, but the work required to ensure collectivity is often not worth the trouble. With the upcoming release, however, HDF5 will allow filters to be applied to parallel writes to a dataset, but only if those writes are collective. This means that even when a particular rank has nothing to write, it must still make the HDF5 write call. Unfortunately, as shown by the attached Python demonstration program, h5py appears to elide these no-op writes, causing an MPI-related hang during the collective operation.

Correspondence with an HDF5 developer (attached) confirms that the underlying HDF5 library is perfectly capable of handling no-op writes as part of a collective operation, but they do need to be passed through.

I'd be grateful if h5py could be enhanced to no longer elide no-op collective write operations.
To demonstrate the problem:
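The demonstration program referenced above is attached to the issue and is not reproduced here; the following is a minimal sketch of the same scenario (file name, shapes, and rank layout are illustrative), assuming an MPI-enabled build of h5py.

```python
# Minimal sketch of the hang: run with e.g. `mpiexec -n 4 python demo.py`.
# One rank is given a zero-length slice; if h5py skips the underlying HDF5
# write for that rank, the remaining ranks block forever in the collective call.
from mpi4py import MPI
import numpy as np
import h5py

comm = MPI.COMM_WORLD
rank, nranks = comm.rank, comm.size

with h5py.File("collective_demo.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (nranks - 1, 10), dtype="f8")

    # Every rank except the last has one row to write; the last rank has nothing.
    if rank < nranks - 1:
        start, stop = rank, rank + 1
    else:
        start, stop = nranks - 1, nranks - 1   # zero-length slice

    with dset.collective:
        # All ranks must reach this write for the collective operation to
        # complete, even the rank whose selection is empty.
        dset[start:stop, :] = np.full((stop - start, 10), rank, dtype="f8")
```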