Fix for zero length collective IO - rebased #1634

Open

wants to merge 1 commit into master
Conversation

@aragilar (Member) commented Aug 9, 2020

Closes #965; this is a rebased version of #1206. As there were a bunch of changes which added something and then removed it, I've squashed everything, as that made it much easier to rebase.

Collective IO where one rank attempts to read or write a zero-length
slice causes a hang in an underlying collective call. This fix
ensures that all ranks participate in any collective IO operation, even
if some operations are on zero-length slices.

This also includes some basic unit tests for collective IO that cover
this problem.
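
For context, here is a minimal sketch of the pattern that used to hang (not taken from this PR's test suite, and assuming mpi4py plus an MPI-enabled h5py build): every rank enters the collective write, but the last rank's selection is empty.

```python
from mpi4py import MPI
import numpy as np
import h5py

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

with h5py.File("demo.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("data", (size, 10), dtype="f8")

    # Every rank except the last writes its own row; the last rank's
    # slice is zero length.
    start = rank
    stop = rank + 1 if rank < size - 1 else rank

    with dset.collective:
        # Before this fix, the rank with an empty selection could end up
        # skipping the underlying collective HDF5 call, leaving the other
        # ranks waiting (see #965).
        dset[start:stop, :] = np.full((stop - start, 10), rank, dtype="f8")
```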
@aragilar added the MPI label (Bugs related to MPI) Aug 9, 2020
@aragilar (Member, Author) commented Aug 9, 2020

It seems the tests are broadcasting; I'm not sure why.

@takluyver (Member) commented:
test_collective_write and test_collective_write_empty_rank both write a scalar (0D) value into a 2D selection, i.e. broadcasting it.

We implement some special casing which in this case pre-broadcasts the 0D data to 1D. I'd suggest that the broadcasting check should happen before that - it's an implementation detail, and it's still logically broadcasting if we do it with numpy before passing data to HDF5.
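
To make that concrete, here is a hedged numpy-only sketch (the names are illustrative, not h5py internals): expanding the 0D scalar with numpy before it reaches HDF5 doesn't change the logical picture, so a broadcasting check should run before that expansion.

```python
import numpy as np

value = np.float64(3.0)        # 0-D scalar, as written by the tests
selection_shape = (1, 10)      # 2-D selection in the dataset

# The special-cased 0-D -> 1-D expansion done before handing data to HDF5.
pre_broadcast = np.broadcast_to(value, (1,))

def is_broadcast(data_shape, sel_shape):
    # A simplistic stand-in for a broadcasting check: the write is a
    # broadcast whenever the data has fewer elements than the selection.
    return int(np.prod(data_shape)) != int(np.prod(sel_shape))

assert is_broadcast(value.shape, selection_shape)          # caught before expansion
assert is_broadcast(pre_broadcast.shape, selection_shape)  # still a broadcast after
```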

@kburns commented May 18, 2021

I just want to add a "+1" for this since it's something we've been working around for many years (#412 !!). I haven't poked around in the h5py internals before, but I might be able to help out with a little direction.

Labels: MPI (Bugs related to MPI)
Projects: None yet

Successfully merging this pull request may close these issues:

MPI hang during collective writes due to erroneous elision of ranks with nothing to write

4 participants