
Fix for zero length collective IO (issue #965) #1206

Open · wants to merge 6 commits into master

Conversation

@jrs65 commented Apr 13, 2019

Collective IO where one rank attempts to read or write a zero-length slice causes a hang in an underlying collective call (issue #965). This bit me recently, so I thought I'd attempt to fix it. This PR ensures that all ranks participate in every collective IO operation, even when some of them operate on zero-length slices. It also includes some basic unit tests for collective IO that cover this problem.

I've attempted to limit the effects of this fix to the case where MPI collective mode is enabled. To do this I needed to make a slight change to the API of Selection.broadcast, adding a parameter which tells it to return zero-length selections when in collective mode.

Anyway, please let me know what other work this needs. Thanks!
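
For context, a minimal sketch of the failure mode (illustrative only, not taken verbatim from the original issue; it assumes an MPI-enabled h5py build run with two or more ranks):

from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD

with h5py.File('collective.h5', 'w', driver='mpio', comm=comm) as fh:
    dset = fh.create_dataset('test', shape=(comm.size,), dtype='f8')
    with dset.collective:
        if comm.rank == 0:
            # Zero-length selection: before this fix, rank 0 returned early
            # and never entered the underlying collective HDF5 call...
            dset[0:0] = np.empty(0)
        else:
            # ...while the remaining ranks waited in that call forever.
            dset[comm.rank:comm.rank + 1] = float(comm.rank)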

Collective IO where one rank attempts to read or write a zero length
slice causes a hang in an underlying collective call. This fix
ensures that all ranks participate in any collective IO operation even
if some operations are on zero length slices.

This also includes some basic unit testing for collective IO that tests
for this problem.
@codecov bot commented Apr 13, 2019

Codecov Report

Merging #1206 into master will decrease coverage by 0.11%.
The diff coverage is 72.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1206      +/-   ##
==========================================
- Coverage   83.79%   83.67%   -0.12%     
==========================================
  Files          18       18              
  Lines        2160     2169       +9     
==========================================
+ Hits         1810     1815       +5     
- Misses        350      354       +4
Impacted Files Coverage Δ
h5py/_hl/selections.py 81.31% <100%> (ø) ⬆️
h5py/_hl/dataset.py 83.58% <64.28%> (-0.62%) ⬇️
h5py/_hl/filters.py 89.2% <0%> (+0.25%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05ea019...8cdbad4. Read the comment docs.

@jrs65 changed the title from "Fix for zero length collective IO (#965)" to "Fix for zero length collective IO (issue #965)" on Apr 13, 2019
@takluyver (Member)

This looks OK to me, but as I don't use MPI, I'd like someone who does to review it.

Notes:

  • Could you clarify the comments a bit about why it needs to behave differently when doing collective operations? I think I follow it from your PR description, but I don't think I'd understand it if I were just looking at the code in a few months' time.
    • Possibly also mention why we're not doing the same thing in all cases - presumably it's a performance optimisation. Have we actually measured how much it saves?
  • I'd be inclined to rename the parameters when calling into the select module, something like include_empty. This means you can understand that code without knowing about MPI, even if you can't see why the parameter is there.
  • Are the collective I/O tests run in any of the many CI jobs? If not, is it practical to do so? When CI runs most of the tests, it's easy to overlook the ones that it doesn't.

@jrs65 (Author) commented Apr 29, 2019

Thanks @takluyver.

That all makes sense. I will add some more inline comments to explain exactly why this needs to be done, and I'll update the parameter names.

As for why this was being done in the first place, I'm not sure. Presumably it's a performance optimisation, although whether it's a worthwhile one I don't know. For this PR I wanted to keep the change minimal; having all users start making the HDF5 calls for empty selections seemed like it could be a subtly breaking change for a much larger pool of users.

Finally, on the CI front, I don't see why it couldn't be run on the CI server. You just need to call your test runner with mpirun for that test only. I found the CI config a little impenetrable, though. I'll try to look back through it again, but if you have any pointers they would be appreciated.

@takluyver (Member)

The CI is a bit complex. It's all controlled through tox, which is configured by the tox.ini file. That defines a number of named 'environments', i.e. test jobs. Each environment has dependencies to be installed and commands to run.

Then the Travis and Appveyor config files largely just specify which tox environment to run in each job (plus some other bits and pieces to ensure it can build against HDF5 properly).

There's an mpi4py factor already in the tox config which includes mpi4py in the environment. But there's not currently anything to run with mpirun, so it would need a new factor for that.

Alternatively, if there are only going to be a few tests that need this, it might be simpler for the test code itself to start mpirun as a subprocess, run a script, and assert on the expected results from the script. This could also make it practical to turn those hangs into failures, by imposing a (generous) time-limit on waiting for the subprocess.
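
A rough sketch of that subprocess approach (the test name, script contents, and 60-second timeout are illustrative, not from this PR; it assumes pytest's tmp_path fixture and an mpirun on the PATH):

import subprocess
import sys

def test_collective_write_under_mpirun(tmp_path):
    # Write a small MPI script to a temp file and run it under mpirun.
    out_file = tmp_path / "out.h5"
    script = tmp_path / "mpi_script.py"
    script.write_text(
        "from mpi4py import MPI\n"
        "import h5py\n"
        "comm = MPI.COMM_WORLD\n"
        "with h5py.File(%r, 'w', driver='mpio', comm=comm) as f:\n"
        "    dset = f.create_dataset('x', shape=(comm.size,), dtype='f8')\n"
        "    with dset.collective:\n"
        "        dset[comm.rank:comm.rank + 1] = float(comm.rank)\n"
        % str(out_file)
    )
    # A generous time limit turns a collective-IO hang into a test failure.
    subprocess.run(
        ["mpirun", "-n", "2", sys.executable, str(script)],
        check=True, timeout=60,
    )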

@@ -552,7 +552,14 @@ def __getitem__(self, args):
# Perform the dataspace selection.
selection = sel.select(self.shape, args, dsid=self.id)

if selection.nselect == 0:
# If we are running in MPI mode we need to check if we are set in
# collective IO mode to ensure we actually perform all IO modes
Member:

I think the comment could use a bit of extra explanation, something like:

...so that it doesn't get stuck waiting for an I/O operation that some workers have skipped

(IDK if 'workers' is the standard term in MPI, but I trust you'll know this)

@jrs65 (Author):

Thanks. Worker would make sense, but I think "MPI process" is most clear, so I'll use that.

""" Get an iterable for broadcasting.

Setting `collective` includes zero length selections that are
required for collective operations.
Member:

Actually, looking at it some more, does the selections code need to know about collective mode at all? Shouldn't it handle empty selections correctly regardless?

I appreciate the desire to not change anything for normal I/O. But if we can avoid special-casing it as far as possible, the code is cleaner and better tested.

@jrs65 (Author):

I agree that it would be best to avoid special cases. If you're amenable to that I'll try and go that route. Hopefully the tests will illuminate any unintended problems.

Member:

Oh, I see. The obvious thing for a generator over an empty selection to do is to yield nothing. But that won't work in collective mode.

What happens if the different workers get different (non-zero) numbers of chunks from .broadcast()? Does that cause a hang because they're doing different numbers of HDF5 writes?

I feel like I'm missing something here. It shouldn't need this much code to be aware that it's doing something collectively. Maybe the answer is just to block broadcasting for collective I/O, so you have to set up the right shape of data yourself.
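
To illustrate the mechanism being discussed (a simplified sketch, not h5py's actual selections code): a generator that yields nothing for an empty selection means that rank issues zero HDF5 calls, while ranks with data issue at least one, so the collective call never lines up across ranks.

def broadcast_chunks(nchunks):
    # Simplified stand-in for Selection.broadcast: each yielded chunk
    # corresponds to one underlying HDF5 read/write call.
    if nchunks == 0:
        return            # this rank skips the collective call entirely -> hang
    for i in range(nchunks):
        yield i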

@jrs65 (Author), Apr 29, 2019:

What happens if the different workers get different (non-zero) numbers of chunks from .broadcast()? Does that cause a hang because they're doing different numbers of HDF5 writes?

That is exactly what happens.

However, empirically, just changing the clause in the .broadcast() from

-        if nchunks == 1 or (nchunks == 0 and collective):
+        if nchunks < 2:

does actually work fine. I also think it probably doesn't have any odd side effects, because when nchunks is zero the nselect == 0 check means that normal IO calls probably never get there anyway.

@jrs65 (Author):

@takluyver I'm starting to agree with your suggestion that we should forbid broadcasting in collective mode, or at least strongly restrict it. In theory you might want to do something like this:

arr = np.ones(10)
dset = fh.create_dataset('test', shape=(3, 10))
with dset.collective:
    if comm.rank == 0:
        dset[:2, :10] = arr   # 2 chunks, yields and writes twice on MPI rank=0
    else:
        dset[2:, :10] = arr  # 1 chunk, yields and writes once on MPI rank=1

which, even with either of the changes I've tried, would hang, because each rank fundamentally needs to write a different number of chunks. I think there are two options here:

  1. Disallow broadcasting when doing MPI collective writes.
  2. Before writing, communicate with all other ranks to determine how many chunks will be written, and then insert empty writes to keep the number the same across all ranks.

The latter seems like a bit of a hack.

Member:

What happens if one MPI process throws an error? Because if we tried to disallow broadcasting, at least in the obvious way, I suspect rank 0 in your example would throw an error, while rank 1 would do its 1-chunk write and then wait for a matching write from rank 0, which isn't going to happen.

This is ugly. I'm trying to think about how best to deal with it, but I'm not sure I know enough about MPI.

I also notice that we should update the docs, because the section on collective I/O doesn't even mention the with dset.collective context manager:
http://docs.h5py.org/en/stable/mpi.html#collective-versus-independent-operations

@jrs65 (Author), Apr 29, 2019:

In the way I was thinking of implementing this, I don't think it would break. You probably want to insist that each rank's file and memory shapes match exactly. So rank=1 wouldn't work here either, because it tries to write a shape=(10,) array into a shape=(1, 10) selection.

However, despite that, I'd probably have each rank do its preparation for writing, then, just before writing, have all ranks communicate whether any of them hit an error, and have them all throw an exception about the need to broadcast. Something like:

local_broadcast_error = (memshape != selshape)

# Reduce to see if *any* rank had an error
global_broadcast_error = self.comm.allreduce(local_broadcast_error, op=MPI.MAX)

if local_broadcast_error:
    raise ValueError(
        "Cannot broadcast in MPI collective mode "
        "(mem shape=%s, file selection=%s)" % (memshape, selshape)
    )

if global_broadcast_error:
    raise ValueError("Yikes. Another rank tried to broadcast during a collective write.")

# ... actually do the write now

A bit like that anyway. Let me change it again and update the PR.

Member:

Thinking about it more, I think we can do it well enough with a simple check, without exchanging messages, which means less to go wrong. It will always be possible to create a hang if you branch on if comm.rank ..., but we want to make it obvious that things like this are a problem:

dset = fh.create_dataset('test', shape=(100,))
chunk_start = int(comm.rank / comm.size * 100)
chunk_end = int((comm.rank + 1) / comm.size * 100)
with dset.collective:
    dset[chunk_start:chunk_end] = comm.rank

At the moment, that will work (inefficiently: #1067) for 4 workers, but fail for 8, because some workers will write 12 values and some 13. But if we forced people to do the broadcasting with numpy beforehand, it would work:

chunk = np.full(chunk_end - chunk_start, comm.rank)
with dset.collective:
    dset[chunk_start:chunk_end] = chunk

I think it needs to be a fairly strict check: not just that broadcast would only return one chunk, but that the dimensions of the data match those of the selection, even when those are length-1 dimensions.
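
A minimal sketch of that strict check (the function name and arguments are illustrative, not h5py's actual internals): require the in-memory data shape to match the file selection shape exactly, including length-1 dimensions, before doing a collective transfer.

import numpy as np

def check_collective_shapes(data, selection_shape):
    # Reject anything that would rely on broadcasting: the shapes must be
    # identical, length-1 dimensions included.
    data = np.asarray(data)
    if data.shape != tuple(selection_shape):
        raise TypeError(
            "Broadcasting is not supported for collective I/O: data shape %s "
            "does not match selection shape %s"
            % (data.shape, tuple(selection_shape))
        )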

@takluyver (Member)

I've had a go at implementing what I was thinking of. We'll still need to work out testing before this can be merged, though.

@aragilar (Member)

I had a look at this (in terms of testing mpi), and I'm not sure we're actually testing the codebase under MPI. I've kinda got something working (I'll push a PR once I've sorted out a few things).

@jrs65 Did you have the test_collective_write_empty_rank working, and on which commit id was it?

@scopatz (Member) commented Jun 26, 2019

Is this not testing with mpi?

@takluyver (Member)

I think it only tests building with MPI enabled; when I looked before I didn't find anything using mpirun in the tests.

@aragilar (Member)

At best, Travis is currently running an MPI build of HDF5 with an MPI build of h5py (HDF5_MPI isn't passed through by tox currently, so it's possible using CC=mpicc is enough). I'm not sure which tests are being skipped, but I've nearly got a PR ready which will show which tests are skipped (by passing -rsxX to pytest) and which runs the tests under mpirun (this showed that our MPI tests don't work under mpirun with more than one process). I'll try to finish the PR tonight.

@scopatz (Member) commented Jun 27, 2019

Thanks @aragilar!

@takluyver added the "MPI" label (Bugs related to MPI) on May 22, 2020
@aragilar (Member)

I think this needs rebasing to test with the latest MPI support (not sure if there's anything else holding this up)?

Labels: MPI (Bugs related to MPI)
4 participants