
MPI hang during collective writes due to erroneous elision of ranks with nothing to write #965

Open
greenc-FNAL opened this issue Dec 20, 2017 · 2 comments · May be fixed by #1634

@greenc-FNAL

  • Operating System (e.g. Windows 10, MacOS 10.11, Ubuntu 16.04.2 LTS, CentOS 7)
    CentOS 7-ish (Scientific Linux 7.3)
  • Python version (e.g. 2.7, 3.5)
    2.7.14
  • Where Python was acquired (e.g. system Python on MacOS or Linux, Anaconda on
    Windows)
    Institutional build (https://scisoft.fnal.gov/packages/python/)
  • h5py version (e.g. 2.6)
    2.7.1
  • numpy version
    1.13.3
  • HDF5 version (e.g. 1.8.17)
    develop@4d4eb4cd5c (verified with 1.10.1)
  • The full traceback/stack trace shown (if it appears)
    N/A

In current versions of HDF5, collective writes are available, but there is rarely a compelling reason to use them: they are faster than the equivalent set of independent writes, but the work required to ensure collectivity is often not worth the trouble. With the upcoming release, however, HDF5 offers the ability to use filters on parallel writes to a dataset, but only if those writes are collective. This means that, even when a particular rank has nothing to write, that rank must still make the HDF5 write call. Unfortunately, as the attached Python demonstration program shows, h5py appears to elide no-op writes, causing an MPI-related hang during the collective operation.

Correspondence with an HDF5 developer (attached below) confirms that the underlying HDF5 library is perfectly capable of handling no-op writes as part of a collective operation, but they do need to be passed through to it.

I'd be grateful if h5py could be enhanced to no longer elide no-op collective write operations.

To demonstrate the problem:

    mpiexec -np 4 ./demonstrate_h5py_cw_issue.py test.hdf5 4 # Works
    mpiexec -np 4 ./demonstrate_h5py_cw_issue.py test.hdf5 3 # Hangs
#!/usr/bin/env python
"""Demonstrate issue with collective writes when at least one rank has nothing to do."""

from __future__ import print_function

import argparse
import os

import h5py
import numpy as np

try:
    from mpi4py import MPI
    n_ranks = MPI.COMM_WORLD.size
    my_rank = MPI.COMM_WORLD.rank
    # Use MPI when running under more than one rank, unless overridden via WANT_MPI.
    WANT_MPI = os.environ["WANT_MPI"] if "WANT_MPI" in os.environ else (n_ranks > 1)
except ImportError:
    # Fall back to serial operation if mpi4py is not available.
    MPI = None
    n_ranks = 1
    my_rank = 0
    WANT_MPI = False

def parse_args():
    parser = argparse.ArgumentParser(description='Collective write hang demonstrator.', prefix_chars='-+')
    parser.add_argument('output', help='Name of HDF5 output file.')
    parser.add_argument('nrows', help='number_of_rows', type=int, default=n_ranks-1, nargs='?')
    return parser.parse_args()

class dummy_context_mgr:
    """No-op stand-in for dataset.collective when running without MPI."""
    def __enter__(self):
        return None
    def __exit__(self, exc_type, exc_value, traceback):
        return False

if __name__ == "__main__":

    args = parse_args()

    nrows = int(args.nrows)

    output_file = h5py.File(args.output, 'w', driver='mpio', comm=MPI.COMM_WORLD) if WANT_MPI \
                  else h5py.File(args.output, 'w')

    # Chunked layout, as would be required when filters are applied.
    dataset = output_file.create_dataset('data', shape=(nrows,), maxshape=(None,), chunks=True)

    data = np.linspace(0, nrows, num=nrows, endpoint=False, dtype=np.int32)

    iterations = 0
    while iterations * n_ranks < nrows:
        start = iterations * n_ranks + my_rank
        # Ranks that have run past the end of the dataset get a zero-length slice.
        end = start + 1 if start < nrows else start

        # When nrows is not a multiple of n_ranks, at least one rank arrives here
        # with an empty slice; h5py elides that write and the collective operation
        # hangs waiting for it.
        with dataset.collective if WANT_MPI else dummy_context_mgr():
            dataset[start:end] = data[start:end]

        iterations += 1

    output_file.close()

From: XXX@hdfgroup.org

Hi Chris,

It's unfortunate that you're encountering this issue with ranks that have nothing to contribute to the I/O operation. I specifically designed around the case of one or more ranks contributing nothing, as this special case required a good deal of refactoring. I have two tests for this functionality, one in which a single rank writes nothing and another in which every rank writes nothing to the dataset, and these tests pass on all of our testing platforms. In those tests the non-participating ranks call H5Sselect_none; however, I quickly modified the tests to have the non-participating ranks simply create a dataspace of size 0 and select an empty hyperslab before writing, and they still pass, so I'm fairly confident that HDF5 handles this case well in its various forms.

I believe it makes sense to first report this as a bug/improvement against h5py and let its maintainers investigate, as they may not yet know about this new functionality, since it hasn't been officially released. I did see an old discussion about this a few years ago, though, and I imagine they will be interested in supporting it for users. The h5py maintainers will probably also want to talk with us about the feature in order to support it correctly, but raising their awareness of the issue is a good first step.

When I get a free chance, I'll run your programs and see if I can't figure out exactly what's going on with the code. If the non-participating ranks are indeed not participating in the write call at the lower levels, then, as you experienced, MPI is going to be unhappy while the other ranks are waiting. If h5py explicitly does not let these ranks participate, I can't think of a good temporary workaround short of modifying the h5py code to allow this case.
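
For reference, a rough, untested sketch of the kind of low-level call sequence described above, in which every rank issues the collective write and ranks with nothing to contribute use an empty selection. It assumes an MPI-enabled h5py build; collective_write is a hypothetical helper written for illustration, not existing h5py API.

import h5py
import numpy as np

def collective_write(dataset, start, end, data):
    # Collectively write data[start:end] into dataset[start:end]. Ranks with
    # start == end still make the low-level write call, but with an empty
    # selection, so the collective operation can complete.
    dxpl = h5py.h5p.create(h5py.h5p.DATASET_XFER)
    dxpl.set_dxpl_mpio(h5py.h5fd.MPIO_COLLECTIVE)

    fspace = dataset.id.get_space()
    if start < end:
        fspace.select_hyperslab((start,), (end - start,))
        mspace = h5py.h5s.create_simple((end - start,))
        buf = np.ascontiguousarray(data[start:end], dtype=dataset.dtype)
    else:
        # Nothing to contribute: empty selections in both the file and memory spaces.
        fspace.select_none()
        mspace = h5py.h5s.create_simple((1,))
        mspace.select_none()
        buf = np.empty((1,), dtype=dataset.dtype)

    dataset.id.write(mspace, fspace, buf, dxpl=dxpl)

In the demonstration program above, a helper like this could take the place of the write inside the dataset.collective block, so that a rank holding an empty slice still reaches the underlying H5Dwrite call.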

@aragilar
Member

@chissg If you can create a PR which adds the necessary calls (and tests), then I'd be happy to merge it.
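
For what it's worth, here is a sketch of what such a test might look like, run under something like mpiexec -np 4 pytest. It is illustrative only, not taken from the h5py test suite, and the file-name handling assumes a shared filesystem. With the current behaviour the write below hangs; once empty selections are passed through, it should pass.

import h5py
import numpy as np
from mpi4py import MPI

def test_collective_write_with_idle_rank(tmp_path):
    # One fewer row than ranks, so exactly one rank has nothing to write.
    comm = MPI.COMM_WORLD
    nrows = comm.size - 1
    # All ranks must agree on the file name; broadcast rank 0's choice.
    fname = comm.bcast(str(tmp_path / "collective.h5") if comm.rank == 0 else None)
    with h5py.File(fname, "w", driver="mpio", comm=comm) as f:
        dset = f.create_dataset("data", shape=(nrows,), dtype="i4")
        start = comm.rank
        end = min(start + 1, nrows)  # zero-length slice on the last rank
        with dset.collective:
            dset[start:end] = np.arange(start, end, dtype="i4")
    with h5py.File(fname, "r", driver="mpio", comm=comm) as f:
        assert np.array_equal(f["data"][:], np.arange(nrows, dtype="i4"))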

@tacaswell tacaswell added this to the 2.9 milestone Feb 11, 2018
@takluyver takluyver modified the milestones: 2.9, 2.10 Oct 24, 2018
jrs65 added a commit to jrs65/h5py that referenced this issue Apr 12, 2019
Collective IO where one rank attempts to read or write a zero-length
slice causes a hang in an underlying collective call. This fix
ensures that all ranks participate in any collective IO operation even
if some operations are on zero-length slices.

This also includes some basic unit testing for collective IO that tests
for this problem.
@jrs65

jrs65 commented Apr 13, 2019

@aragilar I've made an attempt to fix this one in #1206 if you're still looking for a fix.

@takluyver takluyver removed this from the 2.10 milestone Sep 6, 2019
@takluyver takluyver added the MPI Bugs related to MPI label May 22, 2020
aragilar pushed a commit to aragilar/h5py that referenced this issue Aug 9, 2020
@aragilar aragilar linked a pull request Aug 9, 2020 that will close this issue
aragilar pushed a commit to aragilar/h5py that referenced this issue Aug 9, 2020
aragilar pushed a commit to aragilar/h5py that referenced this issue Aug 9, 2020