
Subfiling #2227

Open · wants to merge 10 commits into master
Conversation

@nadvornikjiri (Author)

Hi, following the discussion in #2194, I'm creating a pull request even though the build is currently failing, to make review more convenient.

@nadvornikjiri (Author)

Btw, the MPI markers don't change the result; I'm building with parallel HDF5 anyway. I think I'm missing something obvious, but the struct definitions should be everywhere the guide suggests.

@takluyver (Member) left a comment

I don't know at the moment what's causing it to fail for your MPI builds, I'm afraid. These comments are about cleaning it up so it can still build without MPI (or on older HDF5 versions).

Review comments (since resolved) on: h5py/h5fd.pyx, h5py/h5p.pyx, h5py/h5fd.pxd, h5py/api_types_hdf5.pxd

codecov bot commented Feb 8, 2023

Codecov Report

Base: 90.01% // Head: 86.91% // Decreases project coverage by -3.11% ⚠️

Coverage data is based on head (c3e4310) compared to base (c6262ac).
Patch coverage: 9.09% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2227      +/-   ##
==========================================
- Coverage   90.01%   86.91%   -3.11%     
==========================================
  Files          17       17              
  Lines        2394     2315      -79     
==========================================
- Hits         2155     2012     -143     
- Misses        239      303      +64     
Impacted Files Coverage Δ
h5py/_hl/files.py 77.89% <9.09%> (-10.20%) ⬇️
h5py/_hl/filters.py 77.43% <0.00%> (-15.43%) ⬇️
h5py/_hl/compat.py 73.68% <0.00%> (-10.53%) ⬇️
h5py/_hl/dims.py 95.18% <0.00%> (-0.70%) ⬇️
h5py/_hl/dataset.py 92.77% <0.00%> (-0.67%) ⬇️
h5py/_hl/datatype.py 95.00% <0.00%> (-0.66%) ⬇️
h5py/_hl/selections.py 86.91% <0.00%> (-0.59%) ⬇️
h5py/_hl/base.py 95.70% <0.00%> (-0.22%) ⬇️
h5py/_hl/group.py 96.83% <0.00%> (-0.13%) ⬇️
... and 2 more


nadvornikjiri and others added 2 commits February 10, 2023 10:17
Co-authored-by: Thomas Kluyver <takowl@gmail.com>
Co-authored-by: Thomas Kluyver <takowl@gmail.com>
@nadvornikjiri (Author)

Aha, I have found it. Subfiling is actually built only conditionally with 1.14 :) I was missing it from my HDF5 build. It needs to be built with ./configure --enable-parallel --enable-subfiling-vfd=yes.

Now, I'm trying to create some tests for the API, but I have two questions:

  1. It seems that the code works:
plist: PropFAID = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
ioc_config = plist.get_fapl_ioc()
subf_config = plist.get_fapl_subfiling()

but I can't then set subf_config.shared_cfg.stripe_count directly, because shared_cfg is a dictionary. Is that how it is supposed to work, or do I have the struct definitions wrong somewhere? Setting it by key, subf_config.shared_cfg["stripe_count"], works normally.

  2. I can't quite grasp where to find and how to use the enum types. If I want to set the ioc_selection enum not by remembering the integer values, but rather by using the constants, what is the proper way of doing it?

The definition in api_types_hdf5.pxd looks like:

ctypedef enum H5FD_subfiling_ioc_select_t:
    SELECT_IOC_ONE_PER_NODE,
    SELECT_IOC_EVERY_NTH_RANK,
    SELECT_IOC_WITH_CONFIG,
    SELECT_IOC_TOTAL

so my naive assumption would be to use it like:

h5py.H5FD_subfiling_ioc_select_t.SELECT_IOC_ONE_PER_NODE

By the way, the value of SELECT_IOC_ONE_PER_NODE from the default config is 0.

@takluyver (Member)

OK, if it's an optional feature in HDF5, we'll probably want to follow the pattern we have for the ROS3 and direct file drivers: autodetection of whether HDF5 is built with that feature, plus an environment variable that can override it. See this code:

h5py/setup_configure.py

Lines 87 to 113 in 3507819

h5_version_s = os.environ.get('HDF5_VERSION')
h5py_ros3 = os.environ.get('H5PY_ROS3')
h5py_direct_vfd = os.environ.get('H5PY_DIRECT_VFD')

if h5_version_s and not mpi and h5py_ros3 and h5py_direct_vfd:
    # if we know config, don't use wrapper, it may not be supported
    return cls(
        h5_inc, h5_lib, h5_macros, validate_version(h5_version_s), mpi,
        h5py_ros3 == '1', h5py_direct_vfd == '1')

h5_wrapper = HDF5LibWrapper(h5_lib)

if h5_version_s:
    h5_version = validate_version(h5_version_s)
else:
    h5_version = h5_wrapper.autodetect_version()

if mpi and not h5_wrapper.has_mpi_support():
    raise RuntimeError("MPI support not detected")

if h5py_ros3:
    ros3 = h5py_ros3 == '1'
else:
    ros3 = h5_wrapper.has_ros3_support()

if h5py_direct_vfd:
    direct_vfd = h5py_direct_vfd == '1'
else:
    direct_vfd = h5_wrapper.has_direct_vfd_support()

I can't then set the subf_config.shared_cfg.stripe_count directly, because shared_cfg is a dictionary?

If a Cython function callable from Python (like your shared_cfg property's __get__ method) returns a C struct directly, it's expected that Cython will convert that to a dict (docs). In this case, as there's only one interesting member in shared_cfg, I might expose that (thread_pool_size) as a property on SubfilingConfig as well.
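For illustration, a rough Cython sketch of that idea. The wrapper class name (SubfilingConfig) and the shared_cfg.stripe_count member follow what has been mentioned in this thread; the exact declarations live in this PR's h5fd.pxd/h5p.pyx, so treat this as an assumption-laden sketch rather than the PR's code:

cdef class SubfilingConfig:
    # assumes a cdef extern declaration of H5FD_subfiling_config_t with a
    # shared_cfg member that contains stripe_count, as discussed above
    cdef H5FD_subfiling_config_t conf

    property stripe_count:
        def __get__(self):
            # returning a single scalar member avoids the whole-struct -> dict conversion
            return self.conf.shared_cfg.stripe_count
        def __set__(self, value):
            self.conf.shared_cfg.stripe_count = value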

I can't quite grasp where to find and how to use the enum types.

The enum name isn't a namespace, so in Cython code you should be able to use a name like SELECT_IOC_ONE_PER_NODE directly, with no qualifiers. The api_types_hdf5.pxd definitions are included into all of our Cython modules (via defs.pxd), so all those names should work. Cython follows C in this regard.

If you want these constants to be accessible from Python code, you'll need to expose them on the relevant module, like this example:

h5py/h5fd.pyx

Lines 21 to 28 in f186e2d

MEM_DEFAULT = H5FD_MEM_DEFAULT
MEM_SUPER = H5FD_MEM_SUPER
MEM_BTREE = H5FD_MEM_BTREE
MEM_DRAW = H5FD_MEM_DRAW
MEM_GHEAP = H5FD_MEM_GHEAP
MEM_LHEAP = H5FD_MEM_LHEAP
MEM_OHDR = H5FD_MEM_OHDR
MEM_NTYPES = H5FD_MEM_NTYPES
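A sketch of doing the same for the subfiling selection enum, matching the h5fd.IOC_* names used later in this thread (illustrative, not the PR's exact diff):

IOC_ONE_PER_NODE   = SELECT_IOC_ONE_PER_NODE
IOC_EVERY_NTH_RANK = SELECT_IOC_EVERY_NTH_RANK
IOC_WITH_CONFIG    = SELECT_IOC_WITH_CONFIG
IOC_TOTAL          = SELECT_IOC_TOTAL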

@nadvornikjiri (Author)

Thanks for the pointers. I think I got it to work, but now I suspect I've broken the high-level API for the mpio driver. Could you also point me to what kind of tests I should add, so I'm sure I didn't break anything else?

@nadvornikjiri (Author)

Oh, and I forgot to add the automatic detection for the subfiling VFD; I'll add that in the next commit.

@nadvornikjiri (Author) commented Mar 8, 2023

@takluyver I still need some help - I am able to successfully run h5py with the ONE_PER_NODE IOC strategy (see docs).

But when I try to use SELECT_IOC_TOTAL or SELECT_IOC_EVERY_NTH_RANK, it fails with the following:

Traceback (most recent call last):
  File "hisscube.py", line 80, in <module>
    cli_command_invoker.execute()
  File "/home/caucau/SDSSCube/hisscube/command.py", line 14, in execute
    build_director.construct()
  File "/home/caucau/SDSSCube/hisscube/director.py", line 48, in construct
    builder.build()
  File "/home/caucau/SDSSCube/hisscube/builders.py", line 130, in build
    self._build()
  File "/home/caucau/SDSSCube/hisscube/builders.py", line 149, in _build
    with self.h5_connector as h5_connector:
  File "/home/caucau/SDSSCube/hisscube/utils/io.py", line 99, in __enter__
    self.open_h5_file()
  File "/home/caucau/SDSSCube/hisscube/utils/io.py", line 237, in open_h5_file
    truncate(self.h5_path, self.config, self.comm)
  File "/home/caucau/SDSSCube/hisscube/utils/io.py", line 77, in truncate
    f = h5py.File(h5_path, 'w', driver='mpio', comm=comm, libver="latest",
  File "/home/caucau/SDSSCube/venv_par/lib/python3.8/site-packages/h5py/_hl/files.py", line 614, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/home/caucau/SDSSCube/venv_par/lib/python3.8/site-packages/h5py/_hl/files.py", line 265, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 126, in h5py.h5f.create
OSError: Unable to synchronously create file (open failed)

gdb cannot latch onto any exception as it seems this OSError is handled by h5py and not rethrown. How can I debug this without stepping through the whole file creation? I'm not sure I would be able to pinpoint it that way anyway.

It might also be a bug in 1.14, as this is quite recent functionality.

Thanks for help!

@ajelenak (Contributor) commented Mar 8, 2023

You can get a better libhdf5 error stack trace by calling h5py._errors.unsilence_errors() somewhere before the code that triggers the error.
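For example, a minimal sketch of how that might look (the file name, driver arguments and libver come from the traceback above and are only placeholders):

from mpi4py import MPI
import h5py
import h5py._errors

# Let libhdf5 print its full error stack instead of h5py silencing it,
# so the HDF5-DIAG trace is shown when file creation fails.
h5py._errors.unsilence_errors()

# ...then run the failing call, e.g. the file creation from the traceback above
f = h5py.File("results/SDSS_cube_sub.h5", "w", driver="mpio",
              comm=MPI.COMM_WORLD, libver="latest")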

Comment on lines +407 to +408
alignment_threshold=1, alignment_interval=1, meta_block_size=None,
ioc_selection=None, stripe_size=None, stripe_count=None, ioc_thread_pool_size=None, **kwds):
Review comment (Member):

Generally the driver-specific options go into **kwds rather than being explicit at this level.

@takluyver (Member)

@aragilar can you take a look, as this relates to MPI?

@nadvornikjiri (Author)

@ajelenak The result of unsilence_errors is the following:

HDF5 Subfiling-DIAG: Error detected in HDF5 Subfiling (1.0.0) MPI-process 0:
  #000: H5FDsubfiling/H5FDsubfiling.c line 1141 in H5FD__subfiling_open(): unable to open IOC file
    major: Virtual File Layer
    minor: Unable to open file
  #001: H5FDsubfiling/H5FDioc.c line 874 in H5FD__ioc_open(): one or more MPI ranks were unable to open file 'results/SDSS_cube_sub.h5'
    major: File accessibility
    minor: Unable to open file
  #002: H5FDsubfiling/H5FDioc.c line 831 in H5FD__ioc_open(): unable to open subfiles for file 'results/SDSS_cube_sub.h5'
    major: File accessibility
    minor: Unable to open file
  #003: H5FDsubfiling/H5subfiling_common.c line 679 in H5_open_subfiles(): one or more IOC ranks couldn't open subfiles
    major: Virtual File Layer
    minor: Unable to open file
  #004: H5FDsubfiling/H5subfiling_common.c line 619 in H5_open_subfiles(): couldn't initialize subfiling context
    major: Virtual File Layer
    minor: Unable to initialize object
  #005: H5FDsubfiling/H5subfiling_common.c line 851 in init_subfiling(): couldn't initialize application topology
    major: Virtual File Layer
    minor: Unable to initialize object
  #006: H5FDsubfiling/H5subfiling_common.c line 1023 in init_app_topology(): invalid IOC selection strategy
    major: Virtual File Layer
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.14.0) MPI-process 0:
  #000: H5F.c line 660 in H5Fcreate(): unable to synchronously create file
    major: File accessibility
    minor: Unable to create file
  #001: H5F.c line 614 in H5F__create_api_common(): unable to create file
    major: File accessibility
    minor: Unable to open file
  #002: H5VLcallback.c line 3605 in H5VL_file_create(): file create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #003: H5VLcallback.c line 3571 in H5VL__file_create(): file create failed
    major: Virtual Object Layer
    minor: Unable to create file
  #004: H5VLnative_file.c line 94 in H5VL__native_file_create(): unable to create file
    major: File accessibility
    minor: Unable to open file
  #005: H5Fint.c line 1815 in H5F_open(): unable to open file: name = 'results/SDSS_cube_sub.h5', tent_flags = 13
    major: File accessibility
    minor: Unable to open file
  #006: H5FD.c line 769 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object

The problem is that it thinks the IOC strategy is:

print ioc_selection_type
$1 = SELECT_IOC_WITH_CONFIG

But I'm explicitly setting the constant here:

    if ioc_selection:
        if ioc_selection == "one_per_node":
            ioc_selection_enum = h5fd.IOC_ONE_PER_NODE
        elif ioc_selection == "every_nth_rank":
            ioc_selection_enum = h5fd.IOC_EVERY_NTH_RANK
        elif ioc_selection == "total":
            ioc_selection_enum = h5fd.IOC_TOTAL

So how come they are mismatched?

@ajelenak (Contributor) commented Mar 9, 2023

I would first print the h5fd.IOC_... values to see if they match those in the libhdf5 header file. Next, insert a print statement just before the call to the HDF5 function in the Cython code to see the value of the constant. You should also debug the Python code to see the exact execution flow when setting the ioc variables.

Comment on lines +196 to +205
if ioc_selection:
    if ioc_selection == "one_per_node":
        ioc_selection_enum = h5fd.IOC_ONE_PER_NODE
    elif ioc_selection == "every_nth_rank":
        ioc_selection_enum = h5fd.IOC_EVERY_NTH_RANK
    elif ioc_selection == "total":
        ioc_selection_enum = h5fd.IOC_TOTAL
    else:
        raise NotImplementedError("Unsupported IO concentrator allocation mode.")
    subf_config.ioc_selection = ioc_selection_enum
Review comment (Member):

Minor thing, but it might be nicer to express this as a (Python stdlib) Enum, rather than switching on strings?

Reply (Author):

@aragilar Thanks for the style comment, though if you mean the Python Enum, that one is never used within the h5py library, so I'd rather go for consistency than cleanliness. Others might suggest better, but the most common pattern I see is something like this:

strategies = {
    'fsm': h5f.FSPACE_STRATEGY_FSM_AGGR,
    'page': h5f.FSPACE_STRATEGY_PAGE,
    'aggregate': h5f.FSPACE_STRATEGY_AGGR,
    'none': h5f.FSPACE_STRATEGY_NONE
}
fs_strat_num = strategies.get(fs_strategy, -1)
if fs_strat_num == -1:
    raise ValueError("Invalid file space strategy type")

Anyway I expect that there will be more cleaning up once the code is working :).
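For illustration, the same lookup pattern applied to this PR's ioc_selection strings might look roughly like this (a sketch; the h5fd.IOC_* constants and the error message are the ones already shown above):

ioc_strategies = {
    'one_per_node': h5fd.IOC_ONE_PER_NODE,
    'every_nth_rank': h5fd.IOC_EVERY_NTH_RANK,
    'total': h5fd.IOC_TOTAL,
}
if ioc_selection not in ioc_strategies:
    raise NotImplementedError("Unsupported IO concentrator allocation mode.")
subf_config.ioc_selection = ioc_strategies[ioc_selection]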

@aragilar (Member)

I'm not seeing anything that concerns me with respect to MPI (the main thing would be the minor code style stuff that's already been mentioned).

On the SELECT_IOC_TOTAL and SELECT_IOC_EVERY_NTH_RANK issue, how many worker processes are you running, and what do you have H5FD_SUBFILING_IOC_SELECTION_CRITERIA set to? I've not looked at the HDF5 code, but it would not surprise me if you need to be careful to align the two values to ensure you only get whole integers (given you've got the simple case of SELECT_IOC_ONE_PER_NODE working, and I'm guessing you're testing on one node?).

@nadvornikjiri (Author)

SELECT_IOC_ONE_PER_NODE is working on multiple nodes as well (as long as I don't set H5FD_SUBFILING_IOC_SELECTION_CRITERIA). Now I'm using 4 processes and H5FD_SUBFILING_IOC_SELECTION_CRITERIA=2 when testing both SELECT_IOC_TOTAL and SELECT_IOC_EVERY_NTH_RANK, which should result in the same number of I/O concentrators.

As mentioned above, the main problem is that the enum defined within api_types_hdf5.pxd:

ctypedef enum H5FD_subfiling_ioc_select_t:
    SELECT_IOC_ONE_PER_NODE = 0,
    SELECT_IOC_EVERY_NTH_RANK,
    SELECT_IOC_WITH_CONFIG,
    SELECT_IOC_TOTAL

while printing reasonable enum values within Python:

        print("h5fd.IOC_ONE_PER_NODE: %d" % h5fd.IOC_ONE_PER_NODE)
        print("h5fd.IOC_EVERY_NTH_RANK: %d" % h5fd.IOC_EVERY_NTH_RANK)
        print("h5fd.IOC_TOTAL: %d" % h5fd.IOC_TOTAL)
h5fd.IOC_ONE_PER_NODE: 0
h5fd.IOC_EVERY_NTH_RANK: 1
h5fd.IOC_TOTAL: 3

does not end up with the correct value within H5FDsubfiling_common.c, where I land at line 1021:

    switch (ioc_selection_type) {
...
        case SELECT_IOC_WITH_CONFIG:

The interesting part is that only the SELECT_IOC_ONE_PER_NODE works correctly, using anything else (SELECT_IOC_TOTAL or SELECT_IOC_EVERY_NTH_RANK) always ends up with:

Thread 1 "python" hit Breakpoint 1, init_app_topology (
    subfiling_config=0x26aedc0, comm=0x29ad970, node_comm=0x2a14f90, 
    app_topology_out=0x7fffffffb100) at H5FDsubfiling/H5subfiling_common.c:941
941         switch (ioc_selection_type) {
(gdb) print ioc_selection_type
$1 = SELECT_IOC_WITH_CONFIG

Sadly, I'm not sure how to print the enum's numeric value within gdb; it doesn't seem to support that out of the box.

@nadvornikjiri (Author)

Aha, one more chapter to this novel.

The enumerations are being passed OK; this prints correctly inside the init_subfiling method:

print subfiling_config->ioc_selection
$3 = SELECT_IOC_EVERY_NTH_RANK

Stay tuned for the next episode...

@nadvornikjiri (Author)

Alright, I figured out the parsing of the enums/environment variables.

There is a small mismatch (or at least I misunderstood it) between what the HDF5 documentation says should go into H5FD_SUBFILING_IOC_SELECTION_CRITERIA and what the code expects. If you dig deeper, this is the comment in the code that actually explains it :).

        /*
         * For non-default options, the environment variable
         * should have the following form:  integer:[integer|string]
         * In particular, EveryNthRank == 1:64 or every 64 ranks assign an IOC
         * or WithConfig == 2:/<full_path_to_config_file>
         */

This was unexpected for me, as it actually wants me to repeat the enum ID of the IOC selection strategy in the value, which is rather strange :).

Anyway I still think there is a bug in H5subfiling_common.c at line 1602:

            /* Set final number of I/O concentrators after adjustments */
            app_topology->n_io_concentrators = ioc_next;

For the simplest case, if I have just one rank, then the code a couple of lines above it

if (ioc_next + 1 >= max_iocs)
    break;

breaks on the first loop iteration and reports the number of I/O concentrators as 0, since the index of the first one is 0. The same holds if I have n concentrators: this number will always be n-1. Since the test is if (ioc_next + 1 >= max_iocs), n_io_concentrators should also be set to ioc_next + 1.

@aragilar could you please confirm whether my conclusion is correct? Because if it is, SELECT_IOC_EVERY_NTH_RANK could not have been working in 1.14.0.

@aragilar (Member)

Ah, I hadn't tested any of the code (either yours or the HDF5 subfiling code). I looked at the RFC to understand what subfiling was aiming to do, looked at the HDF5 docs, and looked at this PR. As I believe the HDF5 codebase is developed in public on GitHub, you may be able to look at what testing was done on the original PR, in case there happened to be a test setup that hid the bug, or in case this is expected behaviour (and rather the documentation needs clarifying).

I'll try to look at this in more detail tomorrow.

@nadvornikjiri (Author)

Hmm, it indeed seems that the tests were written only for the default of 1 IOC per node; this is the only one I could find.

@jhendersonHDF

Hi All,

Unfortunately, this is mostly a case of "half-baked feature options we were considering removing". We were planning to evaluate whether options like "select an I/O Concentrator every N ranks" are useful compared to the default "one per node", where the number of I/O Concentrators can be scaled up with the H5FD_SUBFILING_IOC_PER_NODE environment variable, but didn't get around to that yet. Since our performance testing was mostly done with the SELECT_IOC_ONE_PER_NODE option, the other options didn't end up getting any love or regression testing, and it appears I partially broke them at some point in the past. Thanks for reporting these findings though; I plan to fix the issues from this thread and hopefully tidy up the VFD's testing and documentation for the 1.14.1 release. A couple of other things:

The previously mentioned comment:

    /*
     * For non-default options, the environment variable
     * should have the following form:  integer:[integer|string]
     * In particular, EveryNthRank == 1:64 or every 64 ranks assign an IOC
     * or WithConfig == 2:/<full_path_to_config_file>
     */

is somewhat leftover cruft (especially the "WithConfig" part since we don't support a configuration file option yet). The code does check for selection criteria being in that form, but for the most part specifying a single number is supposed to work. For example, setting H5FD_SUBFILING_IOC_SELECTION_CRITERIA=5 for the SELECT_IOC_TOTAL selection type is meant to specify 5 total I/O Concentrators. However, there's also a bug I just noticed where specifying things that way has the unintended side effect of modifying the selection type to that value as well, which is why one could end up hitting the error above of "invalid IOC selection strategy".

I believe the intention of the original author of the VFD was that having the form of integer:[integer|string] for the criteria would allow one to change both the selection strategy and criteria for that strategy on the fly with a single environment variable, but I don't think that was well documented. Since there's also currently no field in the H5FD_subfiling_config_t configuration structure to set the criteria for a selection strategy, there's a bit of an awkward disconnect where one might set the selection strategy for the VFD in its configuration structure, but then be required to set criteria for that strategy with the H5FD_SUBFILING_IOC_SELECTION_CRITERIA environment variable.

We should really add that field to the configuration structure, but could anyone here weigh in on whether it would be more useful to keep the current environment variable string format and document it better so that the selection strategy and its criteria are together in a single environment variable, or whether it makes sense to keep two separate environment variables for this purpose?
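To make the two accepted forms concrete, a small illustration (a sketch only; the enum values are the ones printed earlier in this thread, and the single-value form is the one described above):

import os

# strategy_enum:criteria form - strategy 1 (SELECT_IOC_EVERY_NTH_RANK), one IOC every 64 ranks
os.environ["H5FD_SUBFILING_IOC_SELECTION_CRITERIA"] = "1:64"

# single-value form - with SELECT_IOC_TOTAL already chosen in the FAPL, ask for 5 IOCs in total
os.environ["H5FD_SUBFILING_IOC_SELECTION_CRITERIA"] = "5"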

Also, I agree with @nadvornikjiri that there's a bug (more likely multiple) in the MPI rank selection logic for SELECT_IOC_TOTAL and SELECT_IOC_EVERY_NTH_RANK in H5subfiling_common.c when using a single MPI rank or really any number of ranks.

@nadvornikjiri (Author)

@jhendersonHDF Hi, thanks so much for looking into this. Could we also do some hot-fix to make either of the other strategies (every Nth rank or total) work, so I can test multiple IOCs per node?

We are now writing an article for the Supercomputing conference, which means we need the results within 1-2 weeks to consolidate them. We already have the results for one IOC per node, but due to some aspects of the underlying Lustre filesystem we are getting mediocre results with that...

Maybe we could also have a call on what parameters would give the best performance with our setup?

@jhendersonHDF

Hi @nadvornikjiri,

Here's a patch (.txt because GitHub doesn't seem to like a .patch extension) against the HDF5 1.14.0 release that hopefully fixes the issues with the different selection strategies. With those changes you should be able to specify values in the H5FD_SUBFILING_IOC_SELECTION_CRITERIA environment variable as either a single value, or the old strategy_enum_value:criteria form without issues. Also note that when testing the default SELECT_IOC_ONE_PER_NODE strategy you should already be able to scale it up by setting a value in the H5FD_SUBFILING_IOC_PER_NODE environment variable. For example, setting it to '2' would use 2 I/O Concentrators per machine node rather than the default of 1. Eventually, we should integrate that value into the H5FD_SUBFILING_IOC_SELECTION_CRITERIA environment variable, but for now it's a separate one.

As for parameters for Subfiling: if you don't already, you should probably use H5Pset_alignment to set the alignment of objects in the file to be equal to the stripe size on your Lustre system, so that accesses are well-aligned. You should also probably enable collective metadata I/O with H5Pset_all_coll_metadata_ops and H5Pset_coll_metadata_write to try to improve metadata I/O, as long as your application's access pattern can support collective metadata I/O without causing problems.

Note that we've typically seen that the Subfiling VFD is very good for writing data, but may have a few inefficiencies when it comes to reading data. Until we investigate those issues, it may be that the VFD is most useful for writing out large data and then fusing the subfiles back into a regular HDF5 file with the h5fuse tool before reading data with a different VFD, such as HDF5's MPI I/O VFD. I'd be happy to find time for a call with you and your team though if you find that you're still having performance issues with the VFD.

@jhendersonHDF

See also HDFGroup/hdf5#2571

@nadvornikjiri (Author)

@jhendersonHDF I have tried the ONE_IOC_PER_NODE strategy when setting the H5FD_SUBFILING_IOC_PER_NODE variable. I am not using the alignment yet, more on that later.

The results for this setup:
subfiles:192
subfile_stripe_size:64 MB
Lustre stripe_count: 1
Lustre stripe_size: 64 MB
32 IOC threads

are:

Nodes	Workers	IOC	Rate/s	Efficiency	Read [MB/s]	Write Bandwidth [MB/s]	Total IO [MB/s]
6	765	1	191.25	22%			2224.8			5908.6			8133.4
6	765	2	237.77	27%			2765.9			7345.8			10111.7
6	765	4	276.12	31%			3212.1			8530.6			11742.7
6	765	8	277.56	32%			3228.8			8575.1			11803.9

So it's scaling a little bit.. but not really.

Could you please give me a hint on the alignment threshold and alignment parameters described here for the setup above? It's still not clear to me whether they are bytes, intervals, or what they really are.

Thanks.

@jhendersonHDF

@nadvornikjiri In our testing we mostly found that 1 IOC per node with 4 I/O threads performed the best, with 2 IOCs per node coming close but generally not outperforming the former case. Of course that may differ depending on the hardware, but those were the results from a few different HPC machines.

For the parameters to H5Pset_alignment, both the threshold value and the alignment value are in terms of bytes. The threshold value specifies that only objects in the file which are equal to or larger than "threshold bytes" will be aligned in the file. The alignment value specifies the alignment of those objects in terms of bytes, which the library enforces by adding padding when allocating space for objects in the file. As an example, if I were to create a chunked dataset in an HDF5 file with a chunk size of 4MiB and I was using a Lustre system with a stripe size of 1MiB, I'd usually use H5Pset_alignment(fapl_id, 4194304, 1048576) so that only the dataset's chunks (or anything 4MiB or larger in the file) will get aligned according to the 1MiB Lustre stripe size. In your case I'd probably use either 1048576 or 67108864 for the alignment value (though note the latter might possibly waste a lot of space in the file) and then something reasonable for the threshold based on whether your datasets are chunked, how big they are etc. If unsure, a somewhat safe bet for making just your dataset's raw data be aligned would probably be something like 4096 for the threshold. You could also set the threshold value to 0 so that everything in the file is aligned on your chosen alignment value, but that will likely waste more file space than desired.
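In h5py terms, the H5Pset_alignment(fapl_id, 4194304, 1048576) example above can be expressed through the alignment keywords on h5py.File (they also appear in this PR's signature earlier in the thread); the file name below is just a placeholder:

import h5py

# A sketch assuming a 1 MiB Lustre stripe size and 4 MiB dataset chunks:
# only objects of 4 MiB or more get aligned, on 1 MiB boundaries.
f = h5py.File("data.h5", "w",
              alignment_threshold=4 * 1024 * 1024,  # 4 MiB threshold, in bytes
              alignment_interval=1024 * 1024)       # 1 MiB alignment, in bytes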

Also, I don't know how large your files are, but I'm wondering if the Lustre stripe count of 1 is possibly holding you back on performance?

@nadvornikjiri (Author)

@jhendersonHDF Wow, thanks for the alignment setting tip - this actually resulted in roughly a 4x speedup even for the MPIO driver. Right now we are running on 6 nodes, stripe count 48, stripe size 4 MB:

Nodes	Workers	IOC per node	IOC Threads	Rate/s	Efficiency	Read [MB/s]	Write Bandwidth [MB/s]	Total IO [MB/s]	Write efficiency
6	765	MPIO				782.59	89%		9103.7		24177.8			33281.5		39%
6	765	1		64		793.12	90%		9226.2		24503.1			33729.3		39%
6	765	2		64		826.61	94%		9615.8		25537.7			35153.5		41%
6	765	1		4		825.22	94%		9599.6		25494.8			35094.4		41%
6	765	2		4		827.48	94%		9625.9		25564.6			35190.5		41%

The write efficiency is relative to file-per-process as measured by IOR, which gives 10404 MB/s per node.

So the result is that the IOC count is still not scaling, but our use case and setup also have one more restriction, introduced perhaps by a wrong Lustre setting on the HPC, which I try to explain here, though maybe the call would be better. Let's have a PM regarding the time for that one.

Lustre restriction - odd scaling on one node:

Method			Writers per node	Nodes	Writers	Stripe count	Average bandwidth	Bandwidth / node
File per process	32			8	256	32		67214			8401.75
Shared file		32			1	32	32		868.44			868.44
Shared file		1			1	1	1		920.53			920.53
Shared file		1			1	1	2		915.83			915.83
Shared file		2			1	2	1		1236.13			1236.13
Shared file		2			1	2	2		1239.72			1239.72
Shared file		4			1	4	1		1180.49			1180.49
Shared file		4			1	4	2		1178.48			1178.48
Shared file		8			1	8	1		1014.99			1014.99
Shared file		8			1	8	2		1020.35			1020.35
Shared file		1			2	2	1		1562			781.00
Shared file		1			4	4	1		3827.38			956.85
Shared file		1			8	8	1		6238.3			779.79
Shared file		1			2	2	2		1722.18			861.09
Shared file		1			4	4	2		3470.51			867.63
Shared file		1			8	8	2		6987.28			873.41

This shows that the Lustre write bandwidth is not scaling with the number of processes on one node, nor, more importantly, with the stripe count on one node. Interestingly enough, it does scale with multiple nodes (without increasing the stripe count!), so when adding more nodes it scales almost linearly. The HPC support has escalated the issue to the HPE vendor. But this is the reason I was testing with stripe count 1, as having multiple files puts them on different OSTs anyway. Our working file size is 30 TB.

The funny part is that my HDF5 run, with the alignment set equal to the Lustre stripe size, is running ~4x faster (4 GB/s per-node write bandwidth) than the POSIX IOR test. Why that is I still need to figure out... maybe another topic for our call.

This is the IOR configuration it's running with:

Options: 
api                 : POSIX
apiVersion          : 
test filename       : testFile
access              : single-shared-file
type                : independent
segments            : 32
ordering in a file  : sequential
ordering inter file : no tasks offsets
nodes               : 8
tasks               : 8
clients per node    : 1
memoryBuffer        : CPU
dataAccess          : CPU
GPUDirect           : 0
repetitions         : 10
xfersize            : 4 MiB
blocksize           : 64 MiB
aggregate filesize  : 16 GiB

@nadvornikjiri (Author)

Alright @jhendersonHDF, I have done some extensive testing and realized these good results were a false positive; there was a bug in my code that caused them.

So, back to square one: both MPIO and Subfiling are back to a crawl - a write bandwidth of roughly 500 MB/s per node. So it is comparable with the IOR single shared-file write, even though it's at around 50% of the efficiency of that one as well...

I have found there is an environment variable HDF5_USE_FILE_LOCKING=FALSE, but using it doesn't seem to change anything... Is there a way to check the Lustre file-locking setting that was used for opening the HDF5 file directly?

Thanks for the heads-up and sorry for the confusion.

@jhendersonHDF

Hi @nadvornikjiri, the HDF5_USE_FILE_LOCKING environment variable is for controlling file locking in the sec2, core, direct, stdio and log VFDs. The HDF5 MPI I/O and Subfiling VFDs don't pay attention to the environment variable because they don't directly perform any file locking/unlocking. Presumably this is done by MPI in the case of the MPI I/O VFD. In the case of the Subfiling VFD, no locks are currently placed on the subfiles and the HDF5 "stub" file is only opened on MPI rank 0 through use of the MPI I/O VFD, so there are likely no locks placed on it either. Debugging this would probably need to make use of any facilities provided by MPI or Lustre.

@nadvornikjiri (Author)

Thank you for the explanation. I have some updates on the testing of the file driver, if you are interested (with Scott again). I am able to get a little better performance out of it, but I still have some questions. And I have a test case for you where the read performance really degrades, if you are interested; see how the MPIO driver compares to Subfiling with 1 IOC, 4 threads, 4 MB striping below:

MPIO:
(screenshot of benchmark results)

IOC:
(screenshot of benchmark results)

The Global DB query is basically querying the "database" of spectra and images in the HDF5 file in parallel, creating an aggregated dataset of all spectra and their overlapping image regions, which are just dereferenced from the stacked image dataset. So it is many small read and slicing operations on 2 huge (but chunked) datasets.

@ajelenak (Contributor)

As a casual onlooker of this discussion, I have a question: is this PR ready to merge to master as far as supporting the subfiling feature in h5py is concerned? The discussion seems to have been focusing on performance for some time now. Perhaps a better place for that would be the HDF Forum?

@nadvornikjiri (Author)

Hi @ajelenak, good point, this is indeed more focused on HDF5 itself. From my point of view, the pull request is ready for your code review; please let me know whether, and what, I need to change to get it through. Functionally, it is already complete.

@nadvornikjiri (Author)

Hi guys, could you please advise what the next steps will be? I'd like to drive this one home...

@aragilar (Member)

I think the main thing would be to work out why the CI is failing. I suspect it's because subfiling isn't enabled, so you'll need to modify setup_configure.py and setup_build.py to add that as an option (like MPI and ROS3). Once the CI is working, I would suggest adding a build that includes subfiling to the CI, so it can be tested. Some MPI tests are at https://github.com/h5py/h5py/blob/master/h5py/tests/test_file.py#L877 (and see the rest of that file), if that helps with working out how to write some tests.
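A rough sketch of what that could look like in setup_configure.py, following the ROS3/direct-VFD pattern quoted earlier in this thread. The H5PY_SUBFILING variable and the has_subfiling_support() helper are hypothetical names, not existing h5py code:

h5py_subfiling = os.environ.get('H5PY_SUBFILING')

if h5py_subfiling:
    # explicit override from the environment, as for H5PY_ROS3/H5PY_DIRECT_VFD
    subfiling = h5py_subfiling == '1'
else:
    # hypothetical helper mirroring has_ros3_support()/has_direct_vfd_support()
    subfiling = h5_wrapper.has_subfiling_support()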
