Subfiling #2227
base: master
Conversation
Btw, the MPI markers don't change the result; I'm building with parallel HDF5 anyway. I think I am missing something obvious, but the struct definitions should be everywhere the guide suggests.
I don't know at the moment what's causing it to fail for your MPI builds, I'm afraid. These comments are about cleaning it up so it can still build without MPI (or on older HDF5 versions).
Codecov Report: Base: 90.01% // Head: 86.91% // Decreases project coverage by -3.11%.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #2227      +/-   ##
==========================================
- Coverage   90.01%   86.91%    -3.11%
==========================================
  Files          17       17
  Lines        2394     2315       -79
==========================================
- Hits         2155     2012      -143
- Misses        239      303       +64
```

☔ View full report at Codecov.
Co-authored-by: Thomas Kluyver <takowl@gmail.com>
Aha, I have found it. Subfiling is only built conditionally in 1.14 :) I was actually missing it from the HDF5 build. It needs to be built with the subfiling feature enabled (the `HDF5_ENABLE_SUBFILING_VFD` CMake option). Now, I'm trying to create some tests for the API, but I have two questions:
… but I can't then set the …

The definition in api_types_hdf5.pxd looks like:

…

so my naive assumption would be to use it like:

…

By the way, the value of …
OK, if it's an optional feature in HDF5, we'll probably want to follow the pattern we have for the ROS3 and direct file drivers: autodetection of whether HDF5 is built with that feature, plus an environment variable that can override it. See this code: Lines 87 to 113 in 3507819
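For illustration, here is a minimal sketch of that autodetect-with-override pattern; the `H5PY_SUBFILING` environment variable and both helper parameters are hypothetical, not the actual names used in h5py's build configuration:

```python
# Minimal sketch of autodetection plus env-var override; the variable
# name H5PY_SUBFILING and the helper parameters are hypothetical.
import os

def should_build_subfiling(hdf5_version, hdf5_has_subfiling):
    """Return True if the subfiling VFD bindings should be compiled."""
    override = os.environ.get("H5PY_SUBFILING")  # hypothetical override
    if override is not None:
        return override not in ("0", "", "false", "False")
    # Autodetect: the feature only exists in HDF5 >= 1.14 builds that
    # were configured with the subfiling VFD enabled.
    return hdf5_version >= (1, 14, 0) and hdf5_has_subfiling
```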
If a Cython function callable from Python (like your …) …
The enum name isn't a namespace, so in Cython code you should be able to use a name like … directly.

If you want these constants to be accessible from Python code, you'll need to expose them on the relevant module, like this example: Lines 21 to 28 in f186e2d
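As a rough sketch of that exposure pattern (the `SELECT_IOC_*` member names come from HDF5's subfiling header; the `IOC_*` constant names are assumptions), the module-level re-exports in the `.pyx` would look something like:

```python
# Sketch, as it would appear in h5py's h5fd.pyx: re-export the C enum
# members as module-level Python ints. SELECT_IOC_* are the C enum
# member names from HDF5's subfiling header; the IOC_* names here are
# assumed for illustration.
IOC_ONE_PER_NODE   = SELECT_IOC_ONE_PER_NODE
IOC_EVERY_NTH_RANK = SELECT_IOC_EVERY_NTH_RANK
IOC_TOTAL          = SELECT_IOC_TOTAL
```

Python code can then reach them as `h5py.h5fd.IOC_ONE_PER_NODE`, which is how the string-to-enum mapping later in this PR refers to them.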
… need to add tests now.
for more information, see https://pre-commit.ci
Thanks for pointing me there. I think I got it to work, but now I'm suspecting I broke the high-level API for the mpio driver. Could you also point me to what kind of tests I should add, so I can be sure I didn't break anything else?
Oh, and I forgot to add the automatic detection for the subfiling VFD; I'll add that with the next commit.
@takluyver I still need some help - I am able to successfully run h5py with the ONE_PER_NODE IOC strategy (see docs). But when I try to use the SELECT_IOC_TOTAL or SELECT_IOC_EVERY_NTH_RANK strategy, it fails with the following:

…
gdb cannot latch onto any exception, as it seems this OSError is handled by h5py and not rethrown. How can I debug this without stepping through the whole file creation? I'm not sure I would be able to pinpoint it that way anyway. It might also be a bug in 1.14, as this is quite recent functionality. Thanks for the help!
You can get a better libhdf5 error stack trace with `h5py._errors.unsilence_errors()`.
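For instance (a minimal sketch; `unsilence_errors` lives in h5py's private `_errors` module):

```python
# Re-enable libhdf5's native error reporting, which h5py silences by
# default, so the full C error stack is printed instead of a bare OSError.
import h5py
h5py._errors.unsilence_errors()
# Now reproduce the failing file creation; libhdf5 dumps its internal
# error stack to stderr, showing which C call rejected the configuration.
```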
```python
alignment_threshold=1, alignment_interval=1, meta_block_size=None,
ioc_selection=None, stripe_size=None, stripe_count=None, ioc_thread_pool_size=None, **kwds):
```
Generally the driver-specific options go into `**kwds` rather than being explicit at this level.
@aragilar can you take a look, as this relates to MPI?
@ajelenak The result of unsilence_errors is the following:

…

The problem is that it thinks the IOC strategy is:

…

But I'm explicitly saying use the constant from here:

…

So how come they are mismatched?
I would first print …
```python
if ioc_selection:
    if ioc_selection == "one_per_node":
        ioc_selection_enum = h5fd.IOC_ONE_PER_NODE
    elif ioc_selection == "every_nth_rank":
        ioc_selection_enum = h5fd.IOC_EVERY_NTH_RANK
    elif ioc_selection == "total":
        ioc_selection_enum = h5fd.IOC_TOTAL
    else:
        raise NotImplementedError("Unsupported IO concentrator allocation mode.")
    subf_config.ioc_selection = ioc_selection_enum
```
Minor thing, but it might be nicer to express this as a (Python stdlib) Enum, rather than switching on strings?
@aragilar Thanks for the style comment, though if you mean the Python Enum, that one is never used within the h5py library, so I'd rather go for consistency than cleanliness. Others might suggest better, but what I see most commonly is something like this:
```python
strategies = {
    'fsm': h5f.FSPACE_STRATEGY_FSM_AGGR,
    'page': h5f.FSPACE_STRATEGY_PAGE,
    'aggregate': h5f.FSPACE_STRATEGY_AGGR,
    'none': h5f.FSPACE_STRATEGY_NONE
}
fs_strat_num = strategies.get(fs_strategy, -1)
if fs_strat_num == -1:
    raise ValueError("Invalid file space strategy type")
```
Anyway I expect that there will be more cleaning up once the code is working :).
I'm not seeing anything that concerns me with respect to MPI (the main thing would be the minor code style stuff that's already been mentioned). On the …
The …

As mentioned above, the main problem is that the enum defined within …, while printing reasonable enum values within Python:

…

is not using correct values at the end of the line within …

The interesting part is that only the …

Sadly, I'm not sure how to print the enum value within gdb; it doesn't seem to support it out of the box.
Aha, one more chapter to this novel. The enumerations are being passed OK; this prints correctly inside the init_subfiling method:

…

Stay tuned for the next episode.
Alright, I figured out the parsing of the enums/environment variables. There is a small mismatch (or I at least misunderstood it) between the HDF5 documentation, which says … should be in the …

This was unexpected for me, as it actually wants me to repeat the enum ID of the IO concentrator in the config, which is rather strange :). Anyway, I still think there is a bug in H5subfiling_common.c at line 1602:

…
For the simplest case: if I have just one rank, then the code a couple of lines above it

…

breaks on the first loop and says the number of IO concentrators is 0 if the index of the first one is 0. The same holds if I have n concentrators: this number will always be n-1. Since I'm testing … @aragilar could you please confirm whether my conclusion is correct? Because if it is, …
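To make the suspected off-by-one concrete, here is a toy Python illustration of the symptom described above (this is not the actual HDF5 C code, just a sketch of counting concentrators by the last assigned zero-based index instead of index + 1):

```python
# Toy illustration (not the HDF5 source): if the IO concentrator count
# is derived from the highest zero-based index seen, rather than
# index + 1, one concentrator at index 0 yields a count of 0, and
# n concentrators yield n - 1.
def count_iocs_buggy(ioc_indices):
    count = 0
    for idx in ioc_indices:
        count = idx          # bug: should be idx + 1
    return count

assert count_iocs_buggy([0]) == 0          # one IOC reported as zero
assert count_iocs_buggy([0, 1, 2]) == 2    # n IOCs reported as n - 1
```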
Ah, I hadn't tested any of the code (either yours or the HDF5 subfiling code). I looked at the RFC to understand what subfiling was aiming to do, looked at the HDF5 docs, and looked at this PR. As the HDF5 codebase is developed in public on GitHub, you may be able to look at what testing was done on the original PR, in case there happened to be a test setup that hid the bug, or the behaviour is expected (and rather the documentation needs clarifying). I'll try to look at this in more detail tomorrow.
Hmm, it seems indeed that the tests were written only for the default 1 IOC per node; this is the only one I could find.
Hi all, unfortunately this is mostly a case of "half-baked feature options we were considering removing". We were planning to evaluate whether options like "select an I/O Concentrator every N ranks" were useful compared to the default "one per node", where the number of I/O Concentrators can be scaled up with … The comment previously mentioned:

…

is somewhat leftover cruft (especially the "WithConfig" part, since we don't support a configuration file option yet). The code does check for selection criteria being in that form, but for the most part specifying a single number is supposed to work. For example, setting …

Also, I agree with @nadvornikjiri that there's a bug (more likely multiple) in the MPI rank selection logic for …
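As a hedged illustration of the single-number form (the `H5FD_SUBFILING_IOC_SELECTION_CRITERIA` variable name is taken from the HDF5 1.14 subfiling documentation; the exact accepted formats are precisely what is being debated above):

```python
# Sketch: select every 2nd rank as an IO concentrator. Whether a bare
# number or a "type:value" pair is required is exactly the mismatch
# discussed in this thread, so treat the value format as an assumption.
import os
os.environ["H5FD_SUBFILING_IOC_SELECTION_CRITERIA"] = "2"
# ...then open the file with ioc_selection="every_nth_rank" via h5py.
```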
@jhendersonHDF Hi, thanks so much for looking into this. Could we also do some hot-fix to make either strategy (every nth rank or total) work, so I can test multiple IOCs per node? We are now writing an article for the Supercomputing conference, which means we need the results in 1-2 weeks to consolidate them. We already have the ones for one IOC per node, but due to some aspects of the Lustre filesystem beneath, we are getting mediocre results with that... Maybe we could also have a call on what parameters would get the best performance with our setup?
Hi @nadvornikjiri, here's a patch (.txt because GitHub doesn't seem to like a .patch extension) against the HDF5 1.14.0 release that hopefully fixes the issues with the different selection strategies. With those changes you should be able to specify values in the …

As for parameters for Subfiling: if you don't already, you should probably use H5Pset_alignment to set the alignment of objects in the file to be equal to the stripe size on your Lustre system, so that accesses are well-aligned. You should also probably enable collective metadata I/O with H5Pset_all_coll_metadata_ops and H5Pset_coll_metadata_write to try to improve metadata I/O, as long as your application's access pattern can support collective metadata I/O without causing problems.

Note that we've typically seen that the Subfiling VFD is very good for writing data, but may have a few inefficiencies when it comes to reading data. Until we investigate those issues, it may be that the VFD is most useful for writing out large data and then fusing the subfiles back into a regular HDF5 file with the h5fuse tool before reading the data with a different VFD, such as HDF5's MPI I/O VFD.

I'd be happy to find time for a call with you and your team, though, if you find that you're still having performance issues with the VFD.
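Translated to this PR's Python-level API, that advice might look something like the sketch below; the `driver="subfiling"` name and the `comm` keyword are assumptions based on how the existing mpio driver works, while the alignment and subfiling kwargs are taken from the PR's signature shown earlier:

```python
# Sketch of applying the alignment advice through this PR's kwargs.
# The driver name and comm keyword are assumed to mirror the mpio driver.
from mpi4py import MPI
import h5py

stripe = 4 * 1024 * 1024  # Lustre stripe size in bytes (example value)
f = h5py.File(
    "data.h5", "w",
    driver="subfiling",            # assumed driver name from this PR
    comm=MPI.COMM_WORLD,
    ioc_selection="one_per_node",
    stripe_size=stripe,
    alignment_threshold=stripe,    # align objects >= one stripe...
    alignment_interval=stripe,     # ...on stripe-size boundaries
)
f.close()
```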
See also HDFGroup/hdf5#2571
@jhendersonHDF I have tried the … The results for the setup … are:

…

So it's scaling a little bit, but not really. Could you please give me a hint on the alignment threshold and the alignment parameters described here for the setup above? It's still not clear to me whether it's bytes, intervals, or what it really is. Thanks.
@nadvornikjiri In our testing we mostly found that 1 IOC per node with 4 I/O threads performed the best, with 2 IOCs per node coming close but generally not outperforming the former case. Of course that may differ depending on the hardware, but those were results from a few different HPC machines.

For the parameters to H5Pset_alignment, both the threshold value and the alignment value are in terms of bytes. The threshold value specifies that only objects in the file which are equal to or larger than "threshold bytes" will be aligned in the file. The alignment value specifies the alignment of those objects in terms of bytes, which the library enforces by adding padding when allocating space for objects in the file. As an example, if I were to create a chunked dataset in an HDF5 file with a chunk size of 4MiB and I was using a Lustre system with a stripe size of 1MiB, I'd usually use …

Also, I don't know how large your files are, but I'm wondering if the Lustre stripe count of 1 is possibly holding you back on performance?
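In h5py terms, that 4 MiB-chunk / 1 MiB-stripe example could be set on a file access property list like this (a sketch using h5py's low-level `h5p` API; the byte values simply mirror the example above):

```python
# Sketch: align all objects of at least one chunk (4 MiB) on 1 MiB
# (stripe-size) boundaries, mirroring the example in the comment above.
import h5py
from h5py import h5p

fapl = h5p.create(h5p.FILE_ACCESS)
fapl.set_alignment(4 * 1024 * 1024,   # threshold: objects >= 4 MiB
                   1024 * 1024)       # alignment: 1 MiB boundaries
```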
@jhendersonHDF Wow, thanks for the alignment setting tip - this actually resulted in roughly a 4x speedup even for the MPIO driver. Right now we are running on 6 nodes, stripe count 48, stripe size 4 MB:

…

The write efficiency is compared to file-per-process measured by IOR, giving 10404 MB/s per node. So the result is that the IOC is still not scaling, but our use case and setup also have one more restriction, introduced by a perhaps wrong setting of Lustre on the HPC, which I try to explain here - but maybe the call would be better. Let's have a PM regarding the time for that one.

Lustre restriction - odd scaling on one node:

…

This shows that the Lustre write bandwidth is not scaling with the number of processes on one node, but more importantly, not with the stripe count on one node either. Interestingly enough, it does scale with multiple nodes (without increasing the stripe count!), so with adding more nodes it scales almost linearly. The HPC support has escalated the issue to the HPE vendor. But this is the reason I was testing with stripe count 1, as having multiple files puts them on different OSTs anyway. Our working file size is 30 TB.

The funny part is that my HDF5 run, with the alignment set to the Lustre stripe size, is running ~4x faster (4 GB/s per-node write bandwidth) than the POSIX IOR test. Why that is, I still need to figure out - maybe another topic for our call. This is the configuration IOR is running with:

…
Alright, @jhendersonHDF, I have done some extensive testing and realized these good results were a false positive; there was a bug in my code that caused them. So back to the trees: with both MPIO and Subfiling we are back to a crawl - a write bandwidth of about 500 MB/s per node. So it is comparable with the IOR single-file shared write, even though it's around 50% efficiency of that one as well. I have found there is an environment variable HDF5_USE_FILE_LOCKING=FALSE, but using it doesn't seem to change anything. Is there a way to check the Lustre file locking setting that was used for opening the HDF5 file directly? Thanks for the heads-up, and sorry for the confusion.
Hi @nadvornikjiri, the …
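A minimal sketch of using that variable (assuming the standard HDF5_USE_FILE_LOCKING semantics, under which the library reads it from the process environment at file-open time, so it must be set before the file is opened):

```python
# Sketch: HDF5_USE_FILE_LOCKING is read when the file is opened, so set
# it in the environment first (or export it before launching the job).
import os
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py
f = h5py.File("data.h5", "w")  # opened with file locking disabled
```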
Thank you for the explanation. I have some updates on the testing of the file driver then, if you are interested - with Scott again. I am able to get a little bit better performance out of it, but I still have some questions. And I have a test case for you where the read performance is really degrading; if you are interested, see how the MPIO driver compares to Subfiling with 1 IOC, 4 threads, 4 MB striping below:

…

The Global DB query is basically querying the "database" of spectra and images in the HDF5 file in parallel, creating an aggregated dataset of all spectra and their overlapping image regions, which are just dereferenced from the stacked image dataset. So it is many small read and slicing operations on 2 huge (but chunked) datasets.
As a casual onlooker of this discussion, I have a question: is this PR ready for merge to master as far as supporting the subfiling feature in h5py is concerned? The discussion seems to have been focusing on performance for some time now. Perhaps a better place for that would be the HDF Forum?
Hi @ajelenak, good point - this is indeed more focused on HDF5 itself. From my point of view, the pull request is ready for your code review; please let me know if and what I need to change to get it through. Functionally, it is already complete.
Hi guys, could you please advise what the next steps will be? I'd like to drive this one home.
I think the main thing would be to work out why the CI is failing. I suspect it's because subfiling isn't enabled, so you'll need to modify …
Hi, according to the discussion in #2194, I'm creating a pull request even though the build is failing now, to make it more convenient for review.