Add support for distributed cholla datasets. #4702

mabruzzo · 2023-10-10T23:19:39Z

PR Summary

This PR adds support for loading Cholla datasets that are distributed over multiple files. Previously, the frontend could only load Cholla datasets after they were concatenated into a single large dataset.

This functionality is a little inefficient right now: we need to read every HDF5 file to figure out the mapping between spatial locations and locations on disk. This seems like something we can easily improve in the future (possibly by having Cholla write out an extra attribute describing how 3D locations are mapped into 1D).
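To make the "3D locations mapped into 1D" idea concrete, here is a minimal sketch of one possible mapping between a process's position in the domain decomposition and its process id. The x-fastest ordering and the function names are assumptions made purely for illustration; they are not taken from Cholla or from this PR.

```python
def rank_from_coords(ix, iy, iz, nprocs):
    """Map 3D process coordinates to a 1D process id (assumed x-fastest)."""
    nx, ny, nz = nprocs
    if not (0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz):
        raise ValueError("process coordinates out of range")
    return ix + nx * (iy + ny * iz)


def coords_from_rank(rank, nprocs):
    """Invert rank_from_coords under the same assumed ordering."""
    nx, ny, nz = nprocs
    if not 0 <= rank < nx * ny * nz:
        raise ValueError("rank out of range")
    ix = rank % nx
    iy = (rank // nx) % ny
    iz = rank // (nx * ny)
    return ix, iy, iz
```

If the ordering were recorded as an attribute in each file, a reader could compute which file holds a given region instead of opening every file to find out.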

PR Checklist

Adds a test for any bugs fixed. Adds tests for new features.

For this PR, I suspect that we will need to upload a new test dataset. I just had a few questions:

It's been a while since I've done this. Could someone remind me of the procedure for doing this?
Weirdly enough, I get the following message when I run the unit-tests on the main branch. Do you have any idea why this is happening? (For context, the other 3 tests all run)

yt/frontends/cholla/tests/test_outputs.py::test_cholla_data SKIPPED (cannot load dataset ChollaSimple/0.h5)
Is there any preference for unit tests vs answer-tests when it comes to frontends?

neutrinoceros · 2023-10-11T05:51:21Z

It's been a while since I've done this. Could someone remind me of the procedure for doing this?

you'll need to

open a pull request on the website repository (see for example Adding basic gizmo_zeldovich entry. website#121)
add an entry to yt/sample_data_registry.json (this repo). This is to support loading the new sample dataset with yt.load_sample

Weirdly enough, I get the following message when I run the unit-tests on the main branch. Do you have any idea why this is happening? (For context, the other 3 tests all run)

Maybe that's a bug with small_patch_amr. I suggest trying to work on a simplified version of the test and refine it until it doesn't skip, to discover what's happening.

Is there any preference for unit tests vs answer-tests when it comes to frontends?

I think unit tests should be preferred whenever they suffice, for a couple of reasons:

answer tests are currently deeply rooted in the nose test framework (migration to pytest is still ongoing), so adding more of them makes this long-lasting migration ever so slightly harder
fast tests are easier to scale

That said, if what you need is some answer tests, go for it!

matthewturk · 2023-10-18T17:10:16Z

yt/frontends/cholla/data_structures.py

 from yt.geometry.api import Geometry
 from yt.geometry.grid_geometry_handler import GridIndex
 from yt.utilities.on_demand_imports import _h5py as h5py

 from .fields import ChollaFieldInfo


+def _split_fname_proc_suffix(filename: str):


Could you put a short note about how this is different from os.path.splitext? Just to avoid future confusion.
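For readers wondering what the difference could look like, here is a rough sketch. It assumes that distributed outputs carry a trailing numeric process id (for example a name like 0.h5.3), which is an assumption for illustration. The helper split_proc_suffix below is hypothetical and is not the PR's _split_fname_proc_suffix.

```python
import os.path


def split_proc_suffix(filename):
    """Hypothetical helper: split off a trailing '.<digits>' process suffix."""
    base, sep, tail = filename.rpartition(".")
    # Only treat the tail as a process id if it is purely numeric and
    # something is left in front of the dot.
    if sep and base and tail.isdecimal():
        return base, tail
    return filename, ""


# os.path.splitext always splits at the last dot and keeps the dot.
print(os.path.splitext("0.h5.3"))   # ('0.h5', '.3')
print(os.path.splitext("0.h5"))     # ('0', '.h5')

# The sketch only splits when the last component looks like a process id.
print(split_proc_suffix("0.h5.3"))  # ('0.h5', '3')
print(split_proc_suffix("0.h5"))    # ('0.h5', '')
```

The practical difference in this sketch is the concatenated-file case: splitext would strip ".h5" as if it were a suffix, while the helper leaves the name intact.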

matthewturk

only minor stuff -- looks good otherwise

matthewturk · 2023-10-18T17:15:00Z

yt/frontends/cholla/data_structures.py

+            self.grid_left_edge[i] = left_frac
+            self.grid_right_edge[i] = right_frac
+            self.grid_dimensions[i] = dims_local


Suggested change

-            self.grid_left_edge[i] = left_frac
-            self.grid_right_edge[i] = right_frac
-            self.grid_dimensions[i] = dims_local
+            self.grid_left_edge[i,:] = left_frac
+            self.grid_right_edge[i,:] = right_frac
+            self.grid_dimensions[i,:] = dims_local

Just for clarity, could we make it obvious that it's setting a slice to the values?
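As a quick check that the suggestion is purely about readability, the sketch below compares the two spellings on a small NumPy array. The array shape and values are made up for illustration and only stand in for the index arrays in the diff.

```python
import numpy as np

n_grids = 3
grid_left_edge = np.zeros((n_grids, 3))
left_frac = np.array([0.0, 0.5, 0.25])  # made-up values

implicit = grid_left_edge.copy()
implicit[1] = left_frac        # row selected implicitly

explicit = grid_left_edge.copy()
explicit[1, :] = left_frac     # same row, slice spelled out

# Both forms write the same three values into row 1.
assert np.array_equal(implicit, explicit)
```

For a 2D array the two assignments behave identically; the explicit form just signals to the reader that a whole row is being filled.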

matthewturk · 2023-10-18T17:16:17Z

yt/frontends/cholla/io.py

+    def io_iter(self, chunks, fields):
+        # this is loosely inspired by the implementation used for Enzo/Enzo-E
+        # - those other options use the lower-level hdf5 interface. Unclear
+        #   whether that affords any advantages...


Good question. I think in the past it did because we avoided having to re-allocate temporary scratch space, but I am not sure that would hold up to current inquiries. I think the big advantage those have is tracking the groups within the iteration.

matthewturk · 2023-10-18T17:16:39Z

yt/frontends/cholla/io.py

+        fh, filename = None, None
+        for chunk in chunks:
+            for obj in chunk.objs:
+                if obj.filename is None:  # unclear when this case arises...


likely it will not here, unless you manually construct virtual grids

Out of curiosity, what is a virtual grid?

I realize this may be an involved answer - so if you could just point me to a frontend (or other area of the code) using virtual grids, I can probably investigate that on my own.
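The quoted loop starts from fh, filename = None, None, which reads like a pattern of keeping one file handle open and reopening only when the filename changes. Below is a minimal sketch of that pattern under that assumption. Obj, Chunk, and iter_objs_with_handles are hypothetical stand-ins rather than yt classes, and plain binary files replace HDF5 handles so the example has no dependencies.

```python
from collections import namedtuple

# Hypothetical stand-ins for yt's chunk and grid objects.
Obj = namedtuple("Obj", ["filename", "id"])
Chunk = namedtuple("Chunk", ["objs"])


def iter_objs_with_handles(chunks, opener=open):
    """Yield (handle, obj), reopening only when obj.filename changes."""
    fh, filename = None, None
    try:
        for chunk in chunks:
            for obj in chunk.objs:
                if obj.filename is None:
                    continue  # nothing on disk to read for this object
                if obj.filename != filename:
                    if fh is not None:
                        fh.close()
                    fh, filename = opener(obj.filename, "rb"), obj.filename
                yield fh, obj
    finally:
        # Close the last handle even if the consumer stops early.
        if fh is not None:
            fh.close()
```

In this sketch, consecutive objects stored in the same file share one open call, and an object with no filename is simply skipped.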

mabruzzo · 2023-10-27T15:20:22Z

My apologies for taking a while to follow up on this. I plan to circle back in the next week or so.

neutrinoceros added the "code frontends" and "enhancement" labels Oct 11, 2023

Add support for distributed cholla datasets.

4428058

mabruzzo force-pushed the cholla-frontend-improvements branch from 2806b17 to 4428058 October 18, 2023 14:28

matthewturk reviewed Oct 18, 2023

matthewturk approved these changes Oct 18, 2023

mabruzzo mentioned this pull request Oct 27, 2023

minor bugfix in cholla frontend #4686

Merged
