Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAS-1376 - Update HarmonyAdapter to perform many-to-one operations. #18

Merged
merged 1 commit into from Feb 25, 2022

Conversation

owenlittlejohns
Copy link
Member

This PR updates the main NetCDFToZarrAdapter class with a view to many-to-one operations. This includes adding utilities for downloading multiple granules in parallel threads, and for combining the input STAC items into a single output stack item that describes the aggregated output.

A lot of this work is heavily inspired by some great, similar functionality in the PO.DAAC Concise service.

For now, I've tried to ensure a single granule request could still operate, but a multiple granule request will raise an exception, because we've not yet implemented that functionality.

@@ -1,5 +1,5 @@
autopep8 ~= 1.5
coverage ~= 4.5
coverage ~= 6.3.1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting thing here - the older version of coverage was causing test failures for the multiprocessing module. I think this is the related GitHub issue.

@@ -0,0 +1,126 @@
""" A module containing utility functions to extract information from
`pystac.Item` instances for Harmony, or to produce an output
`pystac.Catalog`. These functions adapt work from the PO.DAAC concise
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these aren't identical - but they really do lean heavily on the great work done by @frankinspace et al. (Same for the download_utilities.py module. There might be some stuff that could be migrated into harmony-service-lib-py (like multi-threaded downloads for many-to-one services). But (as David Auty likes to say) those are "future distractions".

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% on multi-threaded downloads. That absolutely should be a shared library - eventually. Or perhaps even better - moved into the HarmonyPy package?

None)


def get_output_catalog(input_catalog: Catalog, zarr_root: str) -> Catalog:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also now wondering if we need to have similar code in HOSS - particularly for requests that specify a bounding box.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in get_output_bounding_box (below) - So much common code, Harmony Adapter, or HarmonyPy ? Agree we may need to consider a HOSS ticket.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the harmony-service-lib-py package is probably the place to pop some of this stuff. That's the core Python package that offers common functionality to the backend services, including the HarmonyAdapter. It's very tempting to just suggest copying this module (almost line-for-line) into that package.

be caused by requesting a file with an unknown protocol.

"""
with self.assertRaises(RuntimeError):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has been a little flaky on my local machine (the harmony.util.download function is successful for a URL that isn't an obvious local file path or URL). I'm not sure I can see why. I'm hoping it behaves on GitHub.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem was that, in the first failure, the local file /tmp/<uuid>.nc4 was being created, even in failure. The harmony.util.download functions sees this file in subsequent test runs.

I've updated this test to put test artefacts in a temporary directory, and then remove that directory in the TestCase.tearDown method.

@owenlittlejohns
Copy link
Member Author

owenlittlejohns commented Feb 19, 2022

I looked at the test failures. It looks like the end-to-end tests. Probably due to swapping out the process_item for process_items_many_to_one. I'll work on fixing that up.

Update: I also added validation to ensure a STAC catalog was included in the input. The end-to-end tests didn't have one in the arguments. I've tried updating the end-to-end tests with a --harmony-sources argument... we'll see how those go.

@owenlittlejohns owenlittlejohns force-pushed the DAS-1376-multiple-input-granules branch 3 times, most recently from 4898523 to a3385cf Compare February 19, 2022 17:24
@owenlittlejohns owenlittlejohns force-pushed the DAS-1376-multiple-input-granules branch 2 times, most recently from 7950ce9 to 44d62d5 Compare February 22, 2022 18:51

@patch('harmony_netcdf_to_zarr.adapter.download_granules')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit disappointed to have to mock the download of the input granule as that makes this (and the large file conversion) less of an end-to-end test. However, the mock used to make the multiprocessing module work with moto was then causing issues with the multiprocessing.Queue used to download the input granules.

I spent quite a while looking into this - there was some odd behaviour, indeed! Mocking the download in these two tests was the simplest way I could find to make the tests work in the short term.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check

# Mimicking PO.DAAC Concise: for many-to-one the file name is
# "<collection>_merged.zarr".
collection = self._get_item_source(items[0]).collection
output_name = f'{collection}_merged.zarr'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume {collection} is concept_id, not just short_name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to dig around for this in the unit tests for harmony-service-lib-py to confirm. Looking at the unit tests there, it looks like you are right; the URL will likely contain the collection concept ID, not the short name. (Here's the class method, where it's essentially looking for the parent item in the STAC catalogue).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short-Name is preferable for end-user. Is this possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree that a short name is more end-user friendly! I'm not sure that the short name is available to us - this is making use of a method on the BaseHarmonyAdapter class, so outside of the scope of the service itself.

I guess the other thing worth noting is that this is what the PO.DAAC Concise service is doing, so we're being consistent with existing services. I definitely think this is a good question, and think the short name would be preferable if available. This might be a conversation to float up to a higher level, to see if the collection short name can be made available to the services somehow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addendum (sorry, still mulling this over): I think, mainly, this all stems from what Harmony chooses to put in the input STAC catalogue sent to the backend service.

(An alternative might be to delve into the metadata of one of the granules to find a short name in the global attributes, but then we'd be inconsistent with Concise)

access_token: str, harmony_config: Config,
logger: Logger, process_count: int = None) -> List[str]:
""" A method which scales concurrent downloads to the number of available
CPU cores. For further explaination, see documentation on "multi-track

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explanation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah. That typo is just embarrassing. I'll fix that up! Thanks, Ken.

Update: I've fixed this typo.

# Temporarily retain old version of NetCDF-to-Zarr service
netcdf_to_zarr(local_file_paths[0], zarr_store)
else:
raise NotImplementedError('Concatenation is not yet supported.')
Copy link

@D-Auty D-Auty Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused - isn't this where we initiate our aggregation processing? Above code addresses many-to-one options, why raise an exception here? Oh - I got it, from some conversation - this is still a mergeable solution but not the final solution (yet to come). Classic case for a ToDo comment, but not strictly necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah exactly. My aim here was to ensure that we could deploy this and continue service for a single granule input, but we've not yet actually done the bit to write many-to-one input (that's DAS-1379).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a short-lived TODO comment referring to DAS-1379.

@@ -0,0 +1,126 @@
""" A module containing utility functions to extract information from
`pystac.Item` instances for Harmony, or to produce an output
`pystac.Catalog`. These functions adapt work from the PO.DAAC concise
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% on multi-threaded downloads. That absolutely should be a shared library - eventually. Or perhaps even better - moved into the HarmonyPy package?

None)


def get_output_catalog(input_catalog: Catalog, zarr_root: str) -> Catalog:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in get_output_bounding_box (below) - So much common code, Harmony Adapter, or HarmonyPy ? Agree we may need to consider a HOSS ticket.

objects are gone when the child processes quit. This method mocks
`multiprocessing.popen_fork.Popen._launch` to do the work in the
parent process.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice


@patch('harmony_netcdf_to_zarr.adapter.download_granules')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check

@owenlittlejohns owenlittlejohns merged commit 9e1b8b0 into main Feb 25, 2022
@owenlittlejohns owenlittlejohns deleted the DAS-1376-multiple-input-granules branch February 25, 2022 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants