AWS Data Clients #3000

Draft · wants to merge 4 commits into main

Conversation

dopplershift
Member

Description Of Changes

This adds clients that make it possible to request all products within a given time range, or the single product closest to a given time, filtered by product ID and/or site. This applies to the noaa-nexrad-level2, unidata-nexrad-level3, and noaa-goes1[6-8] S3 buckets.

This is the first PR in recent times that adds remote data access capabilities to MetPy (again).

Checklist

  • Tests added
  • Fully documented

@dopplershift added the Type: Feature (New functionality) label on Apr 7, 2023
@dopplershift
Member Author

This still very much needs tests and docstrings, but I want to see if anyone has API/interface thoughts. For example: is the class-instance-based interface confusing? Naming? How's the use of the Product class to wrap returns? Should get_product() be get_nearest()? Or some other names?

Todo:

  • GOES needs the range query capability implemented
  • Better examples for each of the archives
  • One example that combines data from all 3 archives (nexrad 2, nexrad 3, and some GOES) would be sweet. An example with radar on top of satellite with maybe tornado detections comes to mind
  • Need test infrastructure for these
  • Are parse(), download(), and file all we need for Product?
  • How do we want to handle the reliance on boto3?

Ping @kgoebber @deeplycloudy
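
For concreteness, here is a rough sketch of how the proposed interface might be exercised; the names are inferred from this thread (GOES16Archive and get_product() appear below, and parse()/download() come from the to-do list above), so treat this as illustrative rather than the final API:

from datetime import datetime

from metpy.remote import GOES16Archive

archive = GOES16Archive()
prod = archive.get_product('ABI-L1b-RadC', datetime(2023, 4, 7, 12), channel=2)
prod.download()      # save the file locally
data = prod.parse()  # or parse the remote object directly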

def build_key(self, site, prod_id, dt, depth=None):
    parts = [site, prod_id, f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', f'{dt:%H}', f'{dt:%M}',
             f'{dt:%S}']
    return self.delimiter.join(parts[slice(0, depth)])

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error)

This instance of slice is unhashable.
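
Indexing a list with a slice object is valid Python, so this looks like a CodeQL false positive (the analyzer appears to treat the subscript as a potential dict lookup). A sketch of an equivalent spelling that sidesteps the warning, assuming the same surrounding class:

def build_key(self, site, prod_id, dt, depth=None):
    parts = [site, prod_id, f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', f'{dt:%H}', f'{dt:%M}',
             f'{dt:%S}']
    # For a list, parts[:depth] behaves identically to parts[slice(0, depth)],
    # including when depth is None (the whole list), and avoids the warning.
    return self.delimiter.join(parts[:depth])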
prefixes = list(itertools.chain(*(self.common_prefixes(b) for b in bounding_keys)))
loc = bisect.bisect_left(prefixes, search_key)
rng = slice(loc - 1, loc + 1) if loc else slice(0, 1)
bounding_keys = prefixes[rng]

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error, reported twice)

This instance of slice is unhashable.
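
To illustrate the bounding logic above, here is a toy, self-contained example with made-up prefixes showing how bisect_left brackets a search key within a sorted listing:

import bisect

# Hypothetical prefixes as they might come back from listing a bucket
prefixes = ['2023/04/06/', '2023/04/07/', '2023/04/08/']
search_key = '2023/04/07/KTLX'
loc = bisect.bisect_left(prefixes, search_key)
rng = slice(loc - 1, loc + 1) if loc else slice(0, 1)
print(prefixes[rng])  # ['2023/04/07/', '2023/04/08/'] -- brackets the key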

def build_key(self, site, dt, depth=None):
    parts = [f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', site, f'{site}{dt:%Y%m%d_%H%M%S}']
    return self.delimiter.join(parts[slice(0, depth)])

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error)

This instance of slice is unhashable.
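
For reference, a standalone sketch of the key this builds for the noaa-nexrad-level2 layout, assuming the delimiter is '/':

from datetime import datetime

delimiter = '/'
dt = datetime(2023, 4, 7, 12, 0, 0)
site = 'KTLX'
parts = [f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', site, f'{site}{dt:%Y%m%d_%H%M%S}']
print(delimiter.join(parts[:4]))  # 2023/04/07/KTLX (a prefix for listing)
print(delimiter.join(parts))      # 2023/04/07/KTLX/KTLX20230407_120000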

def build_key(self, product, dt, depth=None):
    parts = [product, f'{dt:%Y}', f'{dt:%j}', f'{dt:%H}', f'OR_{product}']
    return self.delimiter.join(parts[slice(0, depth)])

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error)

This instance of slice is unhashable.
(e.g. latitude, longitude, altitude) from various sources.
"""

from .aws import * # noqa: F403

Check notice

Code scanning / CodeQL: 'import *' may pollute namespace (Note)

Import pollutes the enclosing namespace, as the imported module metpy.remote.aws does not define '__all__'.
@blaylockbk
Contributor

I'll add my 2 cents based on my experience with goes2go...apologies for the long comment.


For comparison, the goes2go API is roughly this...

from goes2go import GOES

G = GOES(satellite=16, product="ABI-L2-MCMIP", domain='C')

# each of these downloads then reads the data with xarray
G.nearesttime('2022-01-01 6:15')
G.latest()
G.timerange(start='2022-06-01 00:00', end='2022-06-01 01:00')
G.timerange(recent='30min')

Main comment

Even though the GOES data is in a different bucket for each satellite, it's practically a single data type (just as NEXRAD is split across different sites, GOES is just split across different satellites).

Instead of

from metpy.remote import GOES16Archive, GOES17Archive, GOES18Archive

I would prefer a single import and then specify the satellite in my request. Something like...

from metpy.remote import GOESArchive

GOESArchive(satellite=16)

This would make it easier for a user to change the satellite they want without adding/changing an import.
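
A minimal sketch of what that single entry point could look like; the constructor below is hypothetical and just maps the satellite number to its bucket:

class GOESArchive:
    def __init__(self, satellite=16):
        if satellite not in (16, 17, 18):
            raise ValueError(f'Unknown GOES satellite: {satellite}')
        # One S3 bucket per satellite; everything else is shared logic
        self.bucket_name = f'noaa-goes{satellite}'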


Minor comments

One feature of goes2go is the ability to download files locally and read the local file if it exists rather than going back to AWS for it. This has been popular with users who want to work with some data offline or reuse files a lot (e.g. for a case study). This might be out of scope for this PR.
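
A hypothetical helper (all names made up) capturing that idea: check a local cache directory before touching S3, and only download on a miss:

from pathlib import Path

import boto3
import botocore
from botocore.client import Config

def cached_download(bucket_name, key, cache_dir='data'):
    # Reuse a previously downloaded copy if one exists locally
    local = Path(cache_dir) / key
    if not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        s3 = boto3.resource('s3', config=Config(signature_version=botocore.UNSIGNED))
        s3.Bucket(bucket_name).download_file(key, str(local))
    return local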


From an ease-of-use perspective, I find it easier to write code when date inputs for an API like this can optionally be given as a datetime string (I've always used pandas.to_datetime to parse these strings because they can be formatted in many different ways; maybe there are other options).

# This is easier to write and read...
GOESArchive().get_product(dt="2021-01-01 06:00")
GOESArchive().get_product(dt="20210101T06")

# than this...
GOESArchive().get_product(dt=datetime(2021,1,1,6,0))

Not all of it is pretty, but I'd be happy to share other aspects of goes2go and why I did certain things if you're interested. When this is merged, I might update goes2go to use this instead; this PR seems more robust and future-proof.

@dopplershift
Member Author

@blaylockbk Thanks for the feedback! The point about using a single GOES client with different satellites is a really good suggestion, thanks.

I'd love to hear more considerations based on your experience with goes2go. I'm certainly staring at your top comment with the API, thinking over the strengths and weaknesses of the approach I took in comparison.

I'm mixed on the idea of accepting strings for date/time; on one hand, it does seem to make it easy; on the other, it seems really weird to couple this code to pandas functionality when it otherwise isn't using Pandas at all. I don't think specifying one particular string format would be nearly as handy, though. (Given that Pandas is already a MetPy dependency, this is probably something I just have to get over.) Are there some use cases for supporting the string input that aren't direct user input? (i.e. pulling from another source)

@blaylockbk
Contributor

I'm mixed on the idea of accepting strings for date/time; on one hand, it does seem to make it easy; on the other, it seems really weird to couple this code to pandas functionality when it otherwise isn't using Pandas at all.

@dopplershift, yeah that's just a personal preference because I'm lazy. Anytime I write a function with date or datetime input, I convert it using pandas (about 80% of the time I've already imported pandas for something else); probably not the best practice for everyone...

import pandas as pd
from datetime import datetime, timedelta

def my_function(date, delta):
    date = pd.to_datetime(date)
    delta = pd.to_timedelta(delta)
    return date + delta

a = my_function("2023-01-01 06:00", "3H")
b = my_function(datetime(2023,1,1,6,0), timedelta(hours=3))

It's mainly a convenience for direct user input (notebooks), but often I have a list of ISO dates I need to loop over. Not a problem if you don't like it.

@blaylockbk
Contributor

Another "feature" would be allowing an alias "east" or "west" that switches to the appropriate satellite depending on which was operational for the date requested.

GOESArchive(satellite="west") # could be 17 or 18, depending on the date requested

@dopplershift
Member Author

It's mainly a convenience for direct user input (notebooks), but often I have a list of ISO dates I need to loop over. Not a problem if you don't like it.

Eh, it doesn't have to match my personal preferences necessarily. It's about the engineering trade-offs. It's entirely possible the complexity/coupling is worth it to yield a better user experience. That's why I'm trying to figure out what the concrete benefits are.

# GOES file names embed the scan start time as a component like
# s20231971801171; it is third from the end, and the final digit
# (tenths of a second) is dropped before parsing.
start_time = key.split('_')[-3]
return datetime.strptime(start_time[:-1], 's%Y%j%H%M%S')

def get_product(self, product, dt, mode=None, channel=None):


I think the kwarg should be "band" instead of "channel". To my knowledge, since the 90s the word "channel" has increasingly been replaced by the synonymous word "band", e.g. AVHRR has "channels" but MODIS and VIIRS have "bands". Sometimes the word "channel" is still used today in official documents when talking more about the hardware side of the instruments. The GOES-R SERIES PRODUCT DEFINITION AND USERS’ GUIDE, at 726 pages, yields 25 search results for "channel" but several hundred for "band". Oddly, though, the GOES ABI L2 filenames use C13, with "C" for channel.

Contributor

Glad I'm not the only one scratching my head over when to say "band" or "channel" 😂

It does seem "band" is the preferred term in the GOES NetCDF files; some examples...

band_id_C01:long_name = "ABI channel 1" ;
band_id_C01:standard_name = "sensor_band_identifier" ;

band_wavelength_C01 = 0.47
band_wavelength_C01:long_name = "ABI band 1 central wavelength" ;

Member Author

Yeah, the "C" prefix for band/channel is the only inconsistency from the GOES side with regards to what we call it.

I'm happy to just use "band", but I think that's why "channel" comes to my mind first.

The file object will be left at the end of the file after reading,
leading to other things that use it (like parse) trying to read from
EOF.
@deeplycloudy
Collaborator

To test this functionality I modified python-training#136 to plot GLM data. See my comment there for what that looks like. The GOESArchive client works well to pull the data.

Some thoughts on aspects of data use unique to GLM:

  • GLM data come in 20-second bundles. That’s usually too small a window to give a representative view of “lightning at this time”, so one has to download multiple files and loop over many Datasets. Is that a user convenience worth adding? Operational use defaults to a 5 min aggregation. Is it worth trying to match it to one of the ABI cadences (full disk, CONUS, or mesoscale)?
    • I can contribute code for concatenating GLM files into one Dataset, though it's a full screen of code (a naive flash-only sketch follows after this list).
  • Subsetting the GLM LCFA files in a self-consistent way requires something like glmtools to handle the flash-group-event tree. If you just want to plot flashes and groups in whatever field of view you have, this step is not necessary, but I could see this question arising if someone wanted to do a more sophisticated data reduction.
  • A better visualization solution for most users would be to use the GLM gridded imagery, which Unidata provides in real time through THREDDS, but since those are a pseudo-operational product, they are not on NOAA’s S3. NASA has kindly added them as L3 products to their archive. They are on S3, but behind an EarthData login; if they were more open, I'd love to add them to the GOESArchive to abstract across data repositories.
    • They are in 1 min files, and so also are usually aggregated before use. That’s a pretty trivial operation in xarray.
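
Since the full version is a screenful, here is only a naive sketch of that concatenation, assuming the standard LCFA naming (flash_* variables along a number_of_flashes dimension); it keeps flash-level data only and ignores the group/event tree that glmtools handles properly:

import xarray as xr

def concat_glm_flashes(paths):
    # Keep only flash-level variables so a simple concat stays self-consistent
    pieces = []
    for path in paths:
        ds = xr.open_dataset(path)
        flash_vars = [v for v in ds.data_vars if v.startswith('flash_')]
        pieces.append(ds[flash_vars])
    return xr.concat(pieces, dim='number_of_flashes')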

@deeplycloudy
Collaborator

I had occasion to think about model data, which is also now increasingly on S3. I wanted to document that here for further rumination about API design.

The docs for the GEFS at the link above are somewhat out of date. For yesterday's data, atmos is in the path and 0p50 in the filenames for reasons that are unclear. Earlier years (e.g., 2017) have a different structure.

There are also many file types, but I needed geopotential height, which is in the "popular variables" file type. For one time for one member, the key looks like:
gefs.20231106/00/atmos/pgrb2ap5/gep01.t00z.pgrb2a.0p50.f120

Below is code for downloading and concatenating all the ensemble members for one time. It shows the parameters that need to be templated for this (admittedly narrow) use case.

import os
from datetime import datetime

import boto3
import botocore
import numpy as np
from botocore.client import Config

outpath = '/data/'
S3bucket = 'noaa-gefs-pds'
s3 = boto3.resource('s3', config=Config(signature_version=botocore.UNSIGNED,
                                        user_agent_extra='Resource'))
bucket = s3.Bucket(S3bucket)

s3ymd = datetime(2023, 11, 6).strftime('%Y%m%d')
s3hr = 0  # 06, 12, 18
s3members = np.arange(1, 20 + 1, 1)
s3fhour = 60  # arange(0, 384, 6)

for s3member in s3members:
    S3baserun = f"gefs.{s3ymd}/{s3hr:02d}/atmos/pgrb2ap5/"
    S3grib = f"gep{s3member:02d}.t{s3hr:02d}z.pgrb2a.0p50.f{s3fhour:03d}"
    S3key = S3baserun + S3grib
    outfile = os.path.join(outpath, S3grib)
    with open(outfile, 'wb') as fileobj:
        bucket.download_fileobj(S3key, fileobj)

All the members for one time can then be concatenated with cfgrib as follows:

from glob import glob

import xarray as xr

f060grib = glob(os.path.join(outpath, '*f060'))
gribs = xr.open_mfdataset(f060grib, engine="cfgrib",
                          combine='nested', concat_dim='ens',
                          backend_kwargs={'filter_by_keys':
                                          {'typeOfLevel': 'isobaricInhPa', 'shortName': 'gh'}})
