AWS Data Clients #3000

Draft · wants to merge 4 commits into main

Conversation

dopplershift
Member

Description Of Changes

This adds clients that make it possible to request all products within a given time range, or the single product closest to a given time, filtered by product ID and/or site. This applies to the noaa-nexrad-level2, unidata-nexrad-level3, and noaa-goes1[6-8] S3 buckets.

This is the first PR in recent times that adds remote data access capabilities to MetPy (again).

Checklist

  • Tests added
  • Fully documented

@dopplershift added the Type: Feature (New functionality) label on Apr 7, 2023
@dopplershift
Member Author

This still very much needs tests and docstrings, but I want to see if anyone has API/interface thoughts. For example: is the class-instance-based interface confusing? Naming? How's the use of the Product class to wrap returns? Should get_product() be get_nearest()? Or some other names?

Todo:

  • GOES needs the range query capability implemented
  • Better examples for each of the archives
  • One example that combines data from all 3 archives (nexrad 2, nexrad 3, and some GOES) would be sweet. An example with radar on top of satellite with maybe tornado detections comes to mind
  • Need test infrastructure for these
  • Are parse(), download(), and file all we need for Product?
  • How do we want to handle the reliance on boto3?

Ping @kgoebber @deeplycloudy
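
For concreteness, here is a rough sketch of how the proposed interface might be exercised; the names are inferred from this thread (GOES16Archive and get_product() appear below, and parse()/download() come from the to-do list above), so treat this as illustrative rather than the final API:

from datetime import datetime

from metpy.remote import GOES16Archive

archive = GOES16Archive()
prod = archive.get_product('ABI-L1b-RadC', datetime(2023, 4, 7, 12), channel=2)
prod.download()      # save the file locally
data = prod.parse()  # or parse the remote object directly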

def build_key(self, site, prod_id, dt, depth=None):
    parts = [site, prod_id, f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', f'{dt:%H}', f'{dt:%M}',
             f'{dt:%S}']
    return self.delimiter.join(parts[slice(0, depth)])

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error)

This instance of slice is unhashable.
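
Indexing a list with a slice object is valid Python, so this looks like a CodeQL false positive (the analyzer appears to treat the subscript as a potential dict lookup). A sketch of an equivalent spelling that sidesteps the warning, assuming the same surrounding class:

def build_key(self, site, prod_id, dt, depth=None):
    parts = [site, prod_id, f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', f'{dt:%H}', f'{dt:%M}',
             f'{dt:%S}']
    # For a list, parts[:depth] behaves identically to parts[slice(0, depth)],
    # including when depth is None (the whole list), and avoids the warning.
    return self.delimiter.join(parts[:depth])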
prefixes = list(itertools.chain(*(self.common_prefixes(b) for b in bounding_keys)))
loc = bisect.bisect_left(prefixes, search_key)
rng = slice(loc - 1, loc + 1) if loc else slice(0, 1)
bounding_keys = prefixes[rng]

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error, reported twice)

This instance of slice is unhashable.
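
To illustrate the bounding logic above, here is a toy, self-contained example with made-up prefixes showing how bisect_left brackets a search key within a sorted listing:

import bisect

# Hypothetical prefixes as they might come back from listing a bucket
prefixes = ['2023/04/06/', '2023/04/07/', '2023/04/08/']
search_key = '2023/04/07/KTLX'
loc = bisect.bisect_left(prefixes, search_key)
rng = slice(loc - 1, loc + 1) if loc else slice(0, 1)
print(prefixes[rng])  # ['2023/04/07/', '2023/04/08/'] -- brackets the key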

def build_key(self, site, dt, depth=None):
    parts = [f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', site, f'{site}{dt:%Y%m%d_%H%M%S}']
    return self.delimiter.join(parts[slice(0, depth)])

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error)

This instance of slice is unhashable.
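
For reference, a standalone sketch of the key this builds for the noaa-nexrad-level2 layout, assuming the delimiter is '/':

from datetime import datetime

delimiter = '/'
dt = datetime(2023, 4, 7, 12, 0, 0)
site = 'KTLX'
parts = [f'{dt:%Y}', f'{dt:%m}', f'{dt:%d}', site, f'{site}{dt:%Y%m%d_%H%M%S}']
print(delimiter.join(parts[:4]))  # 2023/04/07/KTLX (a prefix for listing)
print(delimiter.join(parts))      # 2023/04/07/KTLX/KTLX20230407_120000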

def build_key(self, product, dt, depth=None):
    parts = [product, f'{dt:%Y}', f'{dt:%j}', f'{dt:%H}', f'OR_{product}']
    return self.delimiter.join(parts[slice(0, depth)])

Check failure

Code scanning / CodeQL: Unhashable object hashed (Error)

This instance of slice is unhashable.
(e.g. latitude, longitude, altitude) from various sources.
"""

from .aws import * # noqa: F403

Check notice

Code scanning / CodeQL: 'import *' may pollute namespace (Note)

Import pollutes the enclosing namespace, as the imported module metpy.remote.aws does not define '__all__'.
@blaylockbk
Contributor

I'll add my 2 cents based on my experience with goes2go...apologies for the long comment.


For comparison, the goes2go API is roughly this...

from goes2go import GOES

G = GOES(satellite=16, product="ABI-L2-MCMIP", domain='C')

# each of these downloads then reads the data with xarray
G.nearesttime('2022-01-01 6:15')
G.latest()
G.timerange(start='2022-06-01 00:00', end='2022-06-01 01:00')
G.timerange(recent='30min')

Main comment

Even though the GOES data is in a different bucket for each satellite, it's practically a single data type (just as NEXRAD is split across different sites, GOES is just split across different satellites).

Instead of

from metpy.remote import GOES16Archive, GOES17Archive, GOES18Archive

I would prefer a single import and then specify the satellite in my request. Something like...

from metpy.remote import GOESArchive

GOESArchive(satellite=16)

This would make it easier for a user to change the satellite they want without adding/changing an import.
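
A minimal sketch of what that single entry point could look like; the constructor below is hypothetical and just maps the satellite number to its bucket:

class GOESArchive:
    def __init__(self, satellite=16):
        if satellite not in (16, 17, 18):
            raise ValueError(f'Unknown GOES satellite: {satellite}')
        # One S3 bucket per satellite; everything else is shared logic
        self.bucket_name = f'noaa-goes{satellite}'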


Minor comments

One feature of goes2go is the ability to download files locally and read the local file if it exists rather than going back to AWS for it. This has been popular with users who want to work with some data offline or reuse files a lot (e.g. for a case study). This might be out of scope for this PR.
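
A hypothetical helper (all names made up) capturing that idea: check a local cache directory before touching S3, and only download on a miss:

from pathlib import Path

import boto3
import botocore
from botocore.client import Config

def cached_download(bucket_name, key, cache_dir='data'):
    # Reuse a previously downloaded copy if one exists locally
    local = Path(cache_dir) / key
    if not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        s3 = boto3.resource('s3', config=Config(signature_version=botocore.UNSIGNED))
        s3.Bucket(bucket_name).download_file(key, str(local))
    return local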


From an ease-of-use perspective, I find it easier to write code when date inputs for an API like this can optionally be given as a datetime string (I've always used pandas.to_datetime to parse these strings because they can be formatted in many different ways; maybe there are other options).

# This is easier to write and read...
GOESArchive().get_product(dt="2021-01-01 06:00")
GOESArchive().get_product(dt="20210101T06")

# than this...
GOESArchive().get_product(dt=datetime(2021,1,1,6,0))

Not all of it is pretty, but I'd be happy to share other aspects of goes2go and why I did certain things if you're interested. When this is merged, I might update goes2go to use this instead; this PR seems more robust and future-proof.

@dopplershift
Member Author

@blaylockbk Thanks for the feedback! The point about using a single GOES client with different satellites is a really good suggestion, thanks.

I'd love to hear more considerations based on your experience with goes2go. I'm certainly staring at your top comment with the API, thinking over the strengths and weaknesses of the approach I took in comparison.

I'm mixed on the idea of accepting strings for date/time; on one hand, it does seem to make it easy; on the other, it seems really weird to couple this code to pandas functionality when it otherwise isn't using Pandas at all. I don't think specifying one particular string format would be nearly as handy, though. (Given that Pandas is already a MetPy dependency, this is probably something I just have to get over.) Are there some use cases for supporting the string input that aren't direct user input? (i.e. pulling from another source)

@blaylockbk
Contributor

I'm mixed on the idea of accepting strings for date/time; on one hand, it does seem to make it easy; on the other, it seems really weird to couple this code to pandas functionality when it otherwise isn't using Pandas at all.

@dopplershift, yeah that's just a personal preference because I'm lazy. Anytime I write a function with date or datetime input, I convert it using pandas (about 80% of the time I've already imported pandas for something else); probably not the best practice for everyone...

import pandas as pd
from datetime import datetime, timedelta

def my_function(date, delta):
    date = pd.to_datetime(date)
    delta = pd.to_timedelta(delta)
    return date + delta

a = my_function("2023-01-01 06:00", "3H")
b = my_function(datetime(2023,1,1,6,0), timedelta(hours=3))

It's mainly a convenience for direct user input (notebooks), but often I have a list of ISO dates I need to loop over. Not a problem if you don't like it.

@blaylockbk
Contributor

Another "feature" would be allowing an alias "east" or "west" that switches to the appropriate satellite depending on which was operational for the date requested.

GOESArchive(satellite="west") # could be 17 or 18, depending on the date requested

@dopplershift
Member Author

It's mainly a convenience for direct user input (notebooks), but often I have a list of ISO dates I need to loop over. Not a problem if you don't like it.

Eh, it doesn't have to match my personal preferences necessarily. It's about the engineering trade-offs. It's entirely possible the complexity/coupling is worth it to yield a better user experience. That's why I'm trying to figure out what the concrete benefits are.

# GOES file names embed the scan start time as a component like
# s20231971801171; it is third from the end, and the final digit
# (tenths of a second) is dropped before parsing.
start_time = key.split('_')[-3]
return datetime.strptime(start_time[:-1], 's%Y%j%H%M%S')

def get_product(self, product, dt, mode=None, channel=None):


I think the kwarg should be "band" instead of "channel". To my knowledge, since the 90s the word "channel" has increasingly been replaced by the synonymous word "band", e.g. AVHRR has "channels" but MODIS and VIIRS have "bands". Sometimes the word "channel" is still used today in official documents when talking more about the hardware side of the instruments. The GOES-R SERIES PRODUCT DEFINITION AND USERS’ GUIDE, at 726 pages, yields 25 search results for "channel" but several hundred for "band". Oddly, though, the GOES ABI L2 filenames use C13, with "C" for channel.

Contributor

Glad I'm not the only one scratching my head over when to say "band" or "channel" 😂

It does seem "band" is the preferred term in the GOES NetCDF files; some examples...

band_id_C01:long_name = "ABI channel 1" ;
band_id_C01:standard_name = "sensor_band_identifier" ;

band_wavelength_C01 = 0.47
band_wavelength_C01:long_name = "ABI band 1 central wavelength" ;

Member Author

Yeah, the "C" prefix for band/channel is the only inconsistency from the GOES side with regards to what we call it.

I'm happy to just use "band", but I think that's why "channel" comes to my mind first.

The file object will be left at the end of the file after reading,
leading to other things that use it (like parse) trying to read from
EOF.
@deeplycloudy
Collaborator

To test this functionality I modified python-training#136 to plot GLM data. See my comment there for what that looks like. The GOESArchive client works well to pull the data.

Some thoughts on aspects of data use unique to GLM:

  • GLM data come in 20-second bundles. That’s usually too small a window to give a representative view of “lightning at this time”, so one has to download multiple files and loop over many Datasets. Is that a user convenience worth adding? Operational use defaults to a 5 min aggregation. Is it worth trying to match it to one of the ABI cadences (full disk, CONUS, or mesoscale)?
    • I can contribute code for concatenating GLM files into one Dataset, though it's a full screen of code (a naive flash-only sketch follows after this list).
  • Subsetting the GLM LCFA files in a self-consistent way requires something like glmtools to handle the flash-group-event tree. If you just want to plot flashes and groups in whatever field of view you have, this step is not necessary, but I could see this question arising if someone wanted to do a more sophisticated data reduction.
  • A better visualization solution for most users would be to use the GLM gridded imagery, which Unidata provides in real time through THREDDS, but since those are a pseudo-operational product, they are not on NOAA’s S3. NASA has kindly added them as L3 products to their archive. They are on S3, but behind an EarthData login; if they were more open, I'd love to add them to the GOESArchive to abstract across data repositories.
    • They are in 1 min files, and so also are usually aggregated before use. That’s a pretty trivial operation in xarray.
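
Since the full version is a screenful, here is only a naive sketch of that concatenation, assuming the standard LCFA naming (flash_* variables along a number_of_flashes dimension); it keeps flash-level data only and ignores the group/event tree that glmtools handles properly:

import xarray as xr

def concat_glm_flashes(paths):
    # Keep only flash-level variables so a simple concat stays self-consistent
    pieces = []
    for path in paths:
        ds = xr.open_dataset(path)
        flash_vars = [v for v in ds.data_vars if v.startswith('flash_')]
        pieces.append(ds[flash_vars])
    return xr.concat(pieces, dim='number_of_flashes')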

@deeplycloudy
Collaborator

I had occasion to think about model data, which is also now increasingly on S3. I wanted to document that here for further rumination about API design.

The docs for the GEFS at the link above are somewhat out of date. For yesterday's data, atmos is in the path and 0p50 in the filenames for reasons that are unclear. Earlier years (e.g., 2017) have a different structure.

There are also many file types, but I needed geopotential height, which is in the "popular variables" file type. For one time for one member, the key looks like:
gefs.20231106/00/atmos/pgrb2ap5/gep01.t00z.pgrb2a.0p50.f120

Below is code for downloading and concatenating all the ensemble members for one time. It shows the parameters that need to be templated for this (admittedly narrow) use case.

import os
from datetime import datetime

import boto3
import botocore
import numpy as np
from botocore.client import Config

outpath = '/data/'
S3bucket = 'noaa-gefs-pds'
s3 = boto3.resource('s3', config=Config(signature_version=botocore.UNSIGNED,
                                        user_agent_extra='Resource'))
bucket = s3.Bucket(S3bucket)

s3ymd = datetime(2023, 11, 6).strftime('%Y%m%d')
s3hr = 0  # 06, 12, 18
s3members = np.arange(1, 20 + 1, 1)
s3fhour = 60  # arange(0, 384, 6)

for s3member in s3members:
    S3baserun = f"gefs.{s3ymd}/{s3hr:02d}/atmos/pgrb2ap5/"
    S3grib = f"gep{s3member:02d}.t{s3hr:02d}z.pgrb2a.0p50.f{s3fhour:03d}"
    S3key = S3baserun + S3grib
    outfile = os.path.join(outpath, S3grib)
    with open(outfile, 'wb') as fileobj:
        bucket.download_fileobj(S3key, fileobj)

All the members for one time can then be concatenated with cfgrib as follows:

from glob import glob

import xarray as xr

f060grib = glob(os.path.join(outpath, '*f060'))
gribs = xr.open_mfdataset(f060grib, engine="cfgrib",
                          combine='nested', concat_dim='ens',
                          backend_kwargs={'filter_by_keys':
                                          {'typeOfLevel': 'isobaricInhPa', 'shortName': 'gh'}})
