
Add function to compute zonal mean #33

Open
mnlevy1981 opened this issue Dec 20, 2019 · 27 comments · May be fixed by #82
Labels
feature New feature or request

Comments

@mnlevy1981
Collaborator

From talking to @klindsay28, this sounds like a large undertaking. There's a Fortran program in /glade/u/home/klindsay/bin/zon_avg/, and his current recommendation is to write a python script that takes in an xarray dataset, writes it to netCDF, uses an os.system call to run zon_avg, reads the resulting netCDF file in as another xarray dataset, and returns that second dataset.

Using the on hold label until we have the time to dedicate to a proper implementation.

@mnlevy1981 mnlevy1981 added feature New feature or request on hold This will be worked on when situation changes labels Dec 20, 2019
@andersy005
Contributor

@mnlevy1981
Copy link
Collaborator Author

I've followed @klindsay28's recommendation: in the cesm2-marbl repo I created a python wrapper around the [Fortran-based] executable. Below is the code I linked to above, in case inline is easier to read / copy than the GitHub rendering:

import os
import subprocess
import tempfile
import xarray as xr

def zonal_mean_via_fortran(ds, var, grid=None):
    """
    Write ds to a temporary netCDF file, compute zonal mean for
    a given variable based on Keith L's fortran program, read
    resulting netcdf file, and return the new xarray dataset
    """

    # xarray doesn't require the ".nc" suffix, but it's useful to know what the file is for
    ds_in_file = tempfile.NamedTemporaryFile(suffix='.nc')
    ds_out_file = tempfile.NamedTemporaryFile(suffix='.nc')
    ds.to_netcdf(ds_in_file.name)

    # Set up location of the zonal average executable
    za_exe = os.path.join(os.path.sep,
                          'glade',
                          'u',
                          'home',
                          'klindsay',
                          'bin',
                          'zon_avg',
                          'za')
    if grid:
        # Point to a file that contains all necessary grid variables
        # I think that is KMT, TLAT, TLONG, TAREA, ULAT, ULONG, UAREA, and REGION_MASK
        if grid == 'gx1v7':
            grid_file = os.path.join(os.path.sep,
                                     'glade',
                                     'collections',
                                     'cdg',
                                     'timeseries-cmip6',
                                     'b.e21.B1850.f09_g17.CMIP6-piControl.001',
                                     'ocn',
                                     'proc',
                                     'tseries',
                                     'month_1',
                                     'b.e21.B1850.f09_g17.CMIP6-piControl.001.pop.h.NO3.000101-009912.nc')
        else:
            print(f'WARNING: no grid file for {grid}, using xarray dataset for grid vars')
            grid_file = ds_in_file.name
    else:
        # Assume xarray dataset contains all needed fields
        grid_file = ds_in_file.name

    # Set up the call to za with correct options
    za_call = [za_exe,
               '-v', var,
               '-grid_file', grid_file,
               '-kmt_file', grid_file,
               '-O', '-o', ds_out_file.name, # -O overwrites existing file, -o gives file name
               ds_in_file.name]

    # Use subprocess to call za, allows us to capture stdout and print it
    proc = subprocess.Popen(za_call, stdout=subprocess.PIPE)
    (out, err) = proc.communicate()
    if not out:
        # Read in the newly-generated file
        print('za ran successfully, reading netcdf output')
        ds_out = xr.open_dataset(ds_out_file.name)
    else:
        print(f'za reported an error:\n{out.decode("utf-8")}')

    # Delete the temporary files and return the new xarray dataset
    ds_in_file.close()
    ds_out_file.close()
    if not out:
        return ds_out
    return None

@kristenkrumhardt you may want to use this python wrapper in your notebooks, and I'm happy to help you spin up with it.

@dcherian
Contributor

dcherian commented Oct 1, 2020

cc @erogluorhan

@erogluorhan

We will be reading through this issue with @anissa111

@mnlevy1981
Collaborator Author

Thanks @erogluorhan! One thing for you and @anissa111 to keep in mind is that the existing Fortran version computes zonal means for a variety of regions, and it is very important to keep that functionality in the python version. I'm happy to chat with both of you and walk you through the Fortran version / the python wrapper from #33 (comment) if that would be helpful.

@matt-long
Collaborator

In addition to using a region mask, @klindsay28 has a version of za that can compute time-varying thickness-weighted zonal means, which is important for working with data on alternative vertical coordinates.

Ultimately, I think we want to have an implementation that is not POP-specific. However, I suggest implementing in pop_tools as the first step.

@klindsay28 can provide the latest version of the za code, which includes the thickness-weighting capability.

We should discuss the plan of attack once you've decided whether to proceed.

@klindsay28
Collaborator

I thought that I had a version of za with thickness-weighting capability, but I am unable to locate it. The most recent development version of the za code that I have is on CISL machines in the directory /glade/u/home/klindsay/bin/zon_avg.dev. A feature that this version has compared to the non-development version is that it doesn't rely on POP's KMT variable for determining when variables being averaged have valid values. It instead relies on a comparison of the field's value to a fill value, typically read from variable.attrs["_FillValue"]. This is a necessary step towards being able to compute zonal averages on alternative vertical coordinates.

Note that this logic for determining valid variable values will need to change when converting to the xarray framework, where _FillValue values are typically converted to NaN.
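To make that concrete, here is a minimal sketch of what the validity test could look like on the xarray side (file and variable names are illustrative, not from the existing code):

import xarray as xr

# xarray's default decoding replaces _FillValue with NaN, so validity becomes a notnull() test
ds = xr.open_dataset('history_file.nc')   # hypothetical POP history file
valid = ds['TEMP'].notnull()              # True where the field has data

# Any averaging should only accumulate numerator and denominator terms where `valid` is True,
# rather than comparing against variable.attrs['_FillValue'] as the Fortran code does.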

I'm happy to work with software engineers that are pursuing converting the Fortran code to python.

@klindsay28
Collaborator

A top-level question for this functionality is "What does the API look like?".

Some possibilities include:

  1. A function whose arguments include: the dataset of variables to be averaged, an optional POP grid, an optional region mask, optional thickness weighting, and the other arguments from the existing Fortran program
  2. Change grid.py so that the grid is a class, and zonal mean computation is a method of this class. Maybe layer thickness could be a component of this class, so handling thickness becomes transparent.
  3. Open for suggestions...

To me, 1. seems like just converting the existing Fortran program to python. However, I think we have an opportunity to have a more pythonic, and extensible, solution if we don't do that. That's why I'm throwing 2. into the mix. Please think of it as an initial straw man proposal. I'm sure there are other approaches that would be superior in the long run.
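For what option 2 might look like in practice, here is a rough sketch; the class and method names are placeholders, not an agreed-upon API:

import xarray as xr

class POPGrid:
    """Straw-man grid class; zonal mean computation would live here as a method."""

    def __init__(self, grid_name, dz=None):
        self.grid_name = grid_name
        self.dz = dz  # optional layer thickness, so thickness weighting can be handled transparently

    def zonal_mean(self, ds, region_mask=None):
        """Return zonal means of the variables in ds, optionally per region."""
        raise NotImplementedError

# usage sketch:
# grid = POPGrid('gx1v7')
# ds_za = grid.zonal_mean(ds)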

@mnlevy1981 mnlevy1981 removed the on hold This will be worked on when situation changes label Oct 19, 2020
@mnlevy1981
Collaborator Author

Per a meeting last week, it sounds like @anissa111 will be working on this (I can't assign the ticket directly to her, she probably needs to be added to some list or other in the github settings for this repository)

@andersy005
Contributor

I can't assign the ticket directly to her, she probably needs to be added to some list or other in the github settings for this repository

@mnlevy1981, this is now fixed. You should be able to assign anyone from the Geocat team to issues/PRs...

@klindsay28
Collaborator

An important detail in 'translate to python' is what netCDF API to use.

The Fortran program zon_avg uses wrappers around the Fortran 77 netCDF API. The primary purpose of the wrappers is to handle errors away from the main source of zon_avg, in an attempt to keep the zon_avg code less cluttered. If you disregard the wrappers, zon_avg is basically using the Fortran 77 netCDF API. The closest thing to this in python is the netCDF4 interface to netCDF. I suggest NOT doing this.

Since we're considering eventually using xgcm to represent the ocean model's grid, I propose that we use the xarray interface to netCDF, using the Dataset & DataArray data structures of xarray.

Comments? @matt-long, @dcherian, @mnlevy1981

@matt-long
Collaborator

Absolutely! We should use xarray!

@klindsay28
Collaborator

I've placed notes for a walk-through of the existing zon_avg program into this google slides document.

@dcherian
Contributor

dcherian commented Feb 8, 2021

Calculating a zonal mean amounts to using a mask (or clipping region) to determine a set of weights that can then be passed to Dataset.weighted(weights).mean().

I wonder if we could use xoak to determine those weights. cc @benbovy

related: xarray-contrib/xoak#34
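As a minimal sketch of the weighted-mean idea for a single region or latitude band (array shapes and names here are illustrative):

import numpy as np
import xarray as xr

temp = xr.DataArray(np.random.rand(384, 320), dims=['nlat', 'nlon'])
# cell areas, zeroed out everywhere outside the band/region of interest
weights = xr.DataArray(np.random.rand(384, 320), dims=['nlat', 'nlon'])

band_mean = temp.weighted(weights).mean(dim=['nlat', 'nlon'])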

@klindsay28
Collaborator

@dcherian, the existing Fortran program computes zonal averages in latitude bands. It does this by computing the area of the intersection between each model cell and each latitude band, and then performing an intersection area weighted average in each latitude band. Values from individual model cells have contributions in all latitude bands that the cell intersects. In particular, values from cells contribute to the zonal average in multiple latitude bands. I don't see how to represent this one-to-many mapping with xarray's weighted framework.

If you change the mathematical definition of what you mean by zonal mean then perhaps you can implement it with xarray's weighted object framework.
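Written out, the band average described above is (my notation, not taken from the Fortran source):

\bar{x}_b = \frac{\sum_{c} A(c \cap b)\, x_c}{\sum_{c} A(c \cap b)}

where the sum runs over every model cell c that intersects latitude band b, and A(c \cap b) is the intersection area; because a cell can intersect several bands, each x_c can contribute to several band averages.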

@dcherian
Contributor

dcherian commented Feb 8, 2021

Sorry I should've been clearer, I was visualizing a single latitude band when writing my comment.

Is this pseudo-code right?

zonal_mean_field[time, z, lat_bands] = (
    xr.dot(
        field[time, z, nlat, nlon],
        intersection_area_weights[nlat, nlon, lat_bands],
    )
    .mean(["nlat", "nlon"])
)

We need to efficiently generate intersection_area_weights and I was wondering whether the tree-based data structures in xoak might help as an alternative to the fortran clip_sphere_area. I imagine this is a solved problem in the GIS domain. We could translate the fortran to numba too...

EDIT: changed from weighted to dot because I'm not sure if that weighted call would work.

@benbovy

benbovy commented Feb 9, 2021

We need to efficiently generate intersection_area_weights and I was wondering whether the tree-based data structures in xoak might help as an alternative to the fortran clip_sphere_area.

I'm not sure yet how xoak could support this specific problem with an API that could be generalized to several indexes and/or geometries, e.g., some equivalent to geopandas' overlay but working with xarray objects and possibly with multiple tree-based index wrappers. Also, relying on more generic features (one of xoak's goals), like building the new geometry of the intersected cells, may not be optimal for this problem, where only the area of the clipped cells is needed.

That said, if you look for an alternative to the fortran clip_sphere_area, I think s2geometry has everything to perform the same operation efficiently (or even very efficiently but approximately). It's not more convenient, though, since you would need to write and build python wrappers. It's not that hard, I did that for indexing points in https://github.com/benbovy/pys2index.

@klindsay28
Collaborator

Perhaps I'm just being dense, but I'm not following what your pseudocode does. Are the nlat and nlon indices, or just labels for what would be : in real code? Just reading about xr.dot, it looks like it sums over common dimensions. If so, then nlat and nlon get summed over. What does the mean do then?

One concern I have with what you wrote is memory usage. If intersection_area_weights is dimensioned nlat x nlon x lat_bands, it will be huge. In practice intersection_area_weights is sparse; there are only a handful of nonzero entries for each (nlat, nlon) pair. But I don't know how to represent that sparsity in xarray.

One thing that's important to get right is handling how the land-sea mask changes over depth when you compute the mean. If the weights are computed for the surface land-sea mask, then you need to be sure to omit terms in the denominator of the mean when the corresponding terms in the numerator are FillValue/NaN. I'm not sure if your pseudocode handles that (because I'm not certain what it does).

A totally different approach is to use ESMF (perhaps xESMF) to regrid to a dst lat-lon grid using a conservative area-weighted average, and take zonal means there. I think that going the ESMF route would handle the sparsity of the weights. In order for this to work properly at land-sea boundaries, you need to get from ESMF the fraction of each dst cell that is covered, and incorporate that as a weight in the zonal mean computation on the dst grid. I'm not familiar enough with the ESMF regridding API to know if this dst grid coverage is available.

@andersy005
Contributor

andersy005 commented Feb 9, 2021

One concern I have with what you wrote is memory usage. If intersection_area_weights is dimensioned nlat x nlon x lat_bands, it will be huge. In practice intersection_area_weights is sparse; there are only a handful of nonzero entries for each (nlat, nlon) pair. But I don't know how to represent that sparsity in xarray.

xarray supports sparse arrays via the sparse package:

In [1]: import xarray as xr
In [2]: import sparse

In [3]: coords = [[0, 1, 2, 3, 4],
   ...:           [0, 1, 2, 3, 4]]

In [4]: data = [10, 20, 30, 40, 50]

In [5]: s = sparse.COO(coords, data, shape=(5, 5))

In [7]: arr = xr.DataArray(s, dims=['lat', 'lon'])

In [8]: arr
Out[8]: 
<xarray.DataArray (lat: 5, lon: 5)>
<COO: shape=(5, 5), dtype=int64, nnz=5, fill_value=0>
Dimensions without coordinates: lat, lon


In [18]: arr.data.todense()
Out[18]: 
array([[10,  0,  0,  0,  0],
       [ 0, 20,  0,  0,  0],
       [ 0,  0, 30,  0,  0],
       [ 0,  0,  0, 40,  0],
       [ 0,  0,  0,  0, 50]])

The sparse package lacks support for some operations that are common with numpy though. For instance, as of today, the dot product doesn't work (see pydata/sparse#31):

Details
In [19]: xr.dot(arr, arr)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-ad3fc918ce5d> in <module>
----> 1 xr.dot(arr, arr)

~/opt/miniconda3/envs/playground/lib/python3.8/site-packages/xarray/core/computation.py in dot(dims, *arrays, **kwargs)
   1477     # to construct a partial function for apply_ufunc to work.
   1478     func = functools.partial(duck_array_ops.einsum, subscripts, **kwargs)
-> 1479     result = apply_ufunc(
   1480         func,
   1481         *arrays,

~/opt/miniconda3/envs/playground/lib/python3.8/site-packages/xarray/core/computation.py in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, *args)
   1126     # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1127     elif any(isinstance(a, DataArray) for a in args):
-> 1128         return apply_dataarray_vfunc(
   1129             variables_vfunc,
   1130             *args,

~/opt/miniconda3/envs/playground/lib/python3.8/site-packages/xarray/core/computation.py in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    269 
    270     data_vars = [getattr(a, "variable", a) for a in args]
--> 271     result_var = func(*data_vars)
    272 
    273     if signature.num_outputs > 1:

~/opt/miniconda3/envs/playground/lib/python3.8/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
    722             )
    723 
--> 724     result_data = func(*input_data)
    725 
    726     if signature.num_outputs == 1:

~/opt/miniconda3/envs/playground/lib/python3.8/site-packages/xarray/core/duck_array_ops.py in f(*args, **kwargs)
     54             else:
     55                 wrapped = getattr(eager_module, name)
---> 56             return wrapped(*args, **kwargs)
     57 
     58     else:

<__array_function__ internals> in einsum(*args, **kwargs)

TypeError: no implementation found for 'numpy.einsum' on types that implement __array_function__: [<class 'sparse._coo.core.COO'>]

@dcherian
Contributor

dcherian commented Feb 9, 2021

re: nlon, nlat: if field is TEMP then it has dimensions [time, z_t, nlat, nlon].

Ah sorry (again!) the corrected xr.dot call is probably

zonal_mean_field[time, z, lat_bands] = xr.dot(
    field[time, z, nlat, nlon],
    intersection_area_weights[z, nlat, nlon, lat_bands],
    dims=["nlat", "nlon"],
)

(this is what .weighted does under the hood), as long as intersection_area_weights is normalized properly using the right denominator that accounts for the land-sea mask. I hadn't appreciated the complexity of a depth-dependent land-sea mask (thanks); that does make the weights pretty huge.

I was thinking of intersection_area_weights being a pydata/sparse array, but the lack of einsum support makes that impossible for now (though in some cases these not-implemented issues can be worked around by having dask wrap sparse).

The conservative regridding approach did come to mind. xgcm has implemented it for vertical regridding with numba. I wonder if they're open to adding the 2D version of that functionality.

I have only used xesmf for simple regridding problems, so I don't know if it handles everything that your code does.

I guess there are 4 possible approaches:

  1. FORTRAN wrapper in Mike's comment
  2. Modify FORTRAN code (or make python+numba port) to save a weights file then use xr.dot or a simpler multiply+sum
  3. Use xESMF to save a weights file then use xr.dot
  4. Generate weights online while computing average (xgcm's vertical regridding does this if I am remembering correctly)

@klindsay28 thanks for taking the time to reply here and correct my mistakes.
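To illustrate what approaches 2/3 above could look like once a weights file exists, here is a rough sketch; the file names and the layout of the weights variable are hypothetical:

import xarray as xr

ds = xr.open_dataset('pop_history.nc')                  # hypothetical input file
weights = xr.open_dataset('za_weights.nc')['weights']   # hypothetical, dims (lat_bands, nlat, nlon)

# multiply-and-sum version of the zonal mean; xarray's sum skips NaN by default,
# and the denominator drops land/NaN points so the mask is handled per level
numer = (ds['TEMP'] * weights).sum(dim=['nlat', 'nlon'])
denom = (weights * ds['TEMP'].notnull()).sum(dim=['nlat', 'nlon'])
zonal_mean = numer / denom                              # dims (time, z_t, lat_bands)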

@klindsay28
Collaborator

Another issue to keep in mind is the extensibility of this tool to MOM6. If the tool is being applied to MOM6 output that is on a z grid, then I think using FillValue/NaN for handling subsurface layers, as mentioned above, works appropriately. However, if the output is not on a z grid (e.g., the model's native grid, which might be hybrid, or isopycnal layers), then MOM6 represents non-existent layers with a very small layer thickness (I think). This highlights the importance of the capability to incorporate layer thickness into the zonal average computation.
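In the notation used earlier, a thickness-weighted band average would look something like this (again my formulation, not taken from the existing code):

\bar{x}_b(z, t) = \frac{\sum_{c} h_c(z, t)\, A(c \cap b)\, x_c(z, t)}{\sum_{c} h_c(z, t)\, A(c \cap b)}

where h_c is the (possibly time-varying) layer thickness; terms where x_c is missing, or where h_c is the vanishingly small placeholder thickness, would be omitted from both sums.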

@klindsay28
Collaborator

@mgrover1 and I are talking about this issue. It sounds like it might be feasible to go straight to an implementation based on xESMF (Max has experience with the xESMF API). One challenge is getting grid cell corners to xESMF. pop-tools has support for various POP grids, but it is not clear how to automatically determine your grid (e.g., "gx1v7") from a POP history file. How does one do this reliably in practice?

pinging @matt-long and @dcherian

@dcherian
Contributor

dcherian commented Mar 4, 2021

based on xESMF

Does this mean conservative regridding and then averaging?

pop-tools has support for various POP grids, but it is not clear how to automatically determine your grid (e.g, "gx1v7") from a POP history file. How does one do this reliably in practice?

I have no idea! I do remember that there was a function or PR that decided a grid name based on number of points (length of nlat,nlon) but that is not robust at all.

@matt-long
Collaborator

The pop-tools get_grid function can return a SCRIP-format dataset that includes the grid corners, so the corners are available.

This doesn't solve the problem of how to determine the grid from a history file, but perhaps we could simply require the user to input this information.
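A quick sketch of pulling the corners this way; I'm writing the grid-name string and the scrip keyword from memory, so check the pop-tools docs for the exact spelling:

import pop_tools

grid = pop_tools.get_grid('POP_gx1v7', scrip=True)  # SCRIP-format dataset

# corner coordinates, named per the SCRIP convention, to hand to the regridder
corners_lat = grid['grid_corner_lat']
corners_lon = grid['grid_corner_lon']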

@klindsay28
Collaborator

based on xESMF

Does this mean conservative regridding and then averaging?

Yes. I briefly described this approach at the end of #33 (comment).

@mgrover1

mgrover1 commented Mar 5, 2021

As of now, I have a working prototype which utilizes xESMF to regrid to a regular lat-lon grid, then averages across longitude. The results are very close to the previous implementation, although there are small differences since a different list of latitudes is used. Would it be acceptable to let the user specify how fine the target spatial resolution is (e.g., 0.25 degree)? I attached an example of the zonal average across a single ocean basin.
[attached image: basin_3, zonal average for a single ocean basin]

@matt-long
Collaborator

How is the depth dimension handled? One approach is to use remapping weights generated at the surface, then remap a field of ones at each level and renormalize the remapped field by this sum. I do that here:
https://github.com/marbl-ecosys/forcing-Fe-sedflux/blob/96f2a565300ba9d43b373c6ba8537e20d30e4ca1/notebooks/regrid_tools/core.py#L293
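Here is a rough sketch of that renormalization idea with xESMF (assuming src_grid and dst_grid are datasets with the center and corner coordinates that xESMF's conservative method expects; variable names are illustrative, not from the linked code):

import xarray as xr
import xesmf as xe

regridder = xe.Regridder(src_grid, dst_grid, method='conservative')

# 1 over ocean, 0 over land, built per level from the field's own missing values
ones = xr.ones_like(ds['TEMP']).where(ds['TEMP'].notnull(), 0.0)

field_dst = regridder(ds['TEMP'].fillna(0.0))  # regrid with land zeroed out
frac_dst = regridder(ones)                     # regridded ocean fraction at each level

renormalized = field_dst / frac_dst.where(frac_dst > 0)  # avoid dividing by zero over all-land cells
zonal_mean = renormalized.mean(dim='lon')                # zonal average on the regular dst grid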

@andersy005 andersy005 linked a pull request Mar 18, 2021 that will close this issue