
Integrating the proxy into the data viewer - progress update and performance observations and other issues #6

Open
andersy005 opened this issue Jan 25, 2023 · 8 comments


andersy005 commented Jan 25, 2023

@katamartin and I have been making progress on integrating the proxy into the data viewer. Our intention is to use the proxy for on-the-fly rechunking of datasets for visualization purposes. The results look promising, and performance is satisfactory (for small datasets and for datasets hosted in AWS S3) even without caching on the backend:

  • https://storage.googleapis.com/carbonplan-maps/ncview/demo/single_timestep/air_temperature.zarr

[Screenshot, 2023-01-25 10:48]

  • s3://carbonplan-data-viewer/demo/MURSST.zarr (the original chunk size is roughly 1.21 GB)
    [Screenshot, 2023-01-25 12:24]

  • Retrieving data from stores hosted outside of S3 takes a long time (as expected). The following are timings for gs://ldeo-glaciology/bedmachine/bm.zarr (the original chunk size is roughly 35 MB):

[Screenshot, 2023-01-25 11:54]

There's still more work to do to ensure seamless interoperability with existing zarr clients. To illustrate, the code snippet below demonstrates how the proxy can be used via the zarr Python library.

  • instantiate a zarr store via fsspec
In [21]: url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

In [22]: store = zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}})

In [23]: store['.zattrs']
Out[23]: b'{"Author":"Mathieu Morlighem","Conventions":"CF-1.7","Data_citation":"Morlighem M. et al., (2019), Deep glacial troughs and stabilizing ridges unveiled beneath the margins of the Antarctic ice sheet, Nature Geoscience (accepted)","Notes":"Data processed at the Department of Earth System Science, University of California, Irvine","Projection":"Polar Stereographic South (71S,0E)","Title":"BedMachine Antarctica","false_easting":[0.0],"false_northing":[0.0],"grid_mapping_name":"polar_stereographic","ice_density (kg m-3)":[917.0],"inverse_flattening":[298.2794050428205],"latitude_of_projection_origin":[-90.0],"license":"No restrictions on access or use","no_data":[-9999.0],"nx":[13333.0],"ny":[13333.0],"proj4":"+init=epsg:3031","sea_water_density (kg m-3)":[1027.0],"semi_major_axis":[6378273.0],"spacing":[500],"standard_parallel":[-71.0],"straight_vertical_longitude_from_pole":[0.0],"version":"05-Nov-2019 (v1.38)","xmin":[-3333000],"ymax":[3333000]}'
  • open an array within the zarr store
In [25]: arr = zarr.open(store, path='/bed')

In [27]: arr
Out[27]: <zarr.core.Array '/bed' (13333, 13333) float32>
  • retrieve some data
In [28]: arr[:10, :10]
Out[28]: 
array([[-5914.538 , -5919.3955, -5924.865 , -5930.3765, -5935.8853,
        -5941.0205, -5945.997 , -5950.359 , -5954.3784, -5958.045 ],
       [-5910.384 , -5915.8296, -5921.3076, -5927.158 , -5932.7554,
        -5938.29  , -5943.1704, -5947.785 , -5951.881 , -5955.54  ],
       [-5906.422 , -5911.8516, -5917.63  , -5923.6133, -5929.573 ,
        -5935.029 , -5940.271 , -5944.9736, -5949.237 , -5952.898 ],
       [-5902.613 , -5908.093 , -5914.061 , -5920.044 , -5925.9707,
        -5931.7017, -5937.0083, -5941.9688, -5946.243 , -5950.265 ],
       [-5899.054 , -5904.7085, -5910.5   , -5916.532 , -5922.4585,
        -5928.2095, -5933.64  , -5938.608 , -5943.3335, -5947.362 ],
       [-5895.9683, -5901.283 , -5907.2   , -5913.2   , -5919.1235,
        -5924.6836, -5930.077 , -5935.3584, -5940.0796, -5944.544 ],
       [-5892.8423, -5898.332 , -5904.08  , -5910.0503, -5915.838 ,
        -5921.344 , -5926.583 , -5931.785 , -5936.9224, -5941.452 ],
       [-5890.067 , -5895.4604, -5901.1587, -5906.9365, -5912.6836,
        -5918.2617, -5923.3687, -5928.1724, -5933.3447, -5937.538 ],
       [-5887.37  , -5892.716 , -5898.2046, -5903.9224, -5909.691 ,
        -5915.144 , -5920.3755, -5925.193 , -5928.876 , -5933.021 ],
       [-5884.786 , -5890.015 , -5895.455 , -5900.958 , -5906.5366,
        -5912.1353, -5917.4043, -5921.5264, -5925.1343, -5928.5483]],
      dtype=float32)

If we attempt to access a variable whose dimensionality does not match the chunks specified in the HTTP headers, the request fails. For instance, in our store, x is 1D, but the chunks we specified earlier are 10,10, as defined in zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}}):

In [29]: store['x/.zarray']
Out[29]: b'{"chunks":[10,10],"compressor":null,"dtype":"<i4","fill_value":null,"filters":[],"order":"C","shape":[13333],"zarr_format":2}'

In [30]: store['x/0']
---------------------------------------------------------------------------
ClientResponseError                       Traceback (most recent call last)
Cell In[30], line 1
----> 1 store['x/0']

ClientResponseError: 500, message='Internal Server Error', url=URL('http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/x/0')

It would be nice if there were a way to override the headers via fsspec.
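In the meantime, a possible workaround (a sketch only, assuming the single-valued chunks header the proxy currently accepts) is to instantiate a separate store per array, each with a chunks header matching that array's dimensionality:

import zarr

url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

# hypothetical: a dedicated store for the 1D variable 'x', with a 1D chunks
# header ('10' is an illustrative chunk size, not a recommendation)
store_1d = zarr.storage.FSStore(url, client_kwargs={'headers': {'chunks': '10'}})
x = zarr.open(store_1d, path='/x')
x[:10]  # requests chunks shaped (10,) instead of (10, 10)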

I am also CC-ing some folks (@freeman-lab, @norlandrhagen, @jhamman, @rabernat) who might be interested in this, to keep them in the loop on our progress.

andersy005 changed the title from "Integrating the proxy into the data viewer - progress update and performance observations on small datasets and datasets hosted in AWS S3 and other issues" to "Integrating the proxy into the data viewer - progress update and performance observations and other issues" on Jan 25, 2023
@andersy005 commented:

I wanted to make a note that the timings and screenshots above were obtained while running the zarr-proxy via AWS Lambda.

@rabernat commented:

The problem is that a single chunk shape header is being provided to the entire group.

I see two high level ways of resolving this:

Only Proxy Arrays

If we just never attempt to open groups, we don't have this problem. The sequence looks like this:

  1. First open the consolidated metadata to discover all of the variables, shapes, and chunks. (With no chunk header provided, the proxy should pass the chunks through unchanged from the underlying store.) This only needs to be done once, when the data viewer session is being set up.
  2. Based on this information, the client decides what chunking it wants to receive from the proxy for each array.
  3. Now it's time to get arrays. Construct a request for each array with the desired chunk header and open those paths directly.

This is not compatible with how we tend to use Xarray, Zarr, and FSSpec from Python: there we tend to open the group, and thus can't specialize the headers to be different for different arrays. But it would work fine in plain Zarr, and it may be feasible from JavaScript land.
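A minimal sketch of that flow, assuming the proxy and dataset from above (the chunk shapes here are illustrative):

import json
import zarr

base = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

# 1. open the consolidated metadata once, with no chunks header, to discover
#    the native shapes and chunks of every variable
meta_store = zarr.storage.FSStore(base)
zmeta = json.loads(meta_store['.zmetadata'])['metadata']
native_chunks = zmeta['bed/.zarray']['chunks']

# 2-3. open each array directly, with a chunks header sized for that array
bed_store = zarr.storage.FSStore(base, client_kwargs={'headers': {'chunks': '10,10'}})
bed = zarr.open(bed_store, path='/bed')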

Is Xarray support required here?

Scope the header to specific arrays

We could scope the header to specify different chunks for different objects. Instead of

{"chunks": "10,10"}

what about

{
    "chunks":
    {
        "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/bed": "10,10"
    }
}

The steps to set up reading would be as follows; the first two are the same as above.

  1. First open the consolidated metadata to discover all of the variables, shapes, and chunks. (With no chunk header provided, the proxy should pass the chunks through unchanged from the underlying store.) This only needs to be done once, when the data viewer session is being set up.
  2. Based on this information, the client decides what chunking it wants to receive from the proxy for each array.
  3. Now the client constructs this more complex header and re-opens the consolidated metadata with chunks specified for each array within the group.

The tricky bit here is aligning the paths specified in the header with the paths specified in the URL. But this method should also work with Xarray.
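One way this might look from Python (a sketch: since HTTP header values must be strings, the nested mapping would presumably be JSON-encoded; the encoding and key format are assumptions about this proposal, not a shipped API):

import json
import zarr

path = 'storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

# per-array chunking, keyed by the full store path as proposed above
chunks = {f'{path}/bed': '10,10'}

store = zarr.storage.FSStore(
    f'http://127.0.0.1:8000/{path}',
    client_kwargs={'headers': {'chunks': json.dumps(chunks)}},
)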

@andersy005 commented:

Thank you for chiming in, @rabernat. I've implemented a more complex chunks header in #7, and @katamartin and I are wondering whether we need the full path in the header key or whether the keys can be relative to the path:

{
    "chunks": {
        "bed": "10,10",
        "x": 5
    }
}

@andersy005 commented:

After tinkering with the new approach for specifying chunks headers in #7, I'm happy to report that everything seems to be working with both Xarray and Zarr. The key piece here is that we now accept chunks headers on the .zmetadata route. When chunks are specified, we modify the 'chunks' for the specified variables and override the compressor by setting it to None for all variables, since we are sending raw bytes.

In [5]: import xarray as xr, zarr

In [6]: chunks='bed=10,10,mask=20,20'

In [7]: url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

In [8]: store = zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": chunks}})

In [9]: ds = xr.open_dataset(store, engine='zarr', chunks={})

In [10]: ds
Out[10]: 
<xarray.Dataset>
Dimensions:    (y: 13333, x: 13333)
Coordinates:
  * x          (x) int32 -3333000 -3332500 -3332000 ... 3332000 3332500 3333000
  * y          (y) int32 3333000 3332500 3332000 ... -3332000 -3332500 -3333000
Data variables:
    bed        (y, x) float32 dask.array<chunksize=(10, 10), meta=np.ndarray>
    errbed     (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    firn       (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    geoid      (y, x) int16 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    mask       (y, x) int8 dask.array<chunksize=(20, 20), meta=np.ndarray>
    source     (y, x) int8 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    surface    (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    thickness  (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
Attributes: (12/25)
    Author:                                 Mathieu Morlighem
    Conventions:                            CF-1.7
    Data_citation:                          Morlighem M. et al., (2019), Deep...
    Notes:                                  Data processed at the Department ...
    Projection:                             Polar Stereographic South (71S,0E)
    Title:                                  BedMachine Antarctica
    ...                                     ...
    spacing:                                [500]
    standard_parallel:                      [-71.0]
    straight_vertical_longitude_from_pole:  [0.0]
    version:                                05-Nov-2019 (v1.38)
    xmin:                                   [-3333000]
    ymax:                                   [3333000]
In [12]: ds.isel(x=range(2), y=range(2)).bed.compute()
Out[12]: 
<xarray.DataArray 'bed' (y: 2, x: 2)>
array([[-5914.538 , -5919.3955],
       [-5910.384 , -5915.8296]], dtype=float32)
Coordinates:
  * x        (x) int32 -3333000 -3332500
  * y        (y) int32 3333000 3332500
Attributes:
    grid_mapping:   mapping
    long_name:      bed topography
    source:         IBCSO and Mathieu Morlighem
    standard_name:  bedrock_altitude
    units:          meters


rabernat commented Feb 2, 2023

Is there any live demo I could peek at?

@katamartin commented:

@rabernat yeah, you should be able to play around with this: https://756xnpgrdy6om3hgr5wxyxvnzm0ecwcg.lambda-url.us-west-2.on.aws


rabernat commented Feb 2, 2023

I guess I meant an actual map. 😉

@katamartin commented:

Aha, yeah, the link for the map is https://ncview-js.staging.carbonplan.org/, but the app is definitely not stable 😅. We're currently troubleshooting the integration with the newly added validations.
