Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xarray.Dataset should NOT be recognized as a valid tabular data type #3146

Open
seisman opened this issue Mar 28, 2024 · 6 comments
Open

xarray.Dataset should NOT be recognized as a valid tabular data type #3146

seisman opened this issue Mar 28, 2024 · 6 comments
Labels
discussions Need more discussion before taking further actions

Comments

@seisman
Copy link
Member

seisman commented Mar 28, 2024

Currently, xarray.Dataset is recognized as a valid tabular data type in some places. For example:

if hasattr(data, "data_vars") and len(data.data_vars) < 3: # xr.Dataset
raise GMTInvalidInput("data must provide x, y, and z columns.")

pygmt/pygmt/clib/session.py

Lines 1531 to 1547 in dbbc168

>>> from pygmt.helpers import GMTTempFile
>>> import xarray as xr
>>> data = xr.Dataset(
... coords=dict(index=[0, 1, 2]),
... data_vars=dict(
... x=("index", [9, 8, 7]),
... y=("index", [6, 5, 4]),
... z=("index", [3, 2, 1]),
... ),
... )
>>> with Session() as ses:
... with ses.virtualfile_in(check_kind="vector", data=data) as fin:
... # Send the output to a file so that we can read it
... with GMTTempFile() as fout:
... ses.call_module("info", fin + " ->" + fout.name)
... print(fout.read().strip())
<vector memory>: N = 3 <7/9> <4/6> <1/3>

But I think it should NOT be like that. Here are the reasons.

1. xarray.Dataset is more like a collection of xarray.DataArrays, rather than a pandas.DataFrame:

As the official docs says:

A dataset resembles an in-memory representation of a NetCDF file, and consists of variables, coordinates and attributes which together form a self describing dataset.

Dataset implements the mapping interface with keys given by variable names and values given by DataArray objects for each variable name.

xarray.Dataset can represent tabular data, but it's more commonly used as a data structure to hold multiple xarray.DataArray objects.

2. It's unclear what/how data are passed.

Here is an example from the official documentation:

>>> import numpy as np
>>> import xarray as xr
>>> import pandas as pd
>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
>>> precipitation = 10 * np.random.rand(2, 2, 3)
>>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>>> lat = [[42.25, 42.21], [42.63, 42.59]]
>>> time = pd.date_range("2014-09-06", periods=3)
>>> reference_time = pd.Timestamp("2014-09-05")

>>> ds = xr.Dataset(
...     data_vars=dict(
...         temperature=(["x", "y", "time"], temperature),
...         precipitation=(["x", "y", "time"], precipitation),
...     ),
...     coords=dict(
...         lon=(["x", "y"], lon),
...         lat=(["x", "y"], lat),
...         time=time,
...         reference_time=reference_time,
...     ),
...     attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset> Size: 288B
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 96B 29.11 18.2 22.83 ... 16.15 26.63
    precipitation   (x, y, time) float64 96B 5.68 9.256 0.7104 ... 4.615 7.805
Attributes:
    description:  Weather related data.

Each data variable is a xarray.DataArray object:

>>> ds.temperature
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)> Size: 96B
array([[[29.11241877, 18.20125767, 22.82990387],
        [32.92714559, 29.94046392,  7.18177696]],

       [[22.60070734, 13.78914233, 14.17424919],
        [18.28478802, 16.15234857, 26.63418806]]])
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y

Then, in virtualfile_in, a list of multi-dimensional (3-D in this example) xarray.DataArray objects are passed to GMT modules which expects a list of 1-D arrays instead. It works without errors because in Session.put_vector, we pass the pointer of the 2-D array to the GMT C API function, but it likely won't work if the data is not C-contiguous (e.g., a slice of a dataset). So, the actual behavior is not well defined.

So, I think xarray.Dataset should not be recognized as a valid tabular data type, which not only makes more sense but also can simplify our codes/tests.

@seisman seisman added the discussions Need more discussion before taking further actions label Mar 28, 2024
@weiji14
Copy link
Member

weiji14 commented Mar 29, 2024

For context, allowing xr.Dataset inputs was added in #619, and the background reason I wanted to enable this was because there are certain file formats like HDF5 that can be read directly into an xr.Dataset (via xr.open_dataset) but not into a pandas.DataFrame, and I wanted to be able to plot 1D arrays stored as xr.DataArray data variables in an xr.Dataset without needing to convert them to pandas as an intermediate step.

For example, with ICESat-2 ATL06 altimetry data, the data is stored as a HDF5 file, with 1D data variables like height for each altimetry measurement.

import xarray as xr

# https://n5eil01u.ecs.nsidc.org/ATLAS/ATL06.006/2018.10.14/ATL06_20181014001049_02350102_006_02.h5
ds: xr.Dataset = xr.open_dataset(
    "ATL06_20181014001049_02350102_006_02.h5", group="gt2l/land_ice_segments"
)
print(ds)

produces an xr.Dataset like:

<xarray.Dataset> Size: 320kB
Dimensions:                (delta_time: 7103)
Coordinates:
  * delta_time             (delta_time) datetime64[ns] 57kB 2018-10-14T00:17:...
    latitude               (delta_time) float64 57kB ...
    longitude              (delta_time) float64 57kB ...
Data variables:
    atl06_quality_summary  (delta_time) int8 7kB ...
    h_li                   (delta_time) float32 28kB ...
    h_li_sigma             (delta_time) float32 28kB ...
    segment_id             (delta_time) float64 57kB ...
    sigma_geo_h            (delta_time) float32 28kB ...
Attributes:
    Description:  The land_ice_height group contains the primary set of deriv...
    data_rate:    Data within this group are sparse.  Data values are provide...

I can make an along-track plot with PyGMT directly by passing in the xr.Dataset like so:

import pygmt
ds_height = ds[["segment_id", "h_li"]]  # along-track index and height

fig = pygmt.Figure()
fig.plot(
    data=ds_height, frame=["xaf+lsegment_id", "yaf+lHeight (m)"], style="c0.05c"
)
fig.show()

produces:
alongtrack_plot

That said, I can see your point that we currently don't check whether the xr.DataArray variables are 1D arrays or 2D arrays, which results in undefined behaviour:

Then, in virtualfile_in, a list of multi-dimensional (3-D in this example) xarray.DataArray objects are passed to GMT modules which expects a list of 1-D arrays instead. It works without errors because in Session.put_vector, we pass the pointer of the 2-D array to the GMT C API function, but it likely won't work if the data is not C-contiguous (e.g., a slice of a dataset). So, the actual behavior is not well defined.

This sounds similar to the issue mentioned at #3086 where xyz2grd doesn't check if the inputs passed to the data parameter have the expected number of columns. Perhaps we need to have better checks on the input dimensions/shape before passing the data into virtualfile_from_vectors.

@seisman
Copy link
Member Author

seisman commented Mar 29, 2024

In your example, it's unclear to me (from a user's perspective) what data are passed to GMT. Users may expect to pass three columns (time_delta, segment_id and h_li, because segment_id and h_li are a function of time_delta), but actual only two columns (segment_id and h_li) are passed.

@weiji14
Copy link
Member

weiji14 commented Mar 29, 2024

PyGMT is only able to read from an xr.Dataset's data_vars, not from the coords, so delta_time cannot actually be used or plotted in this case, unless that coordinate is turned into a data variable (using .reset_coords()), though that actually isn't possible in this case because delta_time is an index coordinate (denoted with the asterisk *).

We could theoretically apply .reset_coords() on 'longitude' and 'latitude' (since they are not index coords), and pass in 'longitude', 'latitude', 'h_li' to plot3d for example to make a 3D scatterplot.

@seisman
Copy link
Member Author

seisman commented Mar 29, 2024

I didn't mean we should pass the coordinates to GMT. I feel the behavior of passing an xr.Dataset to PyGMT is ambiguous (coordinates are passed or not) and may confuse users. Instead, passing a pandas.DataFrame or ndarray is more easier to understand.

@weiji14
Copy link
Member

weiji14 commented Mar 29, 2024

Maybe we can add a section for xr.Dataset inputs under https://www.pygmt.org/v0.11.0/get_started/04_table_inputs.html, as a follow-up to #2722? Xref #1268.

@seisman
Copy link
Member Author

seisman commented Mar 29, 2024

Maybe we can add a section for xr.Dataset inputs under https://www.pygmt.org/v0.11.0/get_started/04_table_inputs.html, as a follow-up to #2722? Xref #1268.

OK. Also need to update the codes to check if the dataset only has 1-D variables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussions Need more discussion before taking further actions
Projects
None yet
Development

No branches or pull requests

2 participants