Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_xarray() result is incorrect when one of multi-index levels is not sorted #4186

Closed
pzhlobi opened this issue Jun 27, 2020 · 15 comments · Fixed by #4184
Closed

to_xarray() result is incorrect when one of multi-index levels is not sorted #4186

pzhlobi opened this issue Jun 27, 2020 · 15 comments · Fixed by #4184
Labels

Comments

@pzhlobi
Copy link

pzhlobi commented Jun 27, 2020

What happened:
to_xarray() sorts multi-index level values and returns result for the sorted values but it doesn't sort levels or expects levels to be sorted resulting in completely incorrect order of data for the displayed coordinates.

df:
            C1  C2
lev1 lev2        
b    foo    0   1
a    foo    2   3 

df.to_xarray():
 <xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 2 0
    C2       (lev1, lev2) int64 3 1 

What you expected to happen:
Should account for the order of levels in the original index.

Minimal Complete Verifiable Example:

import pandas as pd
df = pd.concat(
    {
        'b': pd.DataFrame([[0, 1]], index=['foo'], columns=['C1', 'C2']),
        'a': pd.DataFrame([[2, 3]], index=['foo'], columns=['C1', 'C2']),
    }
).rename_axis(['lev1', 'lev2'])
print('df:\n', df, '\n')
print('df.to_xarray():\n', df.to_xarray(), '\n')
print('df.index.levels[0]:\n', df.index.levels[0])

Anything else we need to know?:

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.7-100.fc30.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3
xarray: 0.15.1
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.0
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.19.0
distributed: 2.19.0
matplotlib: 3.2.2
cartopy: None
seaborn: 0.10.1
numbagg: installed
setuptools: 46.3.0.post20200513
pip: 20.1
conda: None
pytest: 5.4.3
IPython: 7.15.0
sphinx: None

@Li9htmare
Copy link

It seems the problem here is in Dataset.from_dataframe the dims and coords are created with df.index.levels which is unsorted:

xarray/xarray/core/dataset.py

Lines 4642 to 4643 in 732750a

for dim, lev in zip(dims, idx.levels):
obj[dim] = (dim, lev)

Then in Dataset._set_numpy_data_from_dataframe, the pd.MultiIndex.from_product and dataframe.reindex unintentionally sort the dataframe by index:

xarray/xarray/core/dataset.py

Lines 4588 to 4589 in 732750a

full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
dataframe = dataframe.reindex(full_idx)

Besides the perf improvement it provides, #4184 seems also have a nice side effect fixing this issue.

@shoyer
Copy link
Member

shoyer commented Jun 29, 2020

Hi @pzhlobi @Li9htmare -- thanks for raising this issue.

Could you kindly clarify for me exactly what behavior you think xarray should do? The results are indeed reordered currently, but as far as I can tell the pairing between coordinators and values remains consistent.

When I test this myself, I see the same behavior (documented in the first post) either with or without my changes from #4184.

@Li9htmare
Copy link

Li9htmare commented Jun 29, 2020

Hi @shoyer, sorry I got you confused, I should have run your code at first place. You code removes the problematic dataframe.reindex in Dataset._set_numpy_data_from_dataframe, but there is indeed another place causing the problem, which is actually already fixed (but not released yet) by https://github.com/pydata/xarray/pull/3953/files#diff-921db548d18a549f6381818ed08298c9L4607-L4608

Using pzhlobi's example df with xarray 0.15.1 (incorrect result):

df.to_xarray()
<xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 2 0
    C2       (lev1, lev2) int64 3 1

Using the same df with both #3953 and #4184 (correct result):

df.to_xarray()
<xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'a' 'b'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 2 0
    C2       (lev1, lev2) int64 3 1

@shoyer
Copy link
Member

shoyer commented Jun 29, 2020

Thanks for clarifying!

This raises an interesting question for #4184: do we want to keep @fujiisoup's fix from #3953 or not?

If we remove @fujiisoup's fix, then the output we see is:

df.to_xarray()
 <xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 0 2
    C2       (lev1, lev2) int64 1 3

This is also correct -- coordinates match up with values -- but the order of the result is different from what is currently on master.

@fujiisoup
Copy link
Member

I think the #3953 fixes the case where the multiindex has unused levels.
I had no better idea than #3953, but if it works without #3953, it would be better ;)

@shoyer
Copy link
Member

shoyer commented Jun 30, 2020

I verified that #4184 fixes the tests added for #3953 even after removing the call to remove_unused_levels_categories().

The main question is what behavior we want to do have: Should from_dataframe preserve index levels exactly, or should it sort them first?

I think it's better to not to sort (but of course it's better to sort than to get the wrong order).

@fujiisoup
Copy link
Member

I agree that it's better not to sort.

@shoyer
Copy link
Member

shoyer commented Jun 30, 2020

Actually, I realize now that this is basically the same issue as #2619

If I remove the use of removed_unused_levels_categories from from_dataframe, then I get the same behavior that we considered a bug in that issue:

In [5]: ds.isel(xy=ds['x'] < 4).to_pandas().to_xarray()
Out[5]:
<xarray.DataArray (x: 8, y: 5)>
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7
  * y        (y) int64 0 1 2 3 4

So maybe it is more consistent to keep calling remove_unused_levels(), which somewhat surprisingly sorts MultiIndex levels.

@shoyer
Copy link
Member

shoyer commented Jun 30, 2020

The sorting seems to be a separate matter, caused by dataframe.set_index() inside our remove_unused_levels_categories function. I think we can remove that, which will fix the sorting issue when removing unused levels. Then the result will be the desired:

df.to_xarray()
 <xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 0 2
    C2       (lev1, lev2) int64 1 3

@shoyer shoyer added the bug label Jun 30, 2020
@Li9htmare
Copy link

Hi @shoyer , without dataframe.set_index(), dataframe.index can potentially be different from idx returned by remove_unused_levels_categories, this will lead to other problems. One example is the following df:

df = pd.DataFrame(
    {
        'lev1': pd.Series(
            ['b', 'a'], dtype=pd.CategoricalDtype(['c', 'b', 'a'], ordered=True)
        ),
        'lev2': 'foo',
        'C1': [0, 2],
        'C2': [1, 3],
    }
).set_index(['lev1', 'lev2'])

I agree it will be better if we can maintain the order from df to xr.Dataset, but I think we should never work with a copy of idx which is different from dataframe.index, as this will lead to hard to debug problems due to "surprising" behavior pandas does.

@shoyer
Copy link
Member

shoyer commented Jun 30, 2020

@Li9htmare I'm not sure I follow your example. #4184 does remove the use of DataFrame.set_index(), but it also removes any subsequent use of dataframe.index -- it always uses the separately processed index.

Is there something specific that you are worried about going wrong with your latest example? For what it's worth, here's what to_xarray() does with the current version of #4184:

In [4]: df.to_xarray()
Out[4]:
<xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 0 2
    C2       (lev1, lev2) int64 1 3

In [5]: df.to_xarray().indexes
Out[5]:
lev1: CategoricalIndex(['b', 'a'], categories=['b', 'a'], ordered=True, name='lev1', dtype='category')
lev2: Index(['foo'], dtype='object', name='lev2')

I think this is doing the right thing already?

@Li9htmare
Copy link

Sorry @shoyer, I didn't notice you have pushed new commits to #4184 and thought you meant to just remove the DataFrame.set_index. Your latest commits indeed give the correct result. My concern was when another person works on this and didn't get the context that idx might be different from dataframe.index and new bugs could potentially be introduced. Though consider the limited scope where we are maintaining both idx and dataframe, I guess it should be fine.

@shoyer
Copy link
Member

shoyer commented Jun 30, 2020

My concern was when another person works on this and didn't get the context that idx might be different from dataframe.index and new bugs could potentially be introduced

Let me see if I can rewrite the helper functions to avoid passing around a DataFrame

@shoyer
Copy link
Member

shoyer commented Jun 30, 2020

My concern was when another person works on this and didn't get the context that idx might be different from dataframe.index and new bugs could potentially be introduced

Let me see if I can rewrite the helper functions to avoid passing around a DataFrame

This was a good suggestion. Done in 96b544b

@Li9htmare
Copy link

This intention of variables used constructing the Dataset looks a lot clearer now. Many thanks Stephan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants