to_xarray() result is incorrect when one of multi-index levels is not sorted #4186

pzhlobi · 2020-06-27T16:58:29Z

What happened:
to_xarray() sorts multi-index level values and returns result for the sorted values but it doesn't sort levels or expects levels to be sorted resulting in completely incorrect order of data for the displayed coordinates.

df:
            C1  C2
lev1 lev2        
b    foo    0   1
a    foo    2   3 

df.to_xarray():
 <xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 2 0
    C2       (lev1, lev2) int64 3 1

What you expected to happen:
Should account for the order of levels in the original index.

Minimal Complete Verifiable Example:

import pandas as pd
df = pd.concat(
    {
        'b': pd.DataFrame([[0, 1]], index=['foo'], columns=['C1', 'C2']),
        'a': pd.DataFrame([[2, 3]], index=['foo'], columns=['C1', 'C2']),
    }
).rename_axis(['lev1', 'lev2'])
print('df:\n', df, '\n')
print('df.to_xarray():\n', df.to_xarray(), '\n')
print('df.index.levels[0]:\n', df.index.levels[0])

Anything else we need to know?:

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.2 | packaged by conda-forge | (default, Apr 24 2020, 08:20:52)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.7-100.fc30.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.3
xarray: 0.15.1
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.0
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.3
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.19.0
distributed: 2.19.0
matplotlib: 3.2.2
cartopy: None
seaborn: 0.10.1
numbagg: installed
setuptools: 46.3.0.post20200513
pip: 20.1
conda: None
pytest: 5.4.3
IPython: 7.15.0
sphinx: None

The text was updated successfully, but these errors were encountered:

Li9htmare · 2020-06-28T11:37:20Z

It seems the problem here is in Dataset.from_dataframe the dims and coords are created with df.index.levels which is unsorted:

xarray/xarray/core/dataset.py

Lines 4642 to 4643 in 732750a

    
           for dim, lev in zip(dims, idx.levels): 
        
               obj[dim] = (dim, lev)

Then in Dataset._set_numpy_data_from_dataframe, the pd.MultiIndex.from_product and dataframe.reindex unintentionally sort the dataframe by index:

xarray/xarray/core/dataset.py

Lines 4588 to 4589 in 732750a

    
           full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names) 
        
           dataframe = dataframe.reindex(full_idx)

Besides the perf improvement it provides, #4184 seems also have a nice side effect fixing this issue.

shoyer · 2020-06-29T22:28:00Z

Hi @pzhlobi @Li9htmare -- thanks for raising this issue.

Could you kindly clarify for me exactly what behavior you think xarray should do? The results are indeed reordered currently, but as far as I can tell the pairing between coordinators and values remains consistent.

When I test this myself, I see the same behavior (documented in the first post) either with or without my changes from #4184.

Li9htmare · 2020-06-29T23:40:41Z

Hi @shoyer, sorry I got you confused, I should have run your code at first place. You code removes the problematic dataframe.reindex in Dataset._set_numpy_data_from_dataframe, but there is indeed another place causing the problem, which is actually already fixed (but not released yet) by https://github.com/pydata/xarray/pull/3953/files#diff-921db548d18a549f6381818ed08298c9L4607-L4608

Using pzhlobi's example df with xarray 0.15.1 (incorrect result):

df.to_xarray()
<xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 2 0
    C2       (lev1, lev2) int64 3 1

Using the same df with both #3953 and #4184 (correct result):

df.to_xarray()
<xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'a' 'b'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 2 0
    C2       (lev1, lev2) int64 3 1

shoyer · 2020-06-29T23:51:49Z

Thanks for clarifying!

This raises an interesting question for #4184: do we want to keep @fujiisoup's fix from #3953 or not?

If we remove @fujiisoup's fix, then the output we see is:

df.to_xarray()
 <xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 0 2
    C2       (lev1, lev2) int64 1 3

This is also correct -- coordinates match up with values -- but the order of the result is different from what is currently on master.

fujiisoup · 2020-06-30T00:21:43Z

I think the #3953 fixes the case where the multiindex has unused levels.
I had no better idea than #3953, but if it works without #3953, it would be better ;)

shoyer · 2020-06-30T01:03:40Z

I verified that #4184 fixes the tests added for #3953 even after removing the call to remove_unused_levels_categories().

The main question is what behavior we want to do have: Should from_dataframe preserve index levels exactly, or should it sort them first?

I think it's better to not to sort (but of course it's better to sort than to get the wrong order).

fujiisoup · 2020-06-30T01:06:34Z

I agree that it's better not to sort.

shoyer · 2020-06-30T01:14:45Z

Actually, I realize now that this is basically the same issue as #2619

If I remove the use of removed_unused_levels_categories from from_dataframe, then I get the same behavior that we considered a bug in that issue:

In [5]: ds.isel(xy=ds['x'] < 4).to_pandas().to_xarray()
Out[5]:
<xarray.DataArray (x: 8, y: 5)>
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7
  * y        (y) int64 0 1 2 3 4

So maybe it is more consistent to keep calling remove_unused_levels(), which somewhat surprisingly sorts MultiIndex levels.

shoyer · 2020-06-30T01:41:36Z

The sorting seems to be a separate matter, caused by dataframe.set_index() inside our remove_unused_levels_categories function. I think we can remove that, which will fix the sorting issue when removing unused levels. Then the result will be the desired:

df.to_xarray()
 <xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 0 2
    C2       (lev1, lev2) int64 1 3

Li9htmare · 2020-06-30T09:24:13Z

Hi @shoyer , without dataframe.set_index(), dataframe.index can potentially be different from idx returned by remove_unused_levels_categories, this will lead to other problems. One example is the following df:

df = pd.DataFrame(
    {
        'lev1': pd.Series(
            ['b', 'a'], dtype=pd.CategoricalDtype(['c', 'b', 'a'], ordered=True)
        ),
        'lev2': 'foo',
        'C1': [0, 2],
        'C2': [1, 3],
    }
).set_index(['lev1', 'lev2'])

I agree it will be better if we can maintain the order from df to xr.Dataset, but I think we should never work with a copy of idx which is different from dataframe.index, as this will lead to hard to debug problems due to "surprising" behavior pandas does.

shoyer · 2020-06-30T16:29:10Z

@Li9htmare I'm not sure I follow your example. #4184 does remove the use of DataFrame.set_index(), but it also removes any subsequent use of dataframe.index -- it always uses the separately processed index.

Is there something specific that you are worried about going wrong with your latest example? For what it's worth, here's what to_xarray() does with the current version of #4184:

In [4]: df.to_xarray()
Out[4]:
<xarray.Dataset>
Dimensions:  (lev1: 2, lev2: 1)
Coordinates:
  * lev1     (lev1) object 'b' 'a'
  * lev2     (lev2) object 'foo'
Data variables:
    C1       (lev1, lev2) int64 0 2
    C2       (lev1, lev2) int64 1 3

In [5]: df.to_xarray().indexes
Out[5]:
lev1: CategoricalIndex(['b', 'a'], categories=['b', 'a'], ordered=True, name='lev1', dtype='category')
lev2: Index(['foo'], dtype='object', name='lev2')

I think this is doing the right thing already?

Li9htmare · 2020-06-30T19:02:28Z

Sorry @shoyer, I didn't notice you have pushed new commits to #4184 and thought you meant to just remove the DataFrame.set_index. Your latest commits indeed give the correct result. My concern was when another person works on this and didn't get the context that idx might be different from dataframe.index and new bugs could potentially be introduced. Though consider the limited scope where we are maintaining both idx and dataframe, I guess it should be fine.

shoyer · 2020-06-30T20:13:44Z

My concern was when another person works on this and didn't get the context that idx might be different from dataframe.index and new bugs could potentially be introduced

Let me see if I can rewrite the helper functions to avoid passing around a DataFrame

shoyer · 2020-06-30T20:44:00Z

My concern was when another person works on this and didn't get the context that idx might be different from dataframe.index and new bugs could potentially be introduced

Let me see if I can rewrite the helper functions to avoid passing around a DataFrame

This was a good suggestion. Done in 96b544b

Li9htmare · 2020-06-30T21:48:33Z

This intention of variables used constructing the Dataset looks a lot clearer now. Many thanks Stephan!

shoyer mentioned this issue Jun 28, 2020

Improve the speed of from_dataframe with a MultiIndex (by 40x!) #4184

Merged

3 tasks

shoyer added the bug label Jun 30, 2020

shoyer closed this as completed in #4184 Jul 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

to_xarray() result is incorrect when one of multi-index levels is not sorted #4186

to_xarray() result is incorrect when one of multi-index levels is not sorted #4186

pzhlobi commented Jun 27, 2020 •

edited

INSTALLED VERSIONS

Li9htmare commented Jun 28, 2020

shoyer commented Jun 29, 2020

Li9htmare commented Jun 29, 2020 •

edited

shoyer commented Jun 29, 2020

fujiisoup commented Jun 30, 2020

shoyer commented Jun 30, 2020

fujiisoup commented Jun 30, 2020

shoyer commented Jun 30, 2020

shoyer commented Jun 30, 2020

Li9htmare commented Jun 30, 2020

shoyer commented Jun 30, 2020 •

edited

Li9htmare commented Jun 30, 2020

shoyer commented Jun 30, 2020

shoyer commented Jun 30, 2020

Li9htmare commented Jun 30, 2020

to_xarray() result is incorrect when one of multi-index levels is not sorted #4186

to_xarray() result is incorrect when one of multi-index levels is not sorted #4186

Comments

pzhlobi commented Jun 27, 2020 • edited

INSTALLED VERSIONS

Li9htmare commented Jun 28, 2020

shoyer commented Jun 29, 2020

Li9htmare commented Jun 29, 2020 • edited

shoyer commented Jun 29, 2020

fujiisoup commented Jun 30, 2020

shoyer commented Jun 30, 2020

fujiisoup commented Jun 30, 2020

shoyer commented Jun 30, 2020

shoyer commented Jun 30, 2020

Li9htmare commented Jun 30, 2020

shoyer commented Jun 30, 2020 • edited

Li9htmare commented Jun 30, 2020

shoyer commented Jun 30, 2020

shoyer commented Jun 30, 2020

Li9htmare commented Jun 30, 2020

pzhlobi commented Jun 27, 2020 •

edited

Li9htmare commented Jun 29, 2020 •

edited

shoyer commented Jun 30, 2020 •

edited