
Stack + to_array before to_xarray is much faster than a simple to_xarray #2459

Closed
max-sixty opened this issue Oct 2, 2018 · 13 comments · Fixed by #4184

Comments

@max-sixty
Collaborator

I was seeing some slow performance from to_xarray() on a MultiIndexed series, and found that unstacking one of the dimensions before running to_xarray(), and then restacking with to_array(), was ~30x faster. The time difference persists at larger data sizes.

To reproduce:

Create a series with a MultiIndex, ensuring the MultiIndex isn't a simple product:

import numpy as np
import pandas as pd

s = pd.Series(
    np.random.rand(100000),
    index=pd.MultiIndex.from_product([
        list('abcdefhijk'),
        list('abcdefhijk'),
        pd.DatetimeIndex(start='2000-01-01', periods=1000, freq='B'),
    ]))

cropped = s[::3]
cropped.index = pd.MultiIndex.from_tuples(cropped.index, names=list('xyz'))

cropped.head()

# x  y  z         
# a  a  2000-01-03    0.993989
#      2000-01-06    0.850518
#      2000-01-11    0.068944
#      2000-01-14    0.237197
#      2000-01-19    0.784254
# dtype: float64

Two approaches for getting this into xarray:
1 - Simple .to_xarray():

current_version = cropped.to_xarray()

<xarray.DataArray (x: 10, y: 10, z: 1000)>
array([[[0.993989,      nan, ...,      nan, 0.721663],
        [     nan,      nan, ..., 0.58224 ,      nan],
        ...,
        [     nan, 0.369382, ...,      nan,      nan],
        [0.98558 ,      nan, ...,      nan, 0.403732]],

       [[     nan,      nan, ..., 0.493711,      nan],
        [     nan, 0.126761, ...,      nan,      nan],
        ...,
        [0.976758,      nan, ...,      nan, 0.816612],
        [     nan,      nan, ..., 0.982128,      nan]],

       ...,

       [[     nan, 0.971525, ...,      nan,      nan],
        [0.146774,      nan, ...,      nan, 0.419806],
        ...,
        [     nan,      nan, ..., 0.700764,      nan],
        [     nan, 0.502058, ...,      nan,      nan]],

       [[0.246768,      nan, ...,      nan, 0.079266],
        [     nan,      nan, ..., 0.802297,      nan],
        ...,
        [     nan, 0.636698, ...,      nan,      nan],
        [0.025195,      nan, ...,      nan, 0.629305]]])
Coordinates:
  * x        (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * y        (y) object 'a' 'b' 'c' 'd' 'e' 'f' 'h' 'i' 'j' 'k'
  * z        (z) datetime64[ns] 2000-01-03 2000-01-04 ... 2003-10-30 2003-10-31

This takes 536 ms

2 - unstack in pandas first, and then use to_array to do the equivalent of a restack:

proposed_version = (
    cropped
    .unstack('y')
    .to_xarray()
    .to_array('y')
)

This takes 17.3 ms

To confirm these are identical:

proposed_version_adj = (
    proposed_version
    .assign_coords(y=proposed_version['y'].astype(object))
    .transpose(*current_version.dims)
)

proposed_version_adj.equals(current_version)
# True

Problem description

A default operation is much slower than a (potentially) equivalent operation that's not the default.

I need to look more at what's causing the issue. I think it's down to the .reindex(full_idx) step (sketched below), but I'm unclear why the alternative route is so much faster, and whether there's a fix we can make so the default path is fast.
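For reference, a minimal paraphrase of the step in question, based on the from_dataframe source shown in the profile further down (variable names here are illustrative):

idx = cropped.index
# build the full cartesian product of all index levels ...
full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
# ... and reindex the sparse series onto it; this is where nearly all of the runtime goes
dense = cropped.reindex(full_idx)
data = np.asarray(dense).reshape([lev.size for lev in idx.levels])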

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.93-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.utf8
LOCALE: None.None

xarray: 0.10.9
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: 0.16.0
seaborn: 0.9.0
setuptools: 40.4.3
pip: 18.0
conda: None
pytest: 3.8.1
IPython: 5.8.0
sphinx: None

@shoyer
Member

shoyer commented Oct 2, 2018

Here are the top entries I see with %prun cropped.to_xarray():

         308597 function calls (308454 primitive calls) in 0.651 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.255    0.000    0.275    0.000 datetimes.py:606(<lambda>)
        1    0.165    0.165    0.165    0.165 {built-in method pandas._libs.lib.is_datetime_with_singletz_array}
        1    0.071    0.071    0.634    0.634 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
        1    0.054    0.054    0.054    0.054 {pandas._libs.lib.fast_zip}
        1    0.029    0.029    0.304    0.304 {pandas._libs.lib.map_infer}
   100009    0.011    0.000    0.011    0.000 datetimelike.py:232(freq)
        9    0.010    0.001    0.010    0.001 {pandas._libs.lib.infer_dtype}
   100021    0.010    0.000    0.010    0.000 datetimes.py:684(tz)
        1    0.009    0.009    0.009    0.009 {built-in method pandas._libs.tslib.array_to_datetime}
        2    0.008    0.004    0.008    0.004 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
        1    0.008    0.008    0.651    0.651 dataarray.py:1827(from_series)
    66/65    0.005    0.000    0.005    0.000 {built-in method numpy.core.multiarray.array}
    24/22    0.001    0.000    0.362    0.016 base.py:677(_values)
       17    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.empty}
    19/18    0.001    0.000    0.189    0.010 base.py:4914(_ensure_index)
        5    0.001    0.000    0.001    0.000 {method 'repeat' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {method 'tolist' of 'numpy.ndarray' objects}
        2    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_object_object}
        4    0.001    0.000    0.001    0.000 {pandas._libs.algos.take_1d_int64_int64}
     1846    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
       16    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.001    0.001    0.001    0.001 {method 'get_indexer' of 'pandas._libs.index.DatetimeEngine' objects}

There seems to be a suspiciously large amount of effort applying a function to individual datetime objects.

@max-sixty
Collaborator Author

When I stepped through, it was by-and-large all taken up by https://github.com/pydata/xarray/blob/master/xarray/core/dataset.py#L3121. That's where the boxing & unboxing of the datetimes is from.

I haven't yet discovered how the alternative path avoids this work. If anyone has priors please lmk!

@max-sixty
Collaborator Author

It's 3x faster to unstack & stack all-but-one level vs. reindexing over a filled-out index (and I think it always produces the same result).

Our current code takes the slow path.

I could make that change, but it strongly feels like I don't yet understand the root cause. I haven't spent much time with the reshaping code - lmk if anyone has ideas.

idx = cropped.index
full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)

reindexed = cropped.reindex(full_idx)

%timeit reindexed = cropped.reindex(full_idx)
# 1 loop, best of 3: 278 ms per loop

%%timeit
stack_unstack = (
    cropped
    .unstack(list('yz'))
    .stack(list('yz'), dropna=False)
)
# 10 loops, best of 3: 80.8 ms per loop

stack_unstack.equals(reindexed)
# True

@max-sixty
Collaborator Author

max-sixty commented Oct 3, 2018

My working hypothesis is that pandas has a set of fast routines in C, such that it can stack without reindexing to the full index. Those routines only work in 1-2 dimensions.

So without some hackery (i.e. converting multi-dimensional arrays down to pandas' dimensionality and back), the current implementation is reasonable*. The next step would be to write our own routines that can operate on multiple dimensions (numbagg!).

Is that consistent with others' views, particularly those who know this area well?

* One small fix that would improve the performance of series.to_xarray() only is the change described in the comment above. Lmk if you think it's worth making that change.

@shoyer
Member

shoyer commented Oct 3, 2018

The vast majority of the time in xarray's current implementation seems to be spent in DataFrame.reindex(), but I see no reason why this operation needs to be so slow. I expect we could probably optimize this significantly on the pandas side.

See these results from line-profiler:

In [8]: %lprun -f xarray.Dataset.from_dataframe cropped.to_xarray()
Timer unit: 1e-06 s

Total time: 0.727191 s
File: /Users/shoyer/dev/xarray/xarray/core/dataset.py
Function: from_dataframe at line 3094

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  3094                                               @classmethod
  3095                                               def from_dataframe(cls, dataframe):
  3096                                                   """Convert a pandas.DataFrame into an xarray.Dataset
  3097
  3098                                                   Each column will be converted into an independent variable in the
  3099                                                   Dataset. If the dataframe's index is a MultiIndex, it will be expanded
  3100                                                   into a tensor product of one-dimensional indices (filling in missing
  3101                                                   values with NaN). This method will produce a Dataset very similar to
  3102                                                   that on which the 'to_dataframe' method was called, except with
  3103                                                   possibly redundant dimensions (since all dataset variables will have
  3104                                                   the same dimensionality).
  3105                                                   """
  3106                                                   # TODO: Add an option to remove dimensions along which the variables
  3107                                                   # are constant, to enable consistent serialization to/from a dataframe,
  3108                                                   # even if some variables have different dimensionality.
  3109
  3110         1        352.0    352.0      0.0          if not dataframe.columns.is_unique:
  3111                                                       raise ValueError(
  3112                                                           'cannot convert DataFrame with non-unique columns')
  3113
  3114         1          3.0      3.0      0.0          idx = dataframe.index
  3115         1        356.0    356.0      0.0          obj = cls()
  3116
  3117         1          2.0      2.0      0.0          if isinstance(idx, pd.MultiIndex):
  3118                                                       # it's a multi-index
  3119                                                       # expand the DataFrame to include the product of all levels
  3120         1       4524.0   4524.0      0.6              full_idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
  3121         1     717008.0 717008.0     98.6              dataframe = dataframe.reindex(full_idx)
  3122         1          3.0      3.0      0.0              dims = [name if name is not None else 'level_%i' % n
  3123         1         20.0     20.0      0.0                      for n, name in enumerate(idx.names)]
  3124         4          9.0      2.2      0.0              for dim, lev in zip(dims, idx.levels):
  3125         3       2973.0    991.0      0.4                  obj[dim] = (dim, lev)
  3126         1         37.0     37.0      0.0              shape = [lev.size for lev in idx.levels]
  3127                                                   else:
  3128                                                       dims = (idx.name if idx.name is not None else 'index',)
  3129                                                       obj[dims[0]] = (dims, idx)
  3130                                                       shape = -1
  3131
  3132         2        350.0    175.0      0.0          for name, series in iteritems(dataframe):
  3133         1         33.0     33.0      0.0              data = np.asarray(series).reshape(shape)
  3134         1       1520.0   1520.0      0.2              obj[name] = (dims, data)
  3135         1          1.0      1.0      0.0          return obj

@shoyer
Member

shoyer commented Oct 3, 2018

@max-sixty nevermind, you seem to have already discovered that :)

@tqfjo

tqfjo commented Feb 14, 2020

I've run into this twice. This time I'm seeing a difference of very roughly 100x or more just using a transpose -- I can't test or time it properly right now, but this is what it looks like:

ipdb> df
x              a       b  ...    c      d
y              0       0  ...    7      7
z                         ...            
0       0.000000     0.0  ...  0.0    0.0
1      -0.000416     0.0  ...  0.0    0.0

[2 rows x 2932 columns]
ipdb> df.to_xarray()

<I quit out because it takes at least 30s>

ipdb> df.T.to_xarray()

<Finishes instantly>

@crusaderky
Contributor

crusaderky commented Feb 14, 2020

@tqfjo unrelated. You're comparing the creation of a dataset with 2 variables against the creation of one with ~3000. Unsurprisingly, the latter will take ~1500x as long. If your dataset doesn't functionally contain 3000 variables but just a single two-dimensional variable, use xarray.DataArray(df).
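To illustrate the distinction (a minimal sketch on a made-up wide frame, not the exact data above):

import numpy as np
import pandas as pd
import xarray as xr

wide = pd.DataFrame(np.random.rand(2, 3000))   # few rows, thousands of columns
wide.index.name = 'z'
wide.columns.name = 'col'

ds = wide.to_xarray()     # Dataset with 3000 separate 1-D variables (slow path)
da = xr.DataArray(wide)   # single 2-D DataArray with dims ('z', 'col') (fast)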

@tqfjo

tqfjo commented Feb 15, 2020

@crusaderky Thanks for the pointer to xarray.DataArray(df) -- that makes my life a ton easier.


That said, if it helps anyone to know, I did just want a DataArray, but figured there was no alternative to first running the rather singular to_xarray. I also still find the runtime surprising, though I know nothing about xarray's internals.

@kefirbandi
Contributor

I know this is not a recent thread but I found no resolution, and we ran into the same issue recently. In our case we had a pandas series of roughly 15 million entries, with a 3-level multi-index, which had to be converted to an xarray.DataArray. The .to_xarray() took almost 2 minutes. Unstack + to_array took it down to roughly 3 seconds, provided the last level of the multi-index was unstacked.

However, a much faster solution was to go through a numpy array. The code below is based on an idea of Igor Raush.

(In this case df is a dataframe with a single column, or a series)

arr = np.full(df.index.levshape, np.nan)     # dense N-D array of NaNs
arr[tuple(df.index.codes)] = df.values.flat  # place values via the MultiIndex codes
da = xr.DataArray(arr, dims=df.index.names,
                  coords=dict(zip(df.index.names, df.index.levels)))
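One possible way to package this as a reusable helper (a sketch only: the function name is made up, and it assumes a Series with a unique MultiIndex on pandas >= 0.24, where MultiIndex.codes is available):

import numpy as np
import pandas as pd
import xarray as xr

def multiindex_series_to_dataarray(s: pd.Series) -> xr.DataArray:
    # dense array covering the full product of the index levels
    arr = np.full(s.index.levshape, np.nan)
    # scatter the values into place using each level's integer codes
    arr[tuple(s.index.codes)] = s.values
    return xr.DataArray(arr, dims=s.index.names,
                        coords=dict(zip(s.index.names, s.index.levels)))

# e.g. with the `cropped` series from the top of this thread,
# multiindex_series_to_dataarray(cropped) should match cropped.to_xarray()
# (NaN wherever a level combination is missing).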

@brey

brey commented Jun 24, 2020

Hi All. I stumbled across the same issue trying to convert a 5000-column dataframe to xarray (it was never going to happen...).
I found a workaround and I am posting the test below. Hope it helps.

import xarray as xr
import pandas as pd
import numpy as np

xr.__version__

    '0.15.1'

pd.__version__

    '1.0.5'

df = pd.DataFrame(np.random.randn(200, 500))

%%time
one = df.to_xarray()

    CPU times: user 29.6 s, sys: 60.4 ms, total: 29.6 s
    Wall time: 29.7 s

%%time
dic = {}
for name in df.columns:
    dic.update({name: (['index'], df[name].values)})

two = xr.Dataset(dic, coords={'index': ('index', df.index.values)})

    CPU times: user 17.6 ms, sys: 158 µs, total: 17.8 ms
    Wall time: 17.8 ms

one.equals(two)

    True
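The same workaround can also be written as a dict comprehension (an equivalent sketch, assuming all columns share the single 'index' dimension):

two = xr.Dataset(
    {name: (['index'], df[name].values) for name in df.columns},
    coords={'index': ('index', df.index.values)},
)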

shoyer added a commit to shoyer/xarray that referenced this issue Jun 26, 2020
Fixes pydataGH-2459

Before:

    pandas.MultiIndexSeries.time_to_xarray
    ======= ========= ==========
    --             subset
    ------- --------------------
    dtype     True     False
    ======= ========= ==========
      int    505±0ms   37.1±0ms
     float   485±0ms   38.3±0ms
    ======= ========= ==========

After:

    pandas.MultiIndexSeries.time_to_xarray
    ======= ========= ==========
    --             subset
    ------- --------------------
    dtype     True     False
    ======= ========= ==========
      int    11.5±0ms   39.2±0ms
     float   12.5±0ms   26.6±0ms
    ======= ========= ==========

There are still some cases where we have to fall back to the existing
slow implementation, but hopefully they should now be relatively rare.
@shoyer
Member

shoyer commented Jun 27, 2020

However, a much faster solution was to go through a numpy array. The code below is based on an idea of Igor Raush.

Thanks for sharing! This is a great tip indeed.

I've reimplemented from_dataframe to make use of this in #4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.

@kefirbandi
Contributor

I've reimplemented from_dataframe to make use of this in #4184, and it indeed makes things much, much faster! The original example in this thread is now 40x faster.

Very good news!
Thanks for implementing it!

shoyer added a commit that referenced this issue Jul 2, 2020
* Add MultiIndexSeries.time_to_xarray() benchmark

* Improve the speed of from_dataframe with a MultiIndex

Fixes GH-2459

Before:

    pandas.MultiIndexSeries.time_to_xarray
    ======= ========= ==========
    --             subset
    ------- --------------------
    dtype     True     False
    ======= ========= ==========
      int    505±0ms   37.1±0ms
     float   485±0ms   38.3±0ms
    ======= ========= ==========

After:

    pandas.MultiIndexSeries.time_to_xarray
    ======= ========= ==========
    --             subset
    ------- --------------------
    dtype     True     False
    ======= ========= ==========
      int    11.5±0ms   39.2±0ms
     float   12.5±0ms   26.6±0ms
    ======= ========= ==========

There are still some cases where we have to fall back to the existing
slow implementation, but hopefully they should now be relatively rare.

* remove unused import

* Simplify converting MultiIndex dataframes

* remove comments

* remove types with NA

* more multiindex dataframe tests

* add whats new note

* Preserve order of MultiIndex levels in from_dataframe

* Add todo note

* Rewrite from_dataframe to avoid passing around a dataframe

* Require that MultiIndexes are unique even with sparse=True

* clarify comment