
REGR: Performance regression on RollingGroupby #38038

Closed
cpmbailey opened this issue Nov 24, 2020 · 9 comments · Fixed by #38057
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding
@cpmbailey

pd.DataFrame({'a': np.random.randn(10000000), 'b': 1}).groupby('b').rolling(3).mean() is approximately 10x slower in 1.1.x than in 1.0.5
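For reference, a smaller-scale sketch of the reproducer (100_000 rows instead of the 10_000_000 in the report, so it runs quickly); the two-level result index it produces is what the discussion below traces the slowdown to.

```python
# Smaller-scale sketch of the reported reproducer.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(100_000), 'b': 1})
result = df.groupby('b').rolling(3).mean()

# The result index is a MultiIndex of (group key, original row label);
# constructing it is where the regression was later traced.
print(result.index.nlevels)  # 2
print(len(result))           # 100000
```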

@rhshadrach rhshadrach changed the title Performance regression on RollingGroupby REGR: Performance regression on RollingGroupby Nov 24, 2020
@rhshadrach rhshadrach added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding labels Nov 24, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1.5 milestone Nov 24, 2020
@jorisvandenbossche
Member

@cpmbailey thanks for the report! I can confirm this; for me, the snippet (without the dataframe creation) takes 2s on pandas 1.0.5 and 20s on pandas 1.1.4.

@mroeschke do you know if this is something "known"?

From looking at the profile, it seems that _apply is now taking a lot more time; it was #34052 that introduced a RollingGroupby-specific _apply implementation (although that PR indicates it was done for performance reasons)

And it seems that on master it has slowed down a bit further

@jorisvandenbossche
Member

Looking at a profile, it seems that most of the time is spent constructing the MultiIndex. Illustrating this with line_profiler:

In [8]: df = pd.DataFrame({'a': np.random.randn(10000000), 'b': 1})

In [9]: %lprun -f pd.core.window.rolling.RollingGroupby._apply df.groupby('b').rolling(3).mean()

Total time: 91.4188 s
File: /home/joris/scipy/pandas/pandas/core/window/rolling.py
Function: _apply at line 758

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   758                                               def _apply(
   759                                                   self,
   760                                                   func: Callable[..., Any],
   761                                                   name: Optional[str] = None,
   762                                                   numba_cache_key: Optional[Tuple[Callable, str]] = None,
   763                                                   **kwargs,
   764                                               ) -> FrameOrSeries:
   765         1          6.0      6.0      0.0          result = super()._apply(
   766         1          3.0      3.0      0.0              func,
   767         1          2.0      2.0      0.0              name,
   768         1          2.0      2.0      0.0              numba_cache_key,
   769         1    1793069.0 1793069.0      2.0              **kwargs,
   770                                                   )
   771                                                   # Reconstruct the resulting MultiIndex from tuples
   772                                                   # 1st set of levels = group by labels
   773                                                   # 2nd set of levels = original index
   774                                                   # Ignore 2nd set of levels if a group by label include an index level
   775                                                   result_index_names = [
   776         1         11.0     11.0      0.0              grouping.name for grouping in self._groupby.grouper._groupings
   777                                                   ]
   778         1          1.0      1.0      0.0          grouped_object_index = None
   779                                           
   780                                                   column_keys = [
   781         1          2.0      2.0      0.0              key
   782         1         14.0     14.0      0.0              for key in result_index_names
   783                                                       if key not in self.obj.index.names or key is None
   784                                                   ]
   785                                           
   786         1          3.0      3.0      0.0          if len(column_keys) == len(result_index_names):
   787         1          1.0      1.0      0.0              grouped_object_index = self.obj.index
   788         1          3.0      3.0      0.0              grouped_index_name = [*grouped_object_index.names]
   789         1          1.0      1.0      0.0              result_index_names += grouped_index_name
   790                                                   else:
   791                                                       # Our result will have still kept the column in the result
   792                                                       result = result.drop(columns=column_keys, errors="ignore")
   793                                           
   794         1          1.0      1.0      0.0          result_index_data = []
   795         2          8.0      4.0      0.0          for key, values in self._groupby.grouper.indices.items():
   796  10000001   10311686.0      1.0     11.3              for value in values:
   797                                                           data = [
   798  10000000   15486335.0      1.5     16.9                      *com.maybe_make_list(key),
   799  10000000    8523837.0      0.9      9.3                      *com.maybe_make_list(
   800                                                                   grouped_object_index[value]
   801  10000000   34071689.0      3.4     37.3                          if grouped_object_index is not None
   802                                                                   else []
   803                                                               ),
   804                                                           ]
   805  10000000   10521752.0      1.1     11.5                  result_index_data.append(tuple(data))
   806                                           
   807         1          7.0      7.0      0.0          result_index = MultiIndex.from_tuples(
   808         1   10705244.0 10705244.0     11.7              result_index_data, names=result_index_names
   809                                                   )
   810         1       5165.0   5165.0      0.0          result.index = result_index
   811         1          4.0      4.0      0.0          return result

(the profiler adds a lot of overhead (roughly 3-4x in total time), so the relative numbers are not necessarily reliable, but the overall picture is certainly interesting)

@phofl
Member

phofl commented Nov 24, 2020

Edit: Messed something up.

Constructing the MultiIndex takes way longer when the original index is itself a MultiIndex:

pd.DataFrame({'a': np.random.randn(10000000), 'b': 1, "c": 1, "d": 1}).set_index(['b', "c"]).groupby("d").rolling(3).mean()

takes 40 seconds on my machine; the original example took 18-19 seconds.

@mroeschke
Member

My initial performance test case only used 1000 points, unlike this issue's example which uses 10000000: #34052 (comment)

I didn't anticipate this, but it makes sense that at this scale the creation of the resulting MultiIndex dominates the timings.

@phofl
Member

phofl commented Nov 24, 2020

We could maybe avoid the inner loop, but that reduces the time only by about 50%; it does not get us anywhere near 2 seconds
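A hedged sketch of what "avoiding the inner loop" could look like, on toy data; here `indices` stands in for pandas' `grouper.indices` mapping, and the variable names are illustrative, not the ones in rolling.py:

```python
# Toy sketch: flatten grouper.indices without a per-row Python loop.
# `indices` mimics pandas' grouper.indices: {group key -> row positions}.
import numpy as np

indices = {('g1',): np.array([0, 1, 2]), ('g2',): np.array([3, 4])}

keys = list(indices)
counts = [len(v) for v in indices.values()]

# Repeat each group key once per row of its group, in bulk.
key_codes = np.repeat(np.arange(len(keys)), counts)
row_positions = np.concatenate(list(indices.values()))

print(key_codes.tolist())      # [0, 0, 0, 1, 1]
print(row_positions.tolist())  # [0, 1, 2, 3, 4]
```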

@jreback
Contributor

jreback commented Nov 24, 2020

you can likely use MultiIndex.from_arrays here
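A minimal sketch of that suggestion (not the actual change in #38057; the variable names are illustrative): building the index from whole arrays sidesteps materializing one Python tuple per row.

```python
# Sketch: MultiIndex.from_arrays vs. per-row tuple construction.
import numpy as np
import pandas as pd

n = 1_000_000
group_key = np.ones(n, dtype=np.int64)   # single group b == 1, as in the report
original_index = pd.RangeIndex(n)

# Per-row tuples (roughly what the profiled _apply did) would build
# n Python tuples before from_tuples even starts:
#   pd.MultiIndex.from_tuples([(k, i) for k, i in zip(group_key, original_index)])

# Array-based construction stays in NumPy/pandas the whole way:
mi = pd.MultiIndex.from_arrays([group_key, original_index], names=['b', None])
print(mi.nlevels, len(mi))  # 2 1000000
```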

@jorisvandenbossche
Member

Indeed, my guess is that we should be able to reduce most of the time taken by the index creation by avoiding creating all the tuples

@jorisvandenbossche
Member

I made an attempt at rewriting this index creation, see #38057. It has some different behaviour in a few corner cases (like the dtype of an empty index, or NaNs in the levels vs. in the codes).
I am curious how this was handled before (in 1.0), but couldn't directly figure it out

@jorisvandenbossche
Member

But it brings the time of the snippet above back to 2s, i.e. as it was for pandas 1.0
