
REGR: Performance regression on RollingGroupby #38038

Closed
cpmbailey opened this issue Nov 24, 2020 · 9 comments · Fixed by #38057
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding
@cpmbailey

pd.DataFrame({'a': np.random.randn(10000000), 'b': 1}).groupby('b').rolling(3).mean() is approximately 10x slower in 1.1.x than in 1.0.5
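For reference, a smaller-scale sketch of the reproducer (100_000 rows instead of the 10_000_000 in the report, so it runs quickly); the two-level result index it produces is what the discussion below traces the slowdown to.

```python
# Smaller-scale sketch of the reported reproducer.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(100_000), 'b': 1})
result = df.groupby('b').rolling(3).mean()

# The result index is a MultiIndex of (group key, original row label);
# constructing it is where the regression was later traced.
print(result.index.nlevels)  # 2
print(len(result))           # 100000
```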

@rhshadrach rhshadrach changed the title Performance regression on RollingGroupby REGR: Performance regression on RollingGroupby Nov 24, 2020
@rhshadrach rhshadrach added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding labels Nov 24, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1.5 milestone Nov 24, 2020
@jorisvandenbossche
Member

@cpmbailey thanks for the report! I can confirm this; for me, the snippet (without the dataframe creation) takes 2s on pandas 1.0.5 and 20s on pandas 1.1.4.

@mroeschke do you know if this is something "known"?

From looking at the profile, it seems that _apply is now taking a lot more time; it was #34052 that introduced a RollingGroupby-specific _apply implementation (although that PR indicates it was done for performance reasons)

And it seems that on master it has slowed down a bit further

@jorisvandenbossche
Member

Looking at a profile, it seems that most of the time is spent constructing the MultiIndex. Illustrating this with line_profiler:

In [8]: df = pd.DataFrame({'a': np.random.randn(10000000), 'b': 1})

In [9]: %lprun -f pd.core.window.rolling.RollingGroupby._apply df.groupby('b').rolling(3).mean()

Total time: 91.4188 s
File: /home/joris/scipy/pandas/pandas/core/window/rolling.py
Function: _apply at line 758

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   758                                               def _apply(
   759                                                   self,
   760                                                   func: Callable[..., Any],
   761                                                   name: Optional[str] = None,
   762                                                   numba_cache_key: Optional[Tuple[Callable, str]] = None,
   763                                                   **kwargs,
   764                                               ) -> FrameOrSeries:
   765         1          6.0      6.0      0.0          result = super()._apply(
   766         1          3.0      3.0      0.0              func,
   767         1          2.0      2.0      0.0              name,
   768         1          2.0      2.0      0.0              numba_cache_key,
   769         1    1793069.0 1793069.0      2.0              **kwargs,
   770                                                   )
   771                                                   # Reconstruct the resulting MultiIndex from tuples
   772                                                   # 1st set of levels = group by labels
   773                                                   # 2nd set of levels = original index
   774                                                   # Ignore 2nd set of levels if a group by label include an index level
   775                                                   result_index_names = [
   776         1         11.0     11.0      0.0              grouping.name for grouping in self._groupby.grouper._groupings
   777                                                   ]
   778         1          1.0      1.0      0.0          grouped_object_index = None
   779                                           
   780                                                   column_keys = [
   781         1          2.0      2.0      0.0              key
   782         1         14.0     14.0      0.0              for key in result_index_names
   783                                                       if key not in self.obj.index.names or key is None
   784                                                   ]
   785                                           
   786         1          3.0      3.0      0.0          if len(column_keys) == len(result_index_names):
   787         1          1.0      1.0      0.0              grouped_object_index = self.obj.index
   788         1          3.0      3.0      0.0              grouped_index_name = [*grouped_object_index.names]
   789         1          1.0      1.0      0.0              result_index_names += grouped_index_name
   790                                                   else:
   791                                                       # Our result will have still kept the column in the result
   792                                                       result = result.drop(columns=column_keys, errors="ignore")
   793                                           
   794         1          1.0      1.0      0.0          result_index_data = []
   795         2          8.0      4.0      0.0          for key, values in self._groupby.grouper.indices.items():
   796  10000001   10311686.0      1.0     11.3              for value in values:
   797                                                           data = [
   798  10000000   15486335.0      1.5     16.9                      *com.maybe_make_list(key),
   799  10000000    8523837.0      0.9      9.3                      *com.maybe_make_list(
   800                                                                   grouped_object_index[value]
   801  10000000   34071689.0      3.4     37.3                          if grouped_object_index is not None
   802                                                                   else []
   803                                                               ),
   804                                                           ]
   805  10000000   10521752.0      1.1     11.5                  result_index_data.append(tuple(data))
   806                                           
   807         1          7.0      7.0      0.0          result_index = MultiIndex.from_tuples(
   808         1   10705244.0 10705244.0     11.7              result_index_data, names=result_index_names
   809                                                   )
   810         1       5165.0   5165.0      0.0          result.index = result_index
   811         1          4.0      4.0      0.0          return result

(the profiler adds a lot of overhead (roughly 3-4x in total time), so the relative numbers are not necessarily reliable, but the overall picture is certainly interesting)

@phofl
Member

phofl commented Nov 24, 2020

Edit: Messed something up.

Constructing the MultiIndex takes way longer when the original index is itself a MultiIndex:

pd.DataFrame({'a': np.random.randn(10000000), 'b': 1, "c": 1, "d": 1}).set_index(['b', "c"]).groupby("d").rolling(3).mean()

takes 40 seconds on my machine; the original example took 18-19 seconds.

@mroeschke
Member

My initial performance test case only used 1000 points, unlike this issue's example which uses 10000000: #34052 (comment)

I didn't anticipate this, but it makes sense that at this scale the creation of the resulting MultiIndex dominates the timings.

@phofl
Member

phofl commented Nov 24, 2020

We could maybe avoid the inner loop, but that reduces the time only by about 50%; it does not get us anywhere near 2 seconds
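A hedged sketch of what "avoiding the inner loop" could look like, on toy data; here `indices` stands in for pandas' `grouper.indices` mapping, and the variable names are illustrative, not the ones in rolling.py:

```python
# Toy sketch: flatten grouper.indices without a per-row Python loop.
# `indices` mimics pandas' grouper.indices: {group key -> row positions}.
import numpy as np

indices = {('g1',): np.array([0, 1, 2]), ('g2',): np.array([3, 4])}

keys = list(indices)
counts = [len(v) for v in indices.values()]

# Repeat each group key once per row of its group, in bulk.
key_codes = np.repeat(np.arange(len(keys)), counts)
row_positions = np.concatenate(list(indices.values()))

print(key_codes.tolist())      # [0, 0, 0, 1, 1]
print(row_positions.tolist())  # [0, 1, 2, 3, 4]
```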

@jreback
Contributor

jreback commented Nov 24, 2020

you can likely use MultiIndex.from_arrays here
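A minimal sketch of that suggestion (not the actual change in #38057; the variable names are illustrative): building the index from whole arrays sidesteps materializing one Python tuple per row.

```python
# Sketch: MultiIndex.from_arrays vs. per-row tuple construction.
import numpy as np
import pandas as pd

n = 1_000_000
group_key = np.ones(n, dtype=np.int64)   # single group b == 1, as in the report
original_index = pd.RangeIndex(n)

# Per-row tuples (roughly what the profiled _apply did) would build
# n Python tuples before from_tuples even starts:
#   pd.MultiIndex.from_tuples([(k, i) for k, i in zip(group_key, original_index)])

# Array-based construction stays in NumPy/pandas the whole way:
mi = pd.MultiIndex.from_arrays([group_key, original_index], names=['b', None])
print(mi.nlevels, len(mi))  # 2 1000000
```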

@jorisvandenbossche
Member

Indeed, my guess is that we should be able to reduce most of the time taken by the index creation by avoiding creating all the tuples

@jorisvandenbossche
Member

I made an attempt at rewriting this index creation, see #38057. It has some different behaviour in a few corner cases (like the dtype of an empty index, or NaNs in the levels vs. in the codes).
I am curious how this was handled before (in 1.0), but couldn't directly figure it out

@jorisvandenbossche
Member

But it brings the time of the snippet above back to 2s, i.e. as it was for pandas 1.0
