Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: fix regression in creation of resulting index in RollingGroupby #38057

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Nov 25, 2020

Closes #38038

TODO: fix corner cases, add benchmark, ..

@simonjayhawkins simonjayhawkins added this to the 1.1.5 milestone Nov 25, 2020
@jreback jreback added Groupby Performance Memory or execution speed performance Window rolling, ewma, expanding labels Nov 25, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mroeschke

do we have an asv that covers this case?

@mroeschke
Copy link
Member

We do but the ASV only has 1000 data points while the original example has 10000000

@jreback
Copy link
Contributor

jreback commented Nov 28, 2020

We do but the ASV only has 1000 data points while the original example has 10000000

does this still show a decent time diff? (and should increase that asv to 10k or so)

@jreback
Copy link
Contributor

jreback commented Nov 29, 2020

@jorisvandenbossche i something is actually failing here. can you merge master and see if you can fixup.

@jorisvandenbossche
Copy link
Member Author

We do but the ASV only has 1000 data points while the original example has 10000000

I added an additional benchmark, instead of changing the existing one to include larger data. Because for the case of the benchmarks, we actually had a nice performance improvement in 1.0->1.1 (instead of regression). So it is clearly capturing different aspects.
The newly added (and the report in #38038) is basically one with a limited number of groups, where the rolling apply calculation itself was quite cheap, and so the construction of the MulitIndex of the result dominated the profile.

asv_bench/benchmarks/rolling.py Outdated Show resolved Hide resolved
@@ -225,6 +225,19 @@ def time_rolling_offset(self, method):
getattr(self.groupby_roll_offset, method)()


class Groupby2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename to GroupbyLowNumberOfGroups or similar

pandas/core/window/rolling.py Show resolved Hide resolved
codes = [c.take(indexer) for c in codes]

if grouped_object_index is not None:
if isinstance(grouped_object_index, MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe more clear

idx = grouped_object_index.take(indexer)
if not isinstance(grouped_object_index, MultiIndex):
    idx = MultiIndex.from_arrays(idx)

@jorisvandenbossche
Copy link
Member Author

So a summary of the failing tests.

First case is on an empty dataframe, where the dtype of the level of the MultiIndex changed:

df = pd.DataFrame({"s1": []})
result = df.groupby("s1").rolling(window=1).sum()
expected = df.copy()
expected.index = pd.MultiIndex.from_tuples([], names=["s1", None])
In [5]: df.index
Out[5]: RangeIndex(start=0, stop=0, step=1)

In [6]: result.index.levels[0]
Out[6]: Float64Index([], dtype='float64', name='s1')

In [7]: expected.index.levels[0]
Out[7]: Index([], dtype='object', name='s1')

In [9]: result.index.levels[1]
Out[9]: Int64Index([], dtype='int64')

In [10]: expected.index.levels[1]
Out[10]: Index([], dtype='object')

But for example, when doing a normal groupby on this empty dataframe, you also get back an empty float index:

In [12]: df.groupby("s1").sum().index
Out[12]: Float64Index([], dtype='float64', name='s1')

So at least this is consistent with groupby itself (or, if it's a bug, it's probably one in groupby, and not in the rolling code)


The second case is with groupby with dropna=False, and the difference is about the NaN being included in the levels vs in the codes of the resulting MultiIndex:

df = pd.DataFrame(
    {"group": ["g1", np.nan, "g1", "g2", np.nan], "B": [0, 1, 2, 3, 4]}
)
result = df.groupby("group", dropna=False).rolling(1, min_periods=1).mean()
expected = pd.DataFrame(
    {"B": [0.0, 2.0, 3.0, 1.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [("g1", 0), ("g1", 2), ("g2", 3), (np.nan, 1), (np.nan, 4)],
        names=["group", None],
    )
)
In [16]: result.index.levels[0]
Out[16]: Index(['g1', 'g2', nan], dtype='object', name='group')

In [17]: result.index.codes[0]
Out[17]: array([0, 0, 1, 2, 2], dtype=int8)

In [18]: expected.index.levels[0]
Out[18]: Index(['g1', 'g2'], dtype='object', name='group')

In [19]: expected.index.codes[0]
Out[19]: array([ 0,  0,  1, -1, -1], dtype=int8)

Again, this seems to be caused by groupby, as the same behaviour can be noticed there:

# adding a group column so we can get a MultiIndex as result
In [20]: df["group2"] = 1

In [21]: result2 = df.groupby(["group", "group2"], dropna=False).mean()

In [22]: result2
Out[22]: 
                B
group group2     
g1    1       1.0
g2    1       3.0
NaN   1       2.5

In [23]: result2.index.levels[0]
Out[23]: Index(['g1', 'g2', nan], dtype='object', name='group')

In [24]: result2.index.codes[0]
Out[24]: array([0, 1, 2], dtype=int8)

I don't know if we have a general rule about where missing values should be included? (in the levels vs the codes?)
I would expect that we generally prefer to have them in the codes, though (but again, this is then more a bug in groupby, and not rolling)

@jreback
Copy link
Contributor

jreback commented Nov 30, 2020

we likely don't support dropna=False in 1.1.x at all (or its just wrong), wouldn't worry about the 2nd case for now. (but can create an issue on master for this).

@jorisvandenbossche
Copy link
Member Author

Indeed, that second case was only introduced in #36842, which fixed dropna=False case (so on master, not yet in a released version).
And I can confirm that on pandas 1.1.4, the NaNs were incorrectly dropped for the example above. So changing the test won't change behaviour compared to previous releases, so will update the test.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 30, 2020 14:28
@jorisvandenbossche
Copy link
Member Author

I also added a few tests specifically related to the resulting MultiIndex (it's a bit hard to see to what extent the existing tests already cover this or not, but it are the examples I was using to test while implementing the fix).
I wrote the examples and expected result using pandas 1.0.5, and also tested that they are passing on master (without this change) as well.

@jorisvandenbossche
Copy link
Member Author

I changed the expected result for the first case as well (rolling operation on groupby). Since the original index of the dataframe is an (empty) RangeIndex, using a numeric index level seems actually more correct than an object type. A bit a corner case, though ..

@simonjayhawkins simonjayhawkins mentioned this pull request Nov 30, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, very minor comments (ok with merging as is though).

cc @mroeschke if you can have a quick look

tm.assert_frame_equal(result, expected)

expected = DataFrame({"s1": [], "s2": []})
result = expected.groupby(["s1", "s2"]).rolling(window=1).sum()
expected.index = MultiIndex.from_tuples([], names=["s1", "s2", None])
# expected.index = MultiIndex.from_tuples([], names=["s1", "s2", None])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove the commented code :-<

@@ -418,12 +426,24 @@ def test_groupby_rolling_empty_frame(self):
# GH 36197
expected = DataFrame({"s1": []})
result = expected.groupby("s1").rolling(window=1).sum()
expected.index = MultiIndex.from_tuples([], names=["s1", None])
# GH-38057 from_tupes gives empty object dtype, we now get float/int levels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_tuples

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to add a note that the return index is now 'fixed'?

@jorisvandenbossche
Copy link
Member Author

do we need to add a note that the return index is now 'fixed'?

For the dropna case certainly not, since that was only something in master. For the empty it could be mentioned, but not fully sure it is worth it (I am also not fully convinced it is completely fixed, I think groupby should maybe give an empty int64 index instead of empty float64 index).

But the performance fix is certainly worth noting, will add a whatsnew for that.

@jreback
Copy link
Contributor

jreback commented Dec 1, 2020

lgtm. just wanted to have @mroeschke give a glance.


group_indices = self._groupby.grouper.indices.values()
if group_indices:
indexer = np.concatenate(list(self._groupby.grouper.indices.values()))
Copy link
Member

@mroeschke mroeschke Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this can simply be indexer = np.concatenate(list(group_indices))

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment otherwise LGTM (can merge without it).

Thanks for tracking this down @jorisvandenbossche!

@simonjayhawkins
Copy link
Member

test failure unrelated and seen on other PRs

=========================== short test summary info ============================
FAILED pandas/tests/series/indexing/test_indexing.py::test_loc_setitem_2d_to_1d_raises
FAILED pandas/tests/util/test_show_versions.py::test_show_versions - Assertio...
= 2 failed, 133017 passed, 13911 skipped, 1390 xfailed, 41 xpassed in 1053.70s (0:17:33) =

@simonjayhawkins simonjayhawkins merged commit 6d0dab4 into pandas-dev:master Dec 1, 2020
@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.1.x

@lumberbot-app

This comment has been minimized.

@jorisvandenbossche jorisvandenbossche deleted the perf-rolling-groupby branch December 1, 2020 09:30
simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Dec 1, 2020
simonjayhawkins added a commit that referenced this pull request Dec 1, 2020
…ex in RollingGroupby (#38211)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@cpmbailey
Copy link

Thanks for the quick fix on this one, and I hope my simplified example wasn't misleading. I originally spotted this on a real life example with in the order of five times as much data and about 200 groups. Do you think this still classes as a low number of groups?

@jorisvandenbossche
Copy link
Member Author

@cpmbailey you're welcome (and thank you for reporting a easy reproducible example).
It's not the exact number of groups, but mostly the relative number compared to the total size that triggers the issue. So for large data, 200 groups can be relatively small.

@cpmbailey
Copy link

This is much improved, I only get twice as long against 1.0.5 now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: Performance regression on RollingGroupby
5 participants