Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Groupby first/last/nth treats None as an observation #38330

Merged
merged 4 commits into from Dec 6, 2020

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Dec 6, 2020

This is a regression from 1.0.x to 1.1.x, introduced by #33462. Assuming it doesn't get into the 1.2 rc, not sure if it should go into 1.2 during the rc phase or wait until 1.3.

Had to decide on some edge-case behaviors in odd situations with missing values, see #38286 (comment)

@rhshadrach rhshadrach changed the title BUG: Groupby first/last/nth treats None as an observation REGR: Groupby first/last/nth treats None as an observation Dec 6, 2020
@rhshadrach rhshadrach added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version labels Dec 6, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls find out when this was originally changed

this was on purpose

@rhshadrach
Copy link
Member Author

rhshadrach commented Dec 6, 2020

@jreback This was originally changed in #33462 so that

pd.DataFrame({"id": ["a"], "value": [None]}, dtype=object).groupby("id").first()

returns None rather than np.nan. That behavior is retained here.

@jreback
Copy link
Contributor

jreback commented Dec 6, 2020

so you think that the original issues are wrong? this is deleting that test case and reverting

@simonjayhawkins
Copy link
Member

This is a regression from 1.0.x to 1.1.x

if this is the case, should go in 1.1.5.

@rhshadrach
Copy link
Member Author

@jreback I believe this is reverting the fix in #33462 and solving the issue there in the proper manner, along with the regression it caused. The original test case still exists, is the first parameter to test_first_last_with_none

@rhshadrach
Copy link
Member Author

I can restore the original test and add the new one for the regression if that's preferable.

@jreback
Copy link
Contributor

jreback commented Dec 6, 2020

yes if this also doesn't change the original issue then leave the tests

ok for 1.1.5

@rhshadrach
Copy link
Member Author

@jreback test restored, added to whatsnew 1.5

@@ -27,6 +27,7 @@ Fixed regressions
- Fixed regression in :meth:`DataFrame.fillna` not filling ``NaN`` after other operations such as :meth:`DataFrame.pivot` (:issue:`36495`).
- Fixed performance regression in ``df.groupby(..).rolling(..)`` (:issue:`38038`)
- Fixed regression in :meth:`MultiIndex.intersection` returning duplicates when at least one of the indexes had duplicates (:issue:`36915`)
- Fixed regression in :meth:`.GroupBy.first`, :meth:`.GroupBy.last`, and :meth:`.GroupBy.nth` where ``None`` was considered a non-NA value (:issue:`38286`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't affect nth right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks. Fixed.

@jreback jreback added this to the 1.1.5 milestone Dec 6, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u run some asvs / perf for this
i am worried we are converting to object type

@rhshadrach
Copy link
Member Author

@jreback asv run with asv continuous -f 1.1 upstream/master HEAD -b ^groupby resulted in BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Linux py39 failure is unrelated:

FAILED pandas/tests/io/pytables/test_store.py::TestHDFStore::test_append_with_data_columns
= 1 failed, 143282 passed, 5482 skipped, 1091 xfailed, 4 warnings in 1132.31s (0:18:52) =

@jreback jreback merged commit 55e3bff into pandas-dev:master Dec 6, 2020
@jreback
Copy link
Contributor

jreback commented Dec 6, 2020

thanks for the quick patch @rhshadrach

@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.1.x

@lumberbot-app

This comment has been minimized.

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Dec 6, 2020
simonjayhawkins added a commit that referenced this pull request Dec 6, 2020
…servation (#38333)

Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
@amalcgcg
Copy link

amalcgcg commented Dec 7, 2020

We just upgraded to 1.1.5 this morning, and it works great! Thank you for the blazing fast discussion, decision, fix, merge, and backport!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: GroupBy.first does not skip missing values in string-valued columns
4 participants