ENH: change get_dummies default dtype to bool #48022

Merged
merged 21 commits into from Oct 11, 2022

Contributor

@kianelbo kianelbo commented Aug 10, 2022

Added a FutureWarning when no dtype is passed to get_dummies, stating that the default dtype will change from np.uint8 to bool.
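For readers unfamiliar with this deprecation pattern, here is a minimal sketch of how a warning on an omitted dtype could be emitted. It is not the code from this PR: the wrapper `get_dummies_warn` and the sentinel `_NO_DTYPE` are hypothetical names, and only the first part of the warning message is taken from the test snippet quoted later in this thread.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical sentinel meaning "the caller did not pass dtype".
_NO_DTYPE = object()


def get_dummies_warn(data, dtype=_NO_DTYPE, **kwargs):
    """Hypothetical wrapper (not pandas API): warn when dtype is omitted."""
    if dtype is _NO_DTYPE:
        warnings.warn(
            "In a future version of pandas the default dtype will change "
            "to bool; pass dtype explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        # Fall back to the default that was in place when this was proposed.
        dtype = np.uint8
    return pd.get_dummies(data, dtype=dtype, **kwargs)
```

Passing `dtype=bool` or `dtype=np.uint8` explicitly skips the warning branch, which is why pinning the dtype in existing tests silences it.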

Member

@MarcoGorelli MarcoGorelli left a comment

This is off to a good start!

There are some CI failures in the doc build, could you fix them up please? e.g.

Warning in /home/runner/work/pandas/pandas/doc/source/whatsnew/v0.13.0.rst at block ending on line 526
Specify :okwarning: as an option in the ipython:: block to suppress this message

This would also require a whatsnew note

pandas/core/reshape/encoding.py (outdated, resolved)
pandas/tests/reshape/test_get_dummies.py (outdated, resolved)
Member

@MarcoGorelli MarcoGorelli left a comment

Nice, just suggesting a minor re-wording, rest looks good to me

cc @bashtage @WillAyd @jreback who'd commented on the issue

doc/source/whatsnew/v1.5.0.rst (outdated, resolved)
kianelbo and others added 3 commits August 10, 2022 20:02
Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
@mroeschke mroeschke added the Warnings Warnings that appear or should be added to pandas label Aug 10, 2022
Member

@MarcoGorelli MarcoGorelli left a comment

looks good to me

EDIT: holding off til discussion in pandas-dev meeting

@@ -169,7 +177,7 @@ def test_get_dummies_unicode(self, sparse):
         e = "e"
         eacute = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")
         s = [e, eacute, eacute]
-        res = get_dummies(s, prefix="letter", sparse=sparse)
+        res = get_dummies(s, dtype=np.uint8, prefix="letter", sparse=sparse)
Member

Wouldn't we rather just catch the warnings for these? Wondering how we remember in the future to go back and update these tests when we make the change to the dtype

Member

Could do, I just thought that would be a lot of warnings to catch

Regarding updating tests - I wouldn't have thought they needed updating, I'd have thought just having a test which called .get_dummies() (without specifying dtype) would be enough

Member

Yeah, I agree. So that's why I was thinking it is better to catch the warning for now and not change the argument. Otherwise, in the future we lose testing of the default argument's behavior unless someone comes back and reverts what was changed here.

Contributor

No, all tests should be fixed now,
and then you have an explicit test of the warning.

It's not better to defer fixing something like this.

Member

OK good points, thanks for raising

I've added this to the agenda for the next dev meeting

@kianelbo let's hold off further changes til after there's been discussion

# https://github.com/pandas-dev/pandas/issues/45848
msg = "In a future version of pandas the default dtype will change"
with tm.assert_produces_warning(FutureWarning, match=msg):
    get_dummies(df)
Member

sorry to go back on the approval, but can we check the return value here?

Contributor

@jreback jreback left a comment

i think this change (even the deprecation) needs discussion as this is quite some long standing behavior

pls schedule it for the next dev meeting

@datapythonista datapythonista mentioned this pull request Sep 8, 2022
Member

@MarcoGorelli MarcoGorelli left a comment

Hey @kianelbo

Apologies for conflicting instructions here. In the end, at the last dev meeting, we decided it would be best to treat this as a bug, and just change the default dtype without going through the deprecation cycle

Changing unsigned data to signed is unlikely to cause any issues

So, this'd be a much simpler fix on your side - just change the default dtype to bool, and make sure to either update tests or make sure this behaviour is tested

It's too late to get this into v1.5.0, so for now let's go with v1.6.0 or v1.5.1
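One way to keep the default behaviour covered, as requested above, is a test that calls get_dummies without a dtype. The sketch below is illustrative only and is not the test added in this PR; the helper name `default_dummies_dtype` and the version check are assumptions, with the 2.0 cut-off taken from the milestone this PR was finally assigned to.

```python
import numpy as np
import pandas as pd


def default_dummies_dtype():
    """Return the dtype get_dummies produces when dtype is not passed."""
    result = pd.get_dummies(["a", "b", "a"])
    return result.dtypes.iloc[0]


# Assumption: the new default shipped with pandas 2.0, per the milestone.
pandas_major = int(pd.__version__.split(".")[0])
expected = np.dtype(bool) if pandas_major >= 2 else np.dtype(np.uint8)

assert default_dummies_dtype() == expected
```

Because the call omits dtype, this check keeps exercising whatever the default is, which addresses the concern that pinning dtype in every other test would leave the default untested.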

@kianelbo kianelbo force-pushed the getdummies-default-dtype branch 4 times, most recently from 2ef8633 to 9f7fbc4 Compare September 22, 2022 16:04
@MarcoGorelli MarcoGorelli changed the title ENH: Warn when dtype is not passed to get_dummies ENH: change get_dummies default dtype to bool Sep 25, 2022
Member

@WillAyd WillAyd left a comment

lgtm - thanks!

@MarcoGorelli MarcoGorelli modified the milestones: 1.5.1, 1.6 Oct 7, 2022
@kianelbo kianelbo requested review from mroeschke and removed request for jreback October 11, 2022 13:40
@mroeschke mroeschke merged commit bfdf223 into pandas-dev:main Oct 11, 2022
@mroeschke
Member

Awesome, thanks for sticking with this @kianelbo

@kianelbo kianelbo deleted the getdummies-default-dtype branch October 11, 2022 16:56
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* ENH: Warn when dtype is not passed to get_dummies

* Edit get_dummies' dtype warning

* Add whatsnew entry for issue pandas-dev#45848

* Fix dtype warning test

* Suppress warnings in docs

* Edit whatsnew entry

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>

* Fix find_stack_level in get_dummies dtype warning

* Change the default dtype of get_dummies to bool

* Revert dtype(bool) change

* Move the changelog entry to v1.6.0.rst

* Move whatsnew entry to 'Other API changes'

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
@tylerjereddy
Contributor

FWIW, this does cause some confusion for downstream consumers--for example the common approach of using pd.get_dummies(pd.cut(...)) to get one-hot encoded data (common for ML applications) might more naturally be expected to continue to return a numeric type, which I believe is also the default for sklearn.preprocessing.OneHotEncoder. I actually plucked that approach right out of Wes' book I think.

Well, I'm not really that annoyed minus 90 minutes of debugging, and the fix is trivial for the consuming code (just specify the old default dtype to get_dummies), but I'll just place this comment here in case it helps others adapt their downstream code.
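A minimal sketch of the downstream fix described above: pass the old default dtype explicitly so the result is numeric on pandas versions both before and after this change. The sample values and bin edges are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data and bin edges, for illustration only.
values = np.array([0.5, 1.5, 2.5, 1.2])
binned = pd.cut(values, bins=[0, 1, 2, 3])

# Pinning dtype makes the output independent of the pandas default.
one_hot = pd.get_dummies(binned, dtype=np.uint8)
```

With the dtype pinned, every column of `one_hot` is uint8 regardless of which default the installed pandas version uses.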

tylerjereddy added a commit to tylerjereddy/darshan that referenced this pull request Feb 28, 2023
Fixes darshan-hpc#909

* make the library compatible with both `pandas 1.5.x` and `pandas
2.0.0rc0` by pinning the dtype we use for one-hot encoding our heatmap
data

* see related upstream comment and release notes (`get_dummies()`
change):
  - pandas-dev/pandas#48022 (comment)
  - https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html
@bashtage
Contributor

bashtage commented Mar 1, 2023

It also led to some non-obvious errors in statsmodels when testing against the pre-release. It wasn't that hard to change since I knew about this change, but it would have been hard to diagnose had one not known. When making the changes I half wondered if the default should have become double, which would have always prevented the math issues that this was meant to fix, albeit at the cost of 8x storage (although one could always choose bool if storage was more important than simplicity of use).
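As a rough illustration of the trade-off raised in this comment, the sketch below compares arithmetic and storage for explicitly chosen dtypes. The sample labels are made up, and the dtype is always passed explicitly so the example does not depend on the pandas default.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["a", "b", "a"])  # made-up sample labels

as_uint8 = pd.get_dummies(labels, dtype=np.uint8)
as_float = pd.get_dummies(labels, dtype=np.float64)
as_bool = pd.get_dummies(labels, dtype=bool)

# Unsigned subtraction wraps around: 0 - 1 becomes 255 instead of -1.
diff_uint8 = as_uint8["a"] - as_uint8["b"]

# Floating-point subtraction gives the arithmetically expected -1.0.
diff_float = as_float["a"] - as_float["b"]

# float64 uses 8 bytes per element, bool (like uint8) uses 1.
bytes_float = as_float.to_numpy().nbytes
bytes_bool = as_bool.to_numpy().nbytes
```

Here `bytes_float` is eight times `bytes_bool`, matching the 8x storage cost mentioned above.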

galipremsagar added a commit to rapidsai/cudf that referenced this pull request Apr 22, 2023
This PR changes the default dtype for get_dummies to bool from uint8 to match pandas-2.0: pandas-dev/pandas#48022
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: pd.get_dummies should not default to dtype np.uint8
7 participants