API: numeric_only consistency #46560

rhshadrach · 2022-03-29T13:16:09Z

Tracking issue for API consistency between all the different ops that have a numeric_only argument. This includes the dtype of the return when object types are present. I plan to report on the current state after the issues listed below are closed and prior to the 1.5 release.

jreback · 2022-06-05T22:15:07Z

@rhshadrach if you can update #30228 to account for these (there are a couple but not all)

rhshadrach · 2022-06-07T02:33:34Z

@jreback - added 46072 and 46852; those were the only two missing.

pavithraes · 2022-07-06T19:00:49Z

@rhshadrach Thanks for all your work here!

We're working on supporting this for Dask, and we just wanted to confirm if for the intermediate-term, i.e., from 1.5.0 to when numeric_only=False by default, the default will be numeric_only=True?

We were looking at this changelog entry (also where we found this issue) and weren't sure about the defaults in 1.5.0. :)

rhshadrach · 2022-07-06T21:58:23Z

@pavithraes - the defaults in 1.5.0 should be backwards compatible with those in the rest of the 1.x releases. As remarked in the changelog you linked to, the defaults are currently inconsistent - some True, some False, some None. The only difference in 1.5.0 should be that users get a deprecation message if (a) the defaults are used and (b) the result would change based on the new default (False) in 2.0.
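One way to keep code independent of the version-specific default is to pass numeric_only explicitly, or to select the numeric columns before calling the op. A minimal sketch, assuming a small mixed-dtype frame like the one used later in this thread (the variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": list("xyz"), "d": [6, 7, 8]})

# Explicit argument: the object column "c" is excluded regardless of the default.
explicit = df.mean(numeric_only=True)

# The same selection done by hand before calling the op.
selected = df.select_dtypes(include="number").mean()

print(explicit)
print(selected)
```

Both results contain only the columns a, b and d, so neither call relies on the default value of numeric_only.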

Any ops that gained a numeric_only argument as part of 1.5.0 will default to False.

I hope this is helpful - if I can provide any specific information please let me know. Are there particular ops (e.g. sum, mean, std) on particular pandas objects (Series, DataFrame, SeriesGroupBy, DataFrameGroupBy, Resampler, Rolling) that you are interested in? Also, if there are any concerns or issues noticed, please let me know!

pavithraes · 2022-07-07T13:11:29Z

@rhshadrach Got it, thank you for clarifying! Just one more quick question:

> Any ops that gained a numeric_only argument as part of 1.5.0 will default to False.

We noticed DataFrame.corr() was set to True on the dev branch. So, will this be updated to False, or are we missing something?

rhshadrach · 2022-07-11T21:08:02Z

Thanks @pavithraes, indeed my comment above was incorrect. Each op has a default even if it doesn't have the argument; this is just its behavior when no argument is provided. So to maintain backwards compatibility, we need to have the default agree with the behavior in the earlier 1.x releases.

The table below compares all ops across DataFrame, DataFrameGroupBy, DatetimeIndexResampler, Rolling, Expanding and ExponentialMovingWindow. The results shown include only rows where "has_arg" or "default" differs between main and 1.4.x branches.

It all looks correct except for two DatetimeIndexResampler ops. I plan to put up a PR to fix these.

                                 has_arg_main  default_main  has_arg_1.4.x  default_1.4.x
kind                   kernel                                                            
DataFrame              idxmax            True         False          False          False
                       idxmin            True         False          False          False
                       corrwith          True          True          False           True
                       corr              True          True          False           True
                       cov               True          True          False           True
DataFrameGroupBy       idxmax            True          True          False           True
                       idxmin            True          True          False           True
                       quantile          True          True          False           True
                       corrwith          True          True          False           True
                       var               True          True          False           True
                       std               True          True          False           True
                       sem               True          True          False           True
                       corr              True          True          False           True
                       cov               True          True          False           True
DatetimeIndexResampler last              True         False          False          False
                       first             True         False          False          False
                       quantile          True          True          False           True
                       var               True         False           True           True
                       std               True         False           True           True
                       median            True          True          False           True
                       sem               True          True          False           True
Rolling                count             True         False          False          False
Expanding              count             True         False          False          False
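The two DatetimeIndexResampler ops that need fixing are the rows where the default on main differs from the default on 1.4.x. A small sketch of that comparison in plain Python, using only a subset of the rows from the table above (the tuple layout is an assumption made for illustration):

```python
# Subset of the table above:
# (kind, kernel, has_arg_main, default_main, has_arg_1_4_x, default_1_4_x)
rows = [
    ("DataFrame", "corr", True, True, False, True),
    ("DataFrameGroupBy", "var", True, True, False, True),
    ("DatetimeIndexResampler", "quantile", True, True, False, True),
    ("DatetimeIndexResampler", "var", True, False, True, True),
    ("DatetimeIndexResampler", "std", True, False, True, True),
    ("Rolling", "count", True, False, False, False),
]

# A default that changed between branches breaks backwards compatibility.
changed_default = [
    (kind, kernel)
    for kind, kernel, _, default_main, _, default_old in rows
    if default_main != default_old
]
print(changed_default)
```

Rows where only has_arg differs are expected, since those ops gained the argument on main while keeping their previous default.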

rhshadrach · 2022-07-12T20:17:17Z

The table below details the interaction between numeric_only and axis=1. It excludes DataFrameGroupBy.skew and DataFrameGroupBy.corrwith because these fail when given axis=1 even when all columns are numeric. I'll raise this in a separate issue if it hasn't been already.

The table was generated by running ops like

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz'), 'd': [6, 7, 8]})
df.sum(axis=1)
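For comparison, a sketch of the same op with numeric_only passed explicitly, which is roughly what the numeric_only_true column exercises. The exact checks used to build the table are not shown in this thread, so this only illustrates the expected result for sum:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz'), 'd': [6, 7, 8]})

# With numeric_only=True the object column 'c' is dropped before summing
# across each row, so only a, b and d contribute.
row_sums = df.sum(axis=1, numeric_only=True)
print(row_sums.tolist())  # [10, 12, 15]
```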

The columns in the table below are as follows.

kind: Type of object the kernel is on.
kernel: Operation (e.g. sum, mean, etc).
default: Default behavior of numeric_only.
warns: Whether a warning is generated with the default behavior. When this is "NA", it is because the operation raised. warns should be True if and only if default is True.
has_arg: Whether the kernel has numeric_only as an argument. This is determined by trying to pass numeric_only=True/False in case it is implemented via kwargs.
numeric_only_true: Whether numeric_only=True is correctly implemented.
numeric_only_false: Whether numeric_only=False is correctly implemented.
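The has_arg probe described above can be sketched roughly as follows. The helper name accepts_numeric_only is hypothetical, and for brevity it only tries numeric_only=True on an all-numeric frame rather than both True and False:

```python
import pandas as pd


def accepts_numeric_only(obj, kernel):
    """Hypothetical helper: probe by calling the method, so that
    kwargs-based implementations are detected as well."""
    try:
        getattr(obj, kernel)(numeric_only=True)
    except TypeError as err:
        # Only treat "unexpected keyword" errors as a missing argument.
        if "unexpected keyword argument" in str(err):
            return False
        raise
    return True


df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "d": [6, 7, 8]})
print(accepts_numeric_only(df, "sum"))
print(accepts_numeric_only(df, "nunique"))
```

Calling the method instead of inspecting its signature is a design choice that matches the description above: a method that forwards numeric_only through kwargs would be missed by a signature check.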

numeric_only and axis=1 table

                                   default  warns  has_arg numeric_only_true numeric_only_false
kind                    kernel                                                                 
DataFrame               prod         False  False     True              True               True
                        idxmax       False     NA     True              True               True
                        nunique      False  False    False                NA                 NA
                        sum           True   True     True              True               True
                        idxmin       False     NA     True              True               True
                        all          False  False    False                NA                 NA
                        skew          True   True     True              True               True
                        quantile      True   True     True              True               True
                        max           True   True     True              True               True
                        min           True   True     True              True               True
                        any          False  False    False                NA                 NA
                        mean          True   True     True              True               True
                        corrwith      True   True     True              True              False
                        var           True   True     True              True               True
                        std           True   True     True              True               True
                        median        True   True     True              True               True
                        sem           True   True     True              True               True
                        count        False  False     True              True               True
                        ffill        False  False    False                NA                 NA
                        backfill     False  False    False                NA                 NA
                        cumsum       False     NA    False                NA                 NA
                        pad          False  False    False                NA                 NA
                        pct_change   False     NA    False                NA                 NA
                        cummax       False     NA    False                NA                 NA
                        cumprod      False  False    False                NA                 NA
                        bfill        False  False    False                NA                 NA
                        fillna       False  False    False                NA                 NA
                        rank          True   True     True              True               True
                        diff         False     NA    False                NA                 NA
                        cummin       False     NA    False                NA                 NA
                        shift        False  False    False                NA                 NA
                        take         False  False    False                NA                 NA
                        sample       False  False    False                NA                 NA
DataFrameGroupBy        idxmax       False     NA     True              True               True
                        idxmin       False     NA     True              True               True
                        cumsum       False     NA    False                NA                 NA
                        pct_change   False     NA    False                NA                 NA
                        cummax       False     NA     True             False               True
                        cumprod      False  False    False                NA                 NA
                        fillna       False  False    False                NA                 NA
                        rank         False     NA    False                NA                 NA
                        diff         False     NA    False                NA                 NA
                        cummin       False     NA     True             False               True
                        shift        False  False    False                NA                 NA
                        take         False   True    False                NA                 NA
Rolling                 skew          True   True     True              True               True
                        quantile      True   True     True              True               True
                        median        True   True     True              True               True
                        rank          True   True     True              True               True
                        corr         False     NA     True              True               True
                        cov          False     NA     True              True               True
Expanding               skew          True   True     True              True               True
                        quantile      True   True     True              True               True
                        median        True   True     True              True               True
                        rank          True   True     True              True               True
                        corr         False     NA     True              True               True
                        cov          False     NA     True              True               True
ExponentialMovingWindow corr         False     NA     True              True               True
                        cov          False     NA     True              True               True

The rows in the above table where issues are identified are:

                         default warns  has_arg numeric_only_true numeric_only_false
kind             kernel                                                              
DataFrame        corrwith    True  True     True              True              False
DataFrameGroupBy cummax     False    NA     True             False               True
                 cummin     False    NA     True             False               True
                 take       False  True    False                NA                 NA
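The selection of these four rows can be sketched as a simple filter. The criteria below are an assumption based on the column descriptions: a row is flagged when warns disagrees with default (ignoring NA), or when either numeric_only=True or numeric_only=False is not correctly implemented. Only a subset of the full table is included:

```python
NA = None  # stands in for the "NA" entries in the table

# Subset of the full table:
# (kind, kernel, default, warns, has_arg, numeric_only_true, numeric_only_false)
rows = [
    ("DataFrame", "sum", True, True, True, True, True),
    ("DataFrame", "nunique", False, False, False, NA, NA),
    ("DataFrame", "corrwith", True, True, True, True, False),
    ("DataFrameGroupBy", "idxmax", False, NA, True, True, True),
    ("DataFrameGroupBy", "cummax", False, NA, True, False, True),
    ("DataFrameGroupBy", "cummin", False, NA, True, False, True),
    ("DataFrameGroupBy", "take", False, True, False, NA, NA),
    ("Rolling", "corr", False, NA, True, True, True),
]


def has_issue(row):
    _, _, default, warns, _, ok_true, ok_false = row
    # warns should be True if and only if default is True (NA means the op raised).
    wrong_warning = warns is not NA and warns != default
    wrong_impl = ok_true is False or ok_false is False
    return wrong_warning or wrong_impl


flagged = [row[:2] for row in rows if has_issue(row)]
print(flagged)
```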

I believe all rows above correspond to methods that gained the numeric_only argument as part of 1.5.0. The one exception to this is DataFrameGroupBy.take; this operation incorrectly warns about group_keys as a result of #34998, which is also a part of 1.5.0.

cc @mroeschke

mroeschke · 2022-07-12T21:24:53Z

> The table below details the interaction between numeric_only and axis=1

Good to see the most common ops were implemented correctly!

rhshadrach · 2022-07-16T11:32:56Z

#46560 (comment) previously reported rolling/expanding ops not correctly implementing numeric_only=False. On further inspection, these emit a warning that they will no longer drop nuisance columns in the future so everything is set here. I've updated the comment above to remove these.

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz'), 'd': [6, 7, 8]})
rolling = df.rolling(2, min_periods=0)
print(rolling.median(numeric_only=False, axis=1).head())
# FutureWarning: Dropping of nuisance columns in rolling operations is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the operation. Dropped columns were Index(['c'], dtype='object')
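Following the warning's advice to select only valid columns first, a minimal sketch of the forward-compatible form is shown here. It assumes the same frame and uses the default axis rather than axis=1:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz'), 'd': [6, 7, 8]})

# Drop the object column up front so the rolling op has no nuisance columns.
numeric = df.select_dtypes(include="number")
result = numeric.rolling(2, min_periods=0).median()
print(result)
```

Because column 'c' is removed before the window is created, the result does not depend on how nuisance columns are handled.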

rhshadrach added API - Consistency Internal Consistency of API/Behavior Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Mar 29, 2022

rhshadrach added this to the 1.5 milestone Mar 29, 2022

rhshadrach self-assigned this Mar 29, 2022

rhshadrach mentioned this issue Mar 29, 2022

ENH: Add numeric_only to groupby frame ops #46524

Closed

14 tasks

This was referenced Apr 8, 2022

DEPR: mad #46707

Merged

ENH: Add numeric_only to frame methods #46708

Merged

rhshadrach added the Master Tracker High level tracker for similar issues label Apr 10, 2022

rhshadrach mentioned this issue Apr 10, 2022

ENH: Add numeric_only to certain groupby ops #46728

Merged

4 tasks

This was referenced Apr 19, 2022

ENH: Add numeric_only to resampler methods #46792

Merged

DEPR: numeric_only default on DataFrame ops #46852

Closed

rhshadrach mentioned this issue May 21, 2022

BUG: resample ops includes sampled column #47079

Closed

This was referenced May 30, 2022

DEPR: numeric_only default in resampler ops #47177

Merged

BUG: Different behavior from .agg("mean") and .agg(["mean"]) on a grouby df with a datetime64[ns] column #47166

Closed

rhshadrach mentioned this issue Jun 7, 2022

ENH: Add numeric_only to window ops #47265

Merged

4 tasks

This was referenced Jun 23, 2022

BUG: Fix issues with numeric_only deprecation #47481

Merged

API: numeric_only in Series ops behavior #47500

Closed

ncclementi mentioned this issue Jul 12, 2022

Upstream fix for numeric_only default [WIP] dask/dask#9241

Closed

3 tasks

This was referenced Jul 14, 2022

BUG: numeric_only with axis=1 in DataFrame.corrwith and DataFrameGroupBy.cummin/max #47724

Merged

BUG: reset_index after a group_by raise a ValueError for empty dataframe #43767

Closed

rhshadrach mentioned this issue Jul 16, 2022

BUG: Correct numeric_only default for resample var and std #47749

Merged

5 tasks

mroeschke closed this as completed in #47749 Jul 16, 2022

jrbourbeau mentioned this issue Sep 7, 2022

numeric_only compatibility with pandas=1.5 dask/dask#9471

Open

ncclementi mentioned this issue Sep 15, 2022

Filter out numeric_only warnings from pandas dask/dask#9496

Merged

This was referenced Jul 15, 2023

Inconsistencies in groupby aggregation with non-numeric types #13416

Closed

Decimal fields dropped in group by with more than one column #22275

Closed

BUG: groupby.sum removed columns in case sequential calls of several groupby.sum #44132

Closed
