API: numeric_only consistency #46560

Closed
9 tasks done
rhshadrach opened this issue Mar 29, 2022 · 9 comments · Fixed by #47749
Labels: API - Consistency (Internal Consistency of API/Behavior); Master Tracker (High level tracker for similar issues); Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply)

@rhshadrach
Member

rhshadrach commented Mar 29, 2022

Tracking issue for API consistency between all the different ops that have a numeric_only argument. This includes the dtype of the return when object types are present. I plan to report on the current state after the issues listed below are closed and prior to the 1.5 release.

rhshadrach added the API - Consistency and Nuisance Columns labels Mar 29, 2022
rhshadrach added this to the 1.5 milestone Mar 29, 2022
rhshadrach self-assigned this Mar 29, 2022
rhshadrach added the Master Tracker label Apr 10, 2022
@jreback
Contributor

jreback commented Jun 5, 2022

@rhshadrach if you can update #30228 to account for these (there are a couple but not all)

@rhshadrach
Member Author

@jreback - added 46072 and 46852; those were the only two missing.

@pavithraes

@rhshadrach Thanks for all your work here!

We're working on supporting this in Dask and wanted to confirm one thing: in the intermediate term, i.e. from 1.5.0 until numeric_only=False becomes the default, will the default be numeric_only=True?

We were looking at this changelog entry (also where we found this issue) and weren't sure about the defaults in 1.5.0. :)

@rhshadrach
Member Author

rhshadrach commented Jul 6, 2022

@pavithraes - the defaults in 1.5.0 should be backwards compatible with those in the rest of the 1.x releases. As remarked in the changelog you linked to, the defaults are currently inconsistent - some True, some False, some None. The only difference in 1.5.0 should be that users get a deprecation message if (a) the defaults are used and (b) the result would change based on the new default (False) in 2.0.

Any ops that gained a numeric_only argument as part of 1.5.0 will default to False.

I hope this is helpful - if I can provide any specific information please let me know. Are there particular ops (e.g. sum, mean, std) on particular pandas objects (Series, DataFrame, SeriesGroupBy, DataFrameGroupBy, Resampler, Rolling) that you are interested in? Also, if there are any concerns or issues noticed, please let me know!
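For downstream code that does not want to depend on per-op defaults during the transition, one option is to pass numeric_only explicitly. The snippet below is a minimal sketch under that assumption; the frame and the variable names are illustrative and not taken from pandas or Dask.

```python
import pandas as pd

# Small frame with one non-numeric ("nuisance") column, c.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": list("xyz")})

# Passing numeric_only explicitly means the result does not depend on
# whichever default a given pandas version uses for this op.
column_means = df.mean(numeric_only=True)
group_sums = df.groupby("a").sum(numeric_only=True)

# Selecting the numeric columns up front is another way to be explicit.
preselected = df.select_dtypes(include="number").mean()

print(column_means)
print(group_sums)
```

Being explicit also avoids the deprecation message described above, since that message is only meant to appear when the default is relied on.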

@pavithraes

@rhshadrach Got it, thank you for clarifying! Just one more quick question:

Any ops that gained a numeric_only argument as part of 1.5.0 will default to False.

We noticed DataFrame.corr() was set to True on the dev branch. So, will this be updated to False, or are we missing something?

@rhshadrach
Member Author

rhshadrach commented Jul 11, 2022

Thanks @pavithraes, indeed my comment above was incorrect. Each op has a default even if it doesn't have the argument; this is just its behavior when no argument is provided. So to maintain backwards compatibility, we need to have the default agree with the behavior in the earlier 1.x releases.

The table below compares all ops across DataFrame, DataFrameGroupBy, DatetimeIndexResampler, Rolling, Expanding and ExponentialMovingWindow. It shows only the rows where "has_arg" or "default" differs between the main and 1.4.x branches.

It all looks correct except for two DatetimeIndexResampler ops (var and std, whose default is True on 1.4.x but False on main). I plan to put up a PR to fix these.

                                 has_arg_main  default_main  has_arg_1.4.x  default_1.4.x
kind                   kernel                                                            
DataFrame              idxmax            True         False          False          False
                       idxmin            True         False          False          False
                       corrwith          True          True          False           True
                       corr              True          True          False           True
                       cov               True          True          False           True
DataFrameGroupBy       idxmax            True          True          False           True
                       idxmin            True          True          False           True
                       quantile          True          True          False           True
                       corrwith          True          True          False           True
                       var               True          True          False           True
                       std               True          True          False           True
                       sem               True          True          False           True
                       corr              True          True          False           True
                       cov               True          True          False           True
DatetimeIndexResampler last              True         False          False          False
                       first             True         False          False          False
                       quantile          True          True          False           True
                       var               True         False           True           True
                       std               True         False           True           True
                       median            True          True          False           True
                       sem               True          True          False           True
Rolling                count             True         False          False          False
Expanding              count             True         False          False          False
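The backwards-compatibility check described above amounts to looking for rows whose default differs between the two branches. The following is a small sketch of that comparison, using a handful of rows transcribed by hand from the table; the Row type and the variable names are hypothetical and are not the script that generated the table.

```python
from typing import NamedTuple


class Row(NamedTuple):
    has_arg_main: bool
    default_main: bool
    has_arg_14x: bool
    default_14x: bool


# A subset of the table above, keyed by (kind, kernel).
rows = {
    ("DataFrame", "corr"): Row(True, True, False, True),
    ("DataFrameGroupBy", "std"): Row(True, True, False, True),
    ("DatetimeIndexResampler", "first"): Row(True, False, False, False),
    ("DatetimeIndexResampler", "var"): Row(True, False, True, True),
    ("DatetimeIndexResampler", "std"): Row(True, False, True, True),
    ("Rolling", "count"): Row(True, False, False, False),
}

# Gaining the argument is fine; a changed default breaks compatibility.
default_changed = [
    key for key, row in rows.items() if row.default_main != row.default_14x
]
print(default_changed)
```

Rows that only gained the argument keep the same default on both branches, so they do not show up in default_changed.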

@rhshadrach
Member Author

rhshadrach commented Jul 12, 2022

The table below details the interaction between numeric_only and axis=1. It excludes DataFrameGroupBy.skew and DataFrameGroupBy.corrwith because these fail when given axis=1 even when all columns are numeric. I'll raise this in a separate issue if it hasn't been already.

The table was generated by running ops like

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz'), 'd': [6, 7, 8]})
df.sum(axis=1)

The columns in the table below are as follows.

  • kind: Type of object the kernel is on.
  • kernel: Operation (e.g. sum, mean, etc).
  • default: Default behavior of numeric_only.
  • warns: Whether a warning is generated with the default behavior. When this is "NA", it is because the operation raised. warns should be True if and only if default is True.
  • has_arg: Whether the kernel has numeric_only as an argument. This is determined by trying to pass numeric_only=True/False in case it is implemented via kwargs.
  • numeric_only_true: Whether numeric_only=True is correctly implemented.
  • numeric_only_false: Whether numeric_only=False is correctly implemented.
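The has_arg column is described as being determined by trying to pass the argument rather than by reading the signature. One way such a probe could look is sketched below; accepts_numeric_only is a hypothetical helper written for illustration, not the code used to build the table.

```python
import pandas as pd


def accepts_numeric_only(bound_method):
    """Return True if the call does not reject numeric_only as an unknown keyword."""
    try:
        bound_method(numeric_only=True)
    except TypeError as err:
        # An unsupported keyword surfaces as a TypeError with this message.
        if "unexpected keyword argument" in str(err):
            return False
        # Any other TypeError is a different problem; do not hide it.
        raise
    return True


df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": list("xyz"), "d": [6, 7, 8]})

has_arg = {name: accepts_numeric_only(getattr(df, name)) for name in ("sum", "nunique")}
print(has_arg)
```

Calling the method instead of inspecting its signature also covers ops that accept numeric_only through **kwargs, which is the case the description above mentions.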
numeric_only and axis=1 table
                                   default  warns  has_arg numeric_only_true numeric_only_false
kind                    kernel                                                                 
DataFrame               prod         False  False     True              True               True
                        idxmax       False     NA     True              True               True
                        nunique      False  False    False                NA                 NA
                        sum           True   True     True              True               True
                        idxmin       False     NA     True              True               True
                        all          False  False    False                NA                 NA
                        skew          True   True     True              True               True
                        quantile      True   True     True              True               True
                        max           True   True     True              True               True
                        min           True   True     True              True               True
                        any          False  False    False                NA                 NA
                        mean          True   True     True              True               True
                        corrwith      True   True     True              True              False
                        var           True   True     True              True               True
                        std           True   True     True              True               True
                        median        True   True     True              True               True
                        sem           True   True     True              True               True
                        count        False  False     True              True               True
                        ffill        False  False    False                NA                 NA
                        backfill     False  False    False                NA                 NA
                        cumsum       False     NA    False                NA                 NA
                        pad          False  False    False                NA                 NA
                        pct_change   False     NA    False                NA                 NA
                        cummax       False     NA    False                NA                 NA
                        cumprod      False  False    False                NA                 NA
                        bfill        False  False    False                NA                 NA
                        fillna       False  False    False                NA                 NA
                        rank          True   True     True              True               True
                        diff         False     NA    False                NA                 NA
                        cummin       False     NA    False                NA                 NA
                        shift        False  False    False                NA                 NA
                        take         False  False    False                NA                 NA
                        sample       False  False    False                NA                 NA
DataFrameGroupBy        idxmax       False     NA     True              True               True
                        idxmin       False     NA     True              True               True
                        cumsum       False     NA    False                NA                 NA
                        pct_change   False     NA    False                NA                 NA
                        cummax       False     NA     True             False               True
                        cumprod      False  False    False                NA                 NA
                        fillna       False  False    False                NA                 NA
                        rank         False     NA    False                NA                 NA
                        diff         False     NA    False                NA                 NA
                        cummin       False     NA     True             False               True
                        shift        False  False    False                NA                 NA
                        take         False   True    False                NA                 NA
Rolling                 skew          True   True     True              True               True
                        quantile      True   True     True              True               True
                        median        True   True     True              True               True
                        rank          True   True     True              True               True
                        corr         False     NA     True              True               True
                        cov          False     NA     True              True               True
Expanding               skew          True   True     True              True               True
                        quantile      True   True     True              True               True
                        median        True   True     True              True               True
                        rank          True   True     True              True               True
                        corr         False     NA     True              True               True
                        cov          False     NA     True              True               True
ExponentialMovingWindow corr         False     NA     True              True               True
                        cov          False     NA     True              True               True

The rows in the above table where issues are identified are:

                         default warns  has_arg numeric_only_true numeric_only_false
kind             kernel                                                              
DataFrame        corrwith    True  True     True              True              False
DataFrameGroupBy cummax     False    NA     True             False               True
                 cummin     False    NA     True             False               True
                 take       False  True    False                NA                 NA

I believe all rows above correspond to methods that gained the numeric_only argument as part of 1.5.0. The one exception is DataFrameGroupBy.take; this operation warns about group_keys when it shouldn't, as a result of #34998, which is also part of 1.5.0.

cc @mroeschke

@mroeschke
Member

The table below details the interaction between numeric_only and axis=1

Good to see the most common ops were implemented correctly!

@rhshadrach
Member Author

#46560 (comment) previously reported rolling/expanding ops not correctly implementing numeric_only=False. On further inspection, these emit a warning that they will no longer drop nuisance columns in the future so everything is set here. I've updated the comment above to remove these.

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz'), 'd': [6, 7, 8]})
rolling = df.rolling(2, min_periods=0)
print(rolling.median(numeric_only=False, axis=1).head())
# FutureWarning: Dropping of nuisance columns in rolling operations is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the operation. Dropped columns were Index(['c'], dtype='object')
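The warning text suggests selecting only valid columns before calling the operation. A minimal sketch of that approach, assuming the same example frame and using select_dtypes as one possible way to pick the numeric columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": list("xyz"), "d": [6, 7, 8]})

# Keep only numeric columns so the rolling op has no nuisance column to drop.
numeric_df = df.select_dtypes(include="number")
result = numeric_df.rolling(2, min_periods=0).median()
print(result)
```

With the object column c removed up front, the rolling median is computed over a, b and d only.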
