ENH: Add numeric_only to groupby frame ops #46524

Closed
8 of 14 tasks
Tracked by #46560
rhshadrach opened this issue Mar 26, 2022 · 4 comments · Fixed by #46728
Labels: Deprecate (Functionality to remove in pandas), Enhancement, Groupby, Needs Discussion (Requires discussion from core team before further action)

@rhshadrach
Member

rhshadrach commented Mar 26, 2022

Once #46072 is implemented, many groupby ops will default to numeric_only=False in 2.0. However, there are a number of groupby ops that can only ever work on numeric data. For API consistency, I believe these ops should raise when a user tries to operate on non-numeric columns. Consider the example

import pandas as pd

df = pd.DataFrame({'a': [1, 1], 'b': [3, 4], 'c': [5, 6]})
df['c'] = df['c'].astype(object)
gb = df.groupby('a')
print(gb.mean())

which gives the output

     b
a     
1  3.5

If a user has a numeric column that accidentally ends up as object dtype, the result will be silently missing expected columns. This is why I think we should run the op with all provided data, regardless of whether it is numeric.
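As a rough sketch of how a caller might notice the silent drop described above, one option is to compare the columns that went into the groupby with the columns that came back. The helper name `dropped_columns` is hypothetical and not part of pandas. It passes `numeric_only=True` explicitly so that it reproduces the behavior shown here regardless of which default the installed pandas version uses.

```python
import pandas as pd


def dropped_columns(df, by):
    """Return the non-key columns missing from a numeric-only grouped mean.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Pin numeric_only=True so the result does not depend on the version default.
    result = df.groupby(by).mean(numeric_only=True)
    # Every column except the grouping key is a candidate for the op.
    expected = df.columns.drop(by)
    return list(expected.difference(result.columns))


df = pd.DataFrame({'a': [1, 1], 'b': [3, 4], 'c': [5, 6]})
df['c'] = df['c'].astype(object)
print(dropped_columns(df, 'a'))
```

A non-empty list here means some column was excluded from the result, which is the situation the proposed warning is meant to surface.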

The following groupby ops have no numeric_only argument and act like numeric_only=True, but only make sense on numeric data.

The following groupby ops have no numeric_only argument and act like numeric_only=True, but make sense on non-numeric data.

For both groups of ops, I propose we add the numeric_only argument defaulting to True in 1.5, which emits a warning message that it will default to False in the future. The warning would only be emitted if setting numeric_only to True/False would give rise to different output; i.e. if there are non-numeric columns that could have been operated on.
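The conditional warning proposed above could look roughly like the following. This is only a sketch under stated assumptions, not the pandas implementation: `grouped_mean`, the `_NO_DEFAULT` sentinel, and the warning text are all hypothetical stand-ins. The point is that the warning fires only when the caller did not pass numeric_only and there are non-numeric columns that the op would otherwise skip.

```python
import warnings

import pandas as pd

_NO_DEFAULT = object()  # hypothetical sentinel meaning "argument not passed"


def grouped_mean(df, by, numeric_only=_NO_DEFAULT):
    """Hypothetical wrapper sketching the proposed behavior; not pandas API."""
    candidates = df.columns.drop(by)
    numeric = df[candidates].select_dtypes(include="number").columns

    if numeric_only is _NO_DEFAULT:
        # Keep the current behavior, but warn only if the future default
        # would operate on columns that are being skipped today.
        numeric_only = True
        if len(candidates.difference(numeric)) > 0:
            warnings.warn(
                "The default value of numeric_only will change to False in a "
                "future version; pass numeric_only explicitly to silence "
                "this warning.",  # placeholder wording, an assumption
                FutureWarning,
                stacklevel=2,
            )

    # With numeric_only=False the outcome on object columns depends on the
    # installed pandas version; this sketch does not try to normalize that.
    columns = numeric if numeric_only else candidates
    return df.groupby(by)[list(columns)].mean()


df = pd.DataFrame({'a': [1, 1], 'b': [3, 4], 'c': [5, 6]})
df['c'] = df['c'].astype(object)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = grouped_mean(df, 'a')

print(list(result.columns))
print([w.category.__name__ for w in caught])
```

Passing numeric_only=True explicitly takes the same code path without the warning, which is how a user would opt in to the current behavior ahead of the default change.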

It's not ideal to add an argument and deprecate its default value in the same minor release (assuming 1.5 is the last minor release in the 1.x series); however, I believe it will be of minor impact to users. The alternatives would be not carrying out the deprecation of numeric_only=True or to leave these ops behaving as if numeric_only=True (with no numeric_only argument). Both of these seem like worse alternatives to me.

cc @jreback @jbrockmendel @jorisvandenbossche @simonjayhawkins @Dr-Irv

@rhshadrach rhshadrach added Enhancement Groupby Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Mar 26, 2022
@rhshadrach rhshadrach changed the title Add numeric_only to groupby ops ENH: Add numeric_only to groupby ops Mar 26, 2022
@Dr-Irv
Contributor

Dr-Irv commented Mar 28, 2022

I like the idea. Have you checked that there would then be consistency between Series.op and Series.groupby().op where op is mean, std, cummax, etc. in terms of when we raise and how we handle numeric and non-numeric data?

@simonjayhawkins
Member

or to leave these ops behaving as if numeric_only=True (with no numeric_only argument). Both of these seem like worse alternatives to me.

It makes sense to add the keyword for consistency with DataFrame ops, and it also makes sense to make the default False, again for consistency.

If the reason is to allow users to operate on object columns with numeric data then we should probably also be consistent with the return type for the different ops as there seems to currently be an inconsistency. Either keep the object columns with numeric data as object dtype, as quantile does, or cast to e.g. float, as mean does.

>>> pd.Series([5, 7], dtype=object).quantile([0.1, 0.5])
0.1    5.2
0.5    6.0
dtype: object
>>> 
>>> pd.DataFrame({"a": [5, 7]}, dtype=object).quantile(0.1)
<stdin>:1: FutureWarning: In future versions of pandas, numeric_only will be set to False by default, and the datetime/timedelta columns will be considered in the results. To not consider these columnsspecify numeric_only=True.
Series([], Name: 0.1, dtype: float64)
>>> 
>>> pd.DataFrame({"a": [5, 7]}, dtype=object).quantile([0.1, 0.2]).dtypes
Series([], dtype: object)
>>> 
>>> pd.DataFrame({"a": [5, 7]}, dtype=object).mean(numeric_only=False)
a    6.0
dtype: float64
>>> 
>>> pd.DataFrame({"a": [5, 7]}, dtype=object).quantile(0.1, numeric_only=False)
a    5.2
Name: 0.1, dtype: object
>>> 
>>> result = (
...     pd.DataFrame({"gb": [1, 1], "a": [5, 7]}, dtype=object)
...     .groupby("gb")
...     .mean(numeric_only=False)
... )

>>> print(result)
      a
gb     
1   6.0
>>> 
>>> print(result.dtypes)
a    float64
dtype: object
>>> 

@jbrockmendel
Member

Side-note: quantile's pre-processing is a bit of a mess. That doesn't invalidate anything said about it here, but I'm optimistic at least the clarity of the situation will improve.

Generally +1 on the OP's idea to make things more consistent, but I wonder about making the eventual default numeric_only=False, to match effectively everything else?

@rhshadrach rhshadrach self-assigned this Mar 28, 2022
@rhshadrach rhshadrach changed the title ENH: Add numeric_only to groupby ops ENH: Add numeric_only to groupby frame ops Mar 29, 2022
@rhshadrach
Member Author

rhshadrach commented Mar 29, 2022

@Dr-Irv

Have you checked that there would then be consistency between Series.op and Series.groupby().op where op is mean, std, cummax, etc. in terms of when we raise and how we handle numeric and non-numeric data?

Thanks - I was under the impression that these ops would always work as numeric_only=False, but I am seeing some odd behavior. I've opened #46560 to track and plan to investigate this after.

@simonjayhawkins

If the reason is to allow users to operate on object columns with numeric data then we should probably also be consistent with the return type for the different ops as there seems to currently be an inconsistency.

Agreed, I've added this to #46560

@jbrockmendel

but I wonder about making the eventual default numeric_only=False, to match effectively everything else?

I think we're on the same page here - the plan is to default to True in 1.5 (so as to make it non-breaking) with a FutureWarning that they will default to False in 2.0.
