BUG: Different behavior from .agg("mean") and .agg(["mean"]) on a groupby df with a datetime64[ns] column #47166

Closed
3 tasks done
leodtprojectsd opened this issue May 30, 2022 · 4 comments
Labels
Bug; Duplicate Report (Duplicate issue or pull request); Groupby; Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply)

Comments

@leodtprojectsd

leodtprojectsd commented May 30, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas import Timestamp

print("pandas Version:", pd.__version__)

# Sample frame: a grouping key, a datetime64[ns] column, and a float column
df = pd.DataFrame.from_dict({
    'filename': ['03_', '03_', '03_', '05_', '05_', '05_', '05_', '05_', '08_', '08_'],
    'date_time': [Timestamp('2022-05-24 12:10:56'), Timestamp('2022-05-24 12:11:24'), Timestamp('2022-05-24 12:11:51'),
                  Timestamp('2022-05-24 12:41:54'), Timestamp('2022-05-24 12:42:21'), Timestamp('2022-05-24 12:42:49'),
                  Timestamp('2022-05-24 12:43:16'), Timestamp('2022-05-24 12:43:44'), Timestamp('2022-05-24 12:57:30'),
                  Timestamp('2022-05-24 12:57:58')],
    'r': [80466.36, 71467.12, 72641.21, 76961.35, 86747.23, 81995.81, 74451.46, 69401.51, 73670.12, 78180.65],
})

# Note: df.info() prints its own report and returns None, so this line also prints "None"
print("df column types: ", df.info())

# The list form keeps date_time; the string form and .mean() drop it
print('\nWorks with: df.groupby(["filename"]).agg(["mean"])\n', df.groupby(["filename"]).agg(["mean"]))
print('\nNot working with: df.groupby(["filename"]).agg("mean")\n', df.groupby(["filename"]).agg("mean"))
print('\nNot working with: df.groupby(["filename"]).mean()\n', df.groupby(["filename"]).mean())


OUT: 
pandas Version: 1.3.5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   filename   10 non-null     object        
 1   date_time  10 non-null     datetime64[ns]
 2   r          10 non-null     float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 368.0+ bytes
df column types:  None

Works with: df.groupby(["filename"]).agg(["mean"]) #See date_time column appearing
                              date_time          r
                                  mean       mean
filename                                         
03_      2022-05-24 12:11:23.666666752  74858.230
05_      2022-05-24 12:42:48.800000000  77911.472
08_      2022-05-24 12:57:44.000000000  75925.385

Not working with: df.groupby(["filename"]).agg("mean") #date_time column is gone
                   r
filename           
03_       74858.230
05_       77911.472
08_       75925.385

Not working with: df.groupby(["filename"]).mean() #date_time column is gone
                   r
filename           
03_       74858.230
05_       77911.472
08_       75925.385

Issue Description

I expected the same behavior from

  • df.groupby(["filename"]).agg(["mean"])
  • df.groupby(["filename"]).agg("mean")
  • df.groupby(["filename"]).mean()

Instead, on a df with a datetime64[ns] column, only .agg(["mean"]) keeps that column in the result; .agg("mean") and .mean() drop it.

Expected Behavior

I expect agg(["mean"]), agg("mean"), and mean() to behave the same.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 66e3805
python : 3.7.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.188+
Version : #1 SMP Sun Apr 24 10:03:06 PDT 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.6
pytz : 2022.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.30
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 5.5.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : 1.3.4
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.13.3
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.36
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.20.2
xlrd : 1.1.0
xlwt : 1.3.0
numba : 0.51.2
None

@leodtprojectsd leodtprojectsd added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels May 30, 2022
@leodtprojectsd leodtprojectsd changed the title from "BUG:" to "BUG: Different behavior from .agg("mean") and .agg(["mean"]) on a groupby df with a datetime64[ns] column" May 30, 2022
@guyrt
Contributor

guyrt commented May 31, 2022

FWIW, in v1.2.5 none of these options operate on the datetime64[ns] column! @leodtprojectsd, are you working on a PR for this bug? If not, I'd like to work on one.

@rhshadrach
Member

rhshadrach commented May 31, 2022

Thanks for the report! When you pass a list or dict to agg, the DataFrame is broken up into Series before each function is applied. What you're seeing is the difference in numeric_only between DataFrame.groupby(...).mean and Series.groupby(...).mean. See:

https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.mean.html
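
To illustrate the point about list arguments being applied one Series at a time, here is a minimal sketch. It uses a small subset of the report's data, and the `by_hand` name is only for illustration; it rebuilds the list-form result by calling mean on each column's grouped Series, which is roughly what the explanation above describes, not a copy of the pandas internals.

```python
import pandas as pd

# Small illustrative frame (a subset of the data in the report)
df = pd.DataFrame({
    "filename": ["03_", "03_", "05_", "05_"],
    "date_time": pd.to_datetime([
        "2022-05-24 12:10:56", "2022-05-24 12:11:24",
        "2022-05-24 12:41:54", "2022-05-24 12:42:21",
    ]),
    "r": [80466.36, 71467.12, 76961.35, 86747.23],
})

grouped = df.groupby("filename")

# List form: each column is aggregated as its own Series
listed = grouped.agg(["mean"])

# The same thing done by hand: one Series-level groupby mean per column
by_hand = pd.DataFrame(
    {col: grouped[col].mean() for col in ["date_time", "r"]}
)

print(listed.columns.tolist())
print(by_hand.columns.tolist())
```

Both results keep the date_time column, because the Series-level mean is asked about one column at a time and has no other columns to exclude.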

You can get the same result with

print(df.groupby("filename").agg("mean", numeric_only=False))

So I think that makes this a duplicate of #46560.
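
As a sketch of the workaround above, again on a small subset of the report's data, the string form and the direct method call can both be given numeric_only=False. Passing the argument explicitly avoids depending on the default, which may differ between pandas versions.

```python
import pandas as pd

# Small illustrative frame (a subset of the data in the report)
df = pd.DataFrame({
    "filename": ["03_", "03_", "05_", "05_"],
    "date_time": pd.to_datetime([
        "2022-05-24 12:10:56", "2022-05-24 12:11:24",
        "2022-05-24 12:41:54", "2022-05-24 12:42:21",
    ]),
    "r": [80466.36, 71467.12, 76961.35, 86747.23],
})

# String form with the keyword forwarded to mean
explicit_agg = df.groupby("filename").agg("mean", numeric_only=False)

# Direct method call with the same keyword
explicit_mean = df.groupby("filename").mean(numeric_only=False)

print(explicit_agg.columns.tolist())
print(explicit_mean.columns.tolist())
```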

@rhshadrach rhshadrach added the Groupby, Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply), and Duplicate Report (Duplicate issue or pull request) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label May 31, 2022
@rhshadrach
Member

I'm going to close this as a duplicate. @guyrt and @leodtprojectsd, please reply here if you believe I've missed something; I'm happy to reopen.

@rhshadrach rhshadrach closed this as not planned (won't fix, can't repro, duplicate, stale) May 31, 2022
@leodtprojectsd
Author

@guyrt, I wasn't working on it, but I think @rhshadrach's reply covers it, thanks!
