Inconsistencies in groupby aggregation with non-numeric types #13416

pijucha · 2016-06-09T22:13:27Z

Some of the issues and inconsistencies I noticed.
(Apologies for a bit lengthy input.)

Settings:

df = pd.DataFrame({"A": [2, 1, 2, 2],
                   "B": [3, 3, 4, 4],
                   "C1": pd.Categorical([5, 6, 6, 6], [5, 6, 7]),
                   "C2": pd.Categorical(list("aaab"), list("abc")),
                   "D": pd.date_range('2011-11-11', periods=4),
                   "E": [10, 20, 30, 40]})
df
Out[7]: 
   A  B C1 C2          D   E
0  2  3  5  a 2011-11-11  10
1  1  3  6  a 2011-11-12  20
2  2  4  6  a 2011-11-13  30
3  2  4  6  b 2011-11-14  40

df.dtypes
Out[8]: 
A              int64
B              int64
C1          category
C2          category
D     datetime64[ns]
E              int64
dtype: object

df_ac = df[['A', 'C1', 'C2']]
df_ad = df[['A', 'D']]
df_abc = df[['A', 'B', 'C1', 'C2']]
df_abd = df[['A', 'B', 'D']]
df_abe = df[['A', 'B', 'E']]
df_acd = df[['A', 'C1', 'C2', 'D']]
df_abcd = df[['A', 'B', 'C1', 'C2', 'D']]

Usually, non-numeric types are skipped in aggregation functions:

df.groupby('A').mean()
Out[16]: 
          B          E
A                     
1  3.000000  20.000000
2  3.666667  26.666667

df.groupby('A').sum()
Out[17]: 
    B   E
A        
1   3  20
2  11  80

df.groupby(['A', 'B']).mean()
Out[18]: 
      E
A B    
1 3  20
2 3  10
  4  35

df.groupby(['A', 'B'], as_index=False).sum()
Out[19]: 
   A  B   E
0  1  3  20
1  2  3  10
2  2  4  70

Issues:

But if there are no numeric types, an output varies. (I use here subframes df_xxx of the original data frame.)

# .mean() always raises.
df_ac.groupby('A').mean()                               # (1)
pandas.core.base.DataError: No numeric types to aggregate

# .sum() adds categoricals
df_ac.groupby('A').sum()                                # (2)
Out[21]: 
   C1   C2
A         
1   6    a
2  17  aab

# and tries to do something with datetimes.
df_ad.groupby('A').sum()                                # (3)
Out[22]: 
           D
A           
1 2011-11-12
2        NaT

df_acd.groupby('A').sum()                               # (4)
Out[23]: 
    C1   C2                    D
A                               
1  6.0    a  2011-11-12 00:00:00
2  NaN  NaN                  NaN

# It's even worse for multiple groupers.
df_abcd.groupby(['A', 'B']).sum()                       # (5)
Out[24]: 
A  B    
1  3  C1                      6
      C2                      a
      D     2011-11-12 00:00:00
2  3  C1                      5
      C2                      a
      D     2011-11-11 00:00:00
dtype: object

df_abcd.groupby(['A', 'B'], as_index=False).sum()       # (6)
Out[25]: 
0  A                       1
   B                       3
   C1                      6
   C2                      a
   D     2011-11-12 00:00:00
1  A                       2
   B                       3
   C1                      5
   C2                      a
   D     2011-11-11 00:00:00
2  A                       4
   B                       8
dtype: object

# Additionally, the index is not reset here and the grouper columns are transformed.
df_abc.groupby(['A', 'B'], as_index=False).sum()        # (7)
Out[26]: 
     A  B  C1  C2
A B              
1 3  1  3   6   a
2 3  2  3   5   a
  4  4  8  12  ab

# Sometimes an empty data frame is returned  (a reasonable output):
df_abd.groupby(['A', 'B']).sum()                        # (8)
Out[27]: 
Empty DataFrame
Columns: []
Index: []

# but not always:
df_abd.groupby(['A', 'B'], as_index=False).sum()        # (9)
Out[28]: 
0  A                      1
   B                      3
   D    2011-11-12 00:00:00
1  A                      2
   B                      3
   D    2011-11-11 00:00:00
2  A                      4
   B                      8
dtype: object

df_abcd.groupby(['A', 'B']).sum()                       # (10)
Out[29]: 
A  B    
1  3  C1                      6
      C2                      a
      D     2011-11-12 00:00:00
2  3  C1                      5
      C2                      a
      D     2011-11-11 00:00:00
dtype: object

Some other issues:

Multiple groupers with a categrical one (it's already addressed in #13204).

df.groupby(['A', 'C1'], as_index=False).sum()           # (11)
Out[36]: 
       A  C1   B   E
A C1                
1 5  NaN NaN NaN NaN
  6  NaN NaN NaN NaN
  7  NaN NaN NaN NaN
2 5  NaN NaN NaN NaN
  6  NaN NaN NaN NaN
  7  NaN NaN NaN NaN

apply

# mean() as expected:
df.groupby(['A', 'B']).apply(lambda x: np.mean(x))
Out[30]: 
       A    B     E
A B                
1 3  1.0  3.0  20.0
2 3  2.0  3.0  10.0
  4  2.0  4.0  35.0

# sum() not:
df.groupby(['A', 'B']).apply(lambda x: np.sum(x))       # (12)
Out[31]: 
A  B    
1  3  A                       1
      B                       3
      C1                      6
      C2                      a
      D     2011-11-12 00:00:00
      E                      20
2  3  A                       2
      B                       3
      C1                      5
      C2                      a
      D     2011-11-11 00:00:00
      E                      10
   4  A                       4
      B                       8
      E                      70
dtype: object

transform

# it's ok with numeric types only
df_abe.groupby(['A', 'B']).transform('sum')
Out[32]: 
    E
0  10
1  20
2  70
3  70

# Doesn't transform anything with mixed types:
df.groupby(['A', 'B']).transform('sum')                 # (13)
Out[33]: 
  C1 C2          D   E
0  5  a 2011-11-11  10
1  6  a 2011-11-12  20
2  6  a 2011-11-13  30
3  6  b 2011-11-14  40

# but someimtes transforms categoricals
df_abc.groupby(['A', 'B']).transform('sum')             # (14)
Out[34]: 
   C1  C2
0   5   a
1   6   a
2  12  ab
3  12  ab

# and sometimes not:
df_abc.groupby(['A', 'B'], as_index=False).transform('sum')     # (15)
Out[35]: 
  C1 C2
0  5  a
1  6  a
2  6  a
3  6  b

What should be the expected output?

Some ideas for aggregation (with sum()) when there's no numeric types (1)-(10):

always raise (as .mean() does)
return an empty DataFrame as in (8)
(but (a) should .mean() do the same? (b) should groupers be returned when as_index=False?)
???

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
machine: x86_64

pandas: 0.18.1
Cython: 0.24
numpy: 1.11.0
IPython: 4.0.1

Exactly the same output with:

pandas: 0.18.1+119.gd405bf2

Just a thought about a possible approach:

A call to _python_apply_general inside _python_agg_general
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L795
seems to trigger some of the issues. (It doesn't exclude grouper columns when as_index=False, treats categoricals as their underlying types, and possibly just replicates some actions from earlier part of _python_agg_general.)

The following change affects (and possibly solves, at least partially)
issues (5), (6), (7), (9), (10), (14):

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index bea62e9..0d401e2 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -791,10 +791,7 @@ class _GroupBy(PandasObject, SelectionMixin):
             except TypeError:
                 continue

-        if len(output) == 0:
-            return self._python_apply_general(f)
-
-        if self.grouper._filter_empty_groups:
+        if len(output) > 0 and self.grouper._filter_empty_groups:

             mask = counts.ravel() > 0
             for name, result in compat.iteritems(output):

The text was updated successfully, but these errors were encountered:

jreback · 2016-06-10T12:44:32Z

hmm, yeah this look suspect. although .sum() can be an aggregator (for string columns), and so maybe we should leave it, this should be predictable, so I would make all of these be the same (e.g. raise).

Note I am only referring to the inbuilt methods .sum()/.mean() directly on a grouper, you can always do .apply(lambda x: x.sum()) if you want to operate on a string column.

jreback · 2016-06-10T12:44:40Z

@sinhrks @jorisvandenbossche

sinhrks · 2016-06-13T00:06:33Z

Normal aggregation also has inconsistencies. It excludes categorical if it has timestamp column, includes otherwise.

df.sum()
# A      7
# B     14
# E    100
# dtype: int64

# exclude timestamp col
df[['A', 'B', 'C1', 'C2', 'E']].sum()
# A        7
# B       14
# C1      23
# C2    aaab
# E      100
# dtype: object

Maybe it should be excluded by default even if categorical internal is numeric, and included if numeric_only=True? Thus, groupby agg also should have numeric_only kwd.

pijucha · 2016-06-13T01:44:59Z

As for normal aggregation, there's also this:

df[['C1']].sum()
Out[10]: 
C1    23
dtype: int64

df['C1'].sum()
...
TypeError: Categorical cannot perform the operation sum

saddy001 · 2017-05-31T06:32:31Z

Please also consider the following case:
I have a DataFrame of class "A" instances (object type). Instances of A behave much like floats (addition, division, etc. possible), but they don't implement __float__, because I want to evaluate them lazily (very expensive calculation). For this, I call A().GetValue() which evaluates the calculation function.
So I rely on pandas to not enforce converting the A()s to floats. This is working fine for DataFrame.sum(), min() and max(), but not for mean():

If I call DataFrame.mean(axis=1), all values are NaN and the dtype becomes float64, because _ensure_numeric is called that raises "TypeError: Could not convert A(..) to numeric".

rhshadrach · 2023-07-15T11:17:19Z

Every op in OP with non-numeric data now raises due to #46560

jreback added Groupby Dtype Conversions Unexpected or buggy dtype conversions API Design Difficulty Advanced labels Jun 10, 2016

jreback added this to the Next Major Release milestone Jun 10, 2016

jreback mentioned this issue Aug 5, 2016

BUG: Columns used in sum on 1 row dataframe depend on values instead of dtype #13912

Closed

jreback mentioned this issue Aug 16, 2016

SeriesGroupby.cumsum raises on object dtype #13992

Open

jreback mentioned this issue Apr 29, 2018

GroupBy handles non-numeric data differently for different aggregation functions #20871

Closed

rswgnu mentioned this issue Apr 26, 2019

BUG: DataFrame.sum() changes int columns to float when only int and float columns #26219

Closed

adamhooper mentioned this issue Sep 26, 2019

GroupBy(..., as_index=True).agg() drops index when #28641

Closed

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke added Bug and removed API Design labels May 1, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach closed this as completed Jul 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistencies in groupby aggregation with non-numeric types #13416

Inconsistencies in groupby aggregation with non-numeric types #13416

pijucha commented Jun 9, 2016 •

edited by jreback

jreback commented Jun 10, 2016

jreback commented Jun 10, 2016

sinhrks commented Jun 13, 2016

pijucha commented Jun 13, 2016

saddy001 commented May 31, 2017 •

edited

rhshadrach commented Jul 15, 2023

Inconsistencies in groupby aggregation with non-numeric types #13416

Inconsistencies in groupby aggregation with non-numeric types #13416

Comments

pijucha commented Jun 9, 2016 • edited by jreback

Settings:

Issues:

Some other issues:

What should be the expected output?

output of pd.show_versions()

Just a thought about a possible approach:

jreback commented Jun 10, 2016

jreback commented Jun 10, 2016

sinhrks commented Jun 13, 2016

pijucha commented Jun 13, 2016

saddy001 commented May 31, 2017 • edited

rhshadrach commented Jul 15, 2023

pijucha commented Jun 9, 2016 •

edited by jreback

output of `pd.show_versions()`

saddy001 commented May 31, 2017 •

edited