Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies in groupby aggregation with non-numeric types #13416

Closed
pijucha opened this issue Jun 9, 2016 · 6 comments
Closed

Inconsistencies in groupby aggregation with non-numeric types #13416

pijucha opened this issue Jun 9, 2016 · 6 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby

Comments

@pijucha
Copy link
Contributor

pijucha commented Jun 9, 2016

xref #13992

Some of the issues and inconsistencies I noticed.
(Apologies for a bit lengthy input.)

Settings:
df = pd.DataFrame({"A": [2, 1, 2, 2],
                   "B": [3, 3, 4, 4],
                   "C1": pd.Categorical([5, 6, 6, 6], [5, 6, 7]),
                   "C2": pd.Categorical(list("aaab"), list("abc")),
                   "D": pd.date_range('2011-11-11', periods=4),
                   "E": [10, 20, 30, 40]})
df
Out[7]: 
   A  B C1 C2          D   E
0  2  3  5  a 2011-11-11  10
1  1  3  6  a 2011-11-12  20
2  2  4  6  a 2011-11-13  30
3  2  4  6  b 2011-11-14  40

df.dtypes
Out[8]: 
A              int64
B              int64
C1          category
C2          category
D     datetime64[ns]
E              int64
dtype: object

df_ac = df[['A', 'C1', 'C2']]
df_ad = df[['A', 'D']]
df_abc = df[['A', 'B', 'C1', 'C2']]
df_abd = df[['A', 'B', 'D']]
df_abe = df[['A', 'B', 'E']]
df_acd = df[['A', 'C1', 'C2', 'D']]
df_abcd = df[['A', 'B', 'C1', 'C2', 'D']]

Usually, non-numeric types are skipped in aggregation functions:

df.groupby('A').mean()
Out[16]: 
          B          E
A                     
1  3.000000  20.000000
2  3.666667  26.666667

df.groupby('A').sum()
Out[17]: 
    B   E
A        
1   3  20
2  11  80

df.groupby(['A', 'B']).mean()
Out[18]: 
      E
A B    
1 3  20
2 3  10
  4  35

df.groupby(['A', 'B'], as_index=False).sum()
Out[19]: 
   A  B   E
0  1  3  20
1  2  3  10
2  2  4  70

Issues:

But if there are no numeric types, an output varies. (I use here subframes df_xxx of the original data frame.)

# .mean() always raises.
df_ac.groupby('A').mean()                               # (1)
pandas.core.base.DataError: No numeric types to aggregate

# .sum() adds categoricals
df_ac.groupby('A').sum()                                # (2)
Out[21]: 
   C1   C2
A         
1   6    a
2  17  aab

# and tries to do something with datetimes.
df_ad.groupby('A').sum()                                # (3)
Out[22]: 
           D
A           
1 2011-11-12
2        NaT

df_acd.groupby('A').sum()                               # (4)
Out[23]: 
    C1   C2                    D
A                               
1  6.0    a  2011-11-12 00:00:00
2  NaN  NaN                  NaN

# It's even worse for multiple groupers.
df_abcd.groupby(['A', 'B']).sum()                       # (5)
Out[24]: 
A  B    
1  3  C1                      6
      C2                      a
      D     2011-11-12 00:00:00
2  3  C1                      5
      C2                      a
      D     2011-11-11 00:00:00
dtype: object

df_abcd.groupby(['A', 'B'], as_index=False).sum()       # (6)
Out[25]: 
0  A                       1
   B                       3
   C1                      6
   C2                      a
   D     2011-11-12 00:00:00
1  A                       2
   B                       3
   C1                      5
   C2                      a
   D     2011-11-11 00:00:00
2  A                       4
   B                       8
dtype: object

# Additionally, the index is not reset here and the grouper columns are transformed.
df_abc.groupby(['A', 'B'], as_index=False).sum()        # (7)
Out[26]: 
     A  B  C1  C2
A B              
1 3  1  3   6   a
2 3  2  3   5   a
  4  4  8  12  ab

# Sometimes an empty data frame is returned  (a reasonable output):
df_abd.groupby(['A', 'B']).sum()                        # (8)
Out[27]: 
Empty DataFrame
Columns: []
Index: []

# but not always:
df_abd.groupby(['A', 'B'], as_index=False).sum()        # (9)
Out[28]: 
0  A                      1
   B                      3
   D    2011-11-12 00:00:00
1  A                      2
   B                      3
   D    2011-11-11 00:00:00
2  A                      4
   B                      8
dtype: object

df_abcd.groupby(['A', 'B']).sum()                       # (10)
Out[29]: 
A  B    
1  3  C1                      6
      C2                      a
      D     2011-11-12 00:00:00
2  3  C1                      5
      C2                      a
      D     2011-11-11 00:00:00
dtype: object
Some other issues:

Multiple groupers with a categrical one (it's already addressed in #13204).

df.groupby(['A', 'C1'], as_index=False).sum()           # (11)
Out[36]: 
       A  C1   B   E
A C1                
1 5  NaN NaN NaN NaN
  6  NaN NaN NaN NaN
  7  NaN NaN NaN NaN
2 5  NaN NaN NaN NaN
  6  NaN NaN NaN NaN
  7  NaN NaN NaN NaN

apply

# mean() as expected:
df.groupby(['A', 'B']).apply(lambda x: np.mean(x))
Out[30]: 
       A    B     E
A B                
1 3  1.0  3.0  20.0
2 3  2.0  3.0  10.0
  4  2.0  4.0  35.0

# sum() not:
df.groupby(['A', 'B']).apply(lambda x: np.sum(x))       # (12)
Out[31]: 
A  B    
1  3  A                       1
      B                       3
      C1                      6
      C2                      a
      D     2011-11-12 00:00:00
      E                      20
2  3  A                       2
      B                       3
      C1                      5
      C2                      a
      D     2011-11-11 00:00:00
      E                      10
   4  A                       4
      B                       8
      E                      70
dtype: object

transform

# it's ok with numeric types only
df_abe.groupby(['A', 'B']).transform('sum')
Out[32]: 
    E
0  10
1  20
2  70
3  70

# Doesn't transform anything with mixed types:
df.groupby(['A', 'B']).transform('sum')                 # (13)
Out[33]: 
  C1 C2          D   E
0  5  a 2011-11-11  10
1  6  a 2011-11-12  20
2  6  a 2011-11-13  30
3  6  b 2011-11-14  40

# but someimtes transforms categoricals
df_abc.groupby(['A', 'B']).transform('sum')             # (14)
Out[34]: 
   C1  C2
0   5   a
1   6   a
2  12  ab
3  12  ab

# and sometimes not:
df_abc.groupby(['A', 'B'], as_index=False).transform('sum')     # (15)
Out[35]: 
  C1 C2
0  5  a
1  6  a
2  6  a
3  6  b

What should be the expected output?

Some ideas for aggregation (with sum()) when there's no numeric types (1)-(10):

  1. always raise (as .mean() does)
  2. return an empty DataFrame as in (8)
    (but (a) should .mean() do the same? (b) should groupers be returned when as_index=False?)
  3. ???

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
machine: x86_64

pandas: 0.18.1
Cython: 0.24
numpy: 1.11.0
IPython: 4.0.1

Exactly the same output with:

pandas: 0.18.1+119.gd405bf2
Just a thought about a possible approach:

A call to _python_apply_general inside _python_agg_general
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L795
seems to trigger some of the issues. (It doesn't exclude grouper columns when as_index=False, treats categoricals as their underlying types, and possibly just replicates some actions from earlier part of _python_agg_general.)

The following change affects (and possibly solves, at least partially)
issues (5), (6), (7), (9), (10), (14):

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index bea62e9..0d401e2 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -791,10 +791,7 @@ class _GroupBy(PandasObject, SelectionMixin):
             except TypeError:
                 continue

-        if len(output) == 0:
-            return self._python_apply_general(f)
-
-        if self.grouper._filter_empty_groups:
+        if len(output) > 0 and self.grouper._filter_empty_groups:

             mask = counts.ravel() > 0
             for name, result in compat.iteritems(output):
@jreback
Copy link
Contributor

jreback commented Jun 10, 2016

hmm, yeah this look suspect. although .sum() can be an aggregator (for string columns), and so maybe we should leave it, this should be predictable, so I would make all of these be the same (e.g. raise).

Note I am only referring to the inbuilt methods .sum()/.mean() directly on a grouper, you can always do .apply(lambda x: x.sum()) if you want to operate on a string column.

@jreback
Copy link
Contributor

jreback commented Jun 10, 2016

@jreback jreback added this to the Next Major Release milestone Jun 10, 2016
@sinhrks
Copy link
Member

sinhrks commented Jun 13, 2016

Normal aggregation also has inconsistencies. It excludes categorical if it has timestamp column, includes otherwise.

df.sum()
# A      7
# B     14
# E    100
# dtype: int64

# exclude timestamp col
df[['A', 'B', 'C1', 'C2', 'E']].sum()
# A        7
# B       14
# C1      23
# C2    aaab
# E      100
# dtype: object

Maybe it should be excluded by default even if categorical internal is numeric, and included if numeric_only=True? Thus, groupby agg also should have numeric_only kwd.

@pijucha
Copy link
Contributor Author

pijucha commented Jun 13, 2016

As for normal aggregation, there's also this:

df[['C1']].sum()
Out[10]: 
C1    23
dtype: int64

df['C1'].sum()
...
TypeError: Categorical cannot perform the operation sum

@saddy001
Copy link

saddy001 commented May 31, 2017

Please also consider the following case:
I have a DataFrame of class "A" instances (object type). Instances of A behave much like floats (addition, division, etc. possible), but they don't implement __float__, because I want to evaluate them lazily (very expensive calculation). For this, I call A().GetValue() which evaluates the calculation function.
So I rely on pandas to not enforce converting the A()s to floats. This is working fine for DataFrame.sum(), min() and max(), but not for mean():

If I call DataFrame.mean(axis=1), all values are NaN and the dtype becomes float64, because _ensure_numeric is called that raises "TypeError: Could not convert A(..) to numeric".

@rhshadrach
Copy link
Member

Every op in OP with non-numeric data now raises due to #46560

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
Development

No branches or pull requests

7 participants