Decimal fields dropped in group by with more than one column #22275

Closed
drudd opened this issue Aug 10, 2018 · 11 comments
Labels
Bug · ExtensionArray (extending pandas with custom dtypes or arrays) · Groupby

Comments

@drudd
Contributor

drudd commented Aug 10, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import decimal

df = pd.DataFrame({'a': [decimal.Decimal('4.56')] * 6,
                   'b': list(range(3, 6)) * 2,
                   'c': range(6)})
print(df.groupby('b')['a'].sum())
print(df.groupby('b')['a', 'c'].sum())
print(df.groupby('b').agg({'a': 'sum', 'c': 'sum'}))

Output:

b
3    9.12
4    9.12
5    9.12
Name: a, dtype: object
   c
b
3  3
4  5
5  7
      a  c
b
3  9.12  3
4  9.12  5
5  9.12  7

Problem description

The aggregation over column 'a' is dropped when another column is selected from the groupby object, but it works when requested through agg (I'm not sure whether 'sum' is exactly equivalent to .sum() in the above).
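As a quick check of that parenthetical, on a current pandas the string alias and the method call can be compared directly. This is a small sketch, not from the original report:

```python
import decimal
import pandas as pd

df = pd.DataFrame({
    'a': [decimal.Decimal('4.56')] * 6,
    'b': list(range(3, 6)) * 2,
    'c': range(6),
})

via_method = df.groupby('b')['a'].sum()       # method call
via_string = df.groupby('b')['a'].agg('sum')  # string alias

# On current pandas the string alias dispatches to the groupby's own
# sum, so the two results should match.
print(via_method.equals(via_string))
```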

Expected Output

b
3    9.12
4    9.12
5    9.12
Name: a, dtype: object
      a  c
b
3  9.12  3
4  9.12  5
5  9.12  7
      a  c
b
3  9.12  3
4  9.12  5
5  9.12  7

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.8.0
xarray: None
IPython: 5.7.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.4
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@drudd
Contributor Author

drudd commented Aug 10, 2018

Tracing through, the 'a' column appears to be dropped by this section of NDFrameGroupBy._cython_agg_blocks:

        if numeric_only:
            data = data.get_numeric_data(copy=False)

Specifically this line:

data = data.get_numeric_data(copy=False)

This suggests to me that there's a mismatch between the logic that sets numeric_only (True) and .get_numeric_data, which skips the Decimal field because it is stored as object dtype.
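The public select_dtypes filter illustrates the same exclusion the internal get_numeric_data call performs: Decimal values live in an object-dtype column, so any numeric-only selection drops them. A small sketch, not from the thread:

```python
import decimal
import pandas as pd

df = pd.DataFrame({
    'a': [decimal.Decimal('4.56')] * 6,  # Decimals are stored as object dtype
    'b': list(range(3, 6)) * 2,
    'c': range(6),
})

print(df.dtypes)
# Since 'a' is object dtype, a numeric-only selection excludes it:
numeric_part = df.select_dtypes(include='number')
print(list(numeric_part.columns))
```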

@WillAyd
Member

WillAyd commented Aug 10, 2018

There is no native support for Decimal types in NumPy and therefore it is also generally weakly supported in pandas. In theory a community-contributed Extension would help improve that.

In the meantime, though, I do agree that it's strange that there is a difference between the Series and DataFrame GroupBy mechanisms. Investigation and PRs to align that behavior are certainly welcome.

@drudd
Contributor Author

drudd commented Aug 10, 2018

My above comment appears to be correct: numeric_only=True is set for sum when that groupby function is created:

cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)

This can be seen by using a groupby function that does not set numeric_only, such as min:

df.groupby('b')['a', 'c'].min()

produces

      a  c
b
3  4.56  0
4  4.56  1
5  4.56  2

Not knowing this API very closely, I'm surprised that an exception isn't raised earlier in the process when one of the columns is non-numeric (and I'd be happy if it did so rather than silently dropping data).

@drudd
Contributor Author

drudd commented Aug 10, 2018

Another tactic would be to remove the get_numeric_data call and rely on the try/except later in the loop, which should raise if there is no appropriate aggregation function.

I have tested this and it restores consistency between the two methods. Happy to PR if we can be confident this won't break other logic.

@WillAyd
Member

WillAyd commented Aug 11, 2018

If you have something that works and doesn't break other tests locally, feel free to submit it as a PR.

@WillAyd WillAyd added this to the Contributions Welcome milestone Aug 11, 2018
@drudd
Contributor Author

drudd commented Aug 11, 2018

Unfortunately it causes a series of test failures, so the solution isn't quite so simple.

@drudd
Contributor Author

drudd commented Aug 11, 2018

This bug is similar to #21664, and I think the failing tests may be expecting the behavior documented here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html#automatic-exclusion-of-nuisance-columns
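The linked docs describe the automatic exclusion of non-numeric "nuisance" columns. A minimal sketch of that exclusion follows; note it uses the explicit numeric_only=True spelling required on current pandas, since the silent default exclusion the docs described was later removed:

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['x', 'x', 'y'],
    'label': ['a', 'b', 'c'],  # non-numeric "nuisance" column
    'val': [1, 2, 3],
})

# With numeric_only=True, the non-numeric 'label' column is excluded
# from the aggregation; at the time of this issue that exclusion
# happened silently by default.
result = df.groupby('key').mean(numeric_only=True)
print(list(result.columns))
```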

@jschendel
Member

> There is no native support for Decimal types in NumPy and therefore it is also generally weakly supported in pandas. In theory a community-contributed Extension would help improve that.

As a side note, we actually have this within our test suite, albeit in a fledgling state:

import decimal
import numbers
import sys

import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray, ExtensionScalarOpsMixin

# DecimalDtype is defined alongside this class in the pandas test suite.

class DecimalArray(ExtensionArray, ExtensionScalarOpsMixin):
    dtype = DecimalDtype()

    def __init__(self, values, dtype=None, copy=False):
        for val in values:
            if not isinstance(val, self.dtype.type):
                raise TypeError("All values must be of type " +
                                str(self.dtype.type))
        values = np.asarray(values, dtype=object)

        self._data = values
        # Some aliases for common attribute names to ensure pandas supports
        # these
        self._items = self.data = self._data
        # those aliases are currently not working due to assumptions
        # in internal code (GH-20735)
        # self._values = self.values = self.data

    @classmethod
    def _from_sequence(cls, scalars, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, numbers.Integral):
            return self._data[item]
        else:
            return type(self)(self._data[item])

    def take(self, indexer, allow_fill=False, fill_value=None):
        from pandas.api.extensions import take

        data = self._data
        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value

        result = take(data, indexer, fill_value=fill_value,
                      allow_fill=allow_fill)
        return self._from_sequence(result)

    def copy(self, deep=False):
        if deep:
            return type(self)(self._data.copy())
        return type(self)(self)

    def __setitem__(self, key, value):
        if pd.api.types.is_list_like(value):
            value = [decimal.Decimal(v) for v in value]
        else:
            value = decimal.Decimal(value)
        self._data[key] = value

    def __len__(self):
        return len(self._data)

    def __repr__(self):
        return 'DecimalArray({!r})'.format(self._data)

    @property
    def nbytes(self):
        n = len(self)
        if n:
            return n * sys.getsizeof(self[0])
        return 0

    def isna(self):
        return np.array([x.is_nan() for x in self._data], dtype=bool)

    @property
    def _na_value(self):
        return decimal.Decimal('NaN')

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([x._data for x in to_concat]))

DecimalArray._add_arithmetic_ops()
DecimalArray._add_comparison_ops()

This doesn't quite get the job done though, as there's currently no way to dynamically alter ExtensionBlock.is_numeric, which is always False and is what get_numeric_data ultimately looks at for DecimalArray.

Will open a separate issue for the above though, as it's only tangentially related to this issue, and using DecimalArray would really only resolve this issue for the Decimal case (i.e. a more generic solution for object dtype might be nice?).

@jreback
Contributor

jreback commented Sep 19, 2018

I think this will just work after #22762 (if constructed as a DecimalArray)

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Sep 19, 2018
@adisunw

adisunw commented Sep 3, 2019

Any fix for this? Just ran into this same problem.
Edit: Constructing the series as a DecimalArray still does not solve it.
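For anyone hitting this on an affected version: the per-column agg mapping from the original report kept the Decimal column in the result, so it can serve as an interim workaround. A sketch:

```python
import decimal
import pandas as pd

df = pd.DataFrame({
    'a': [decimal.Decimal('4.56')] * 6,
    'b': list(range(3, 6)) * 2,
    'c': range(6),
})

# Routing through .agg with an explicit per-column mapping avoided the
# numeric-only fast path that dropped 'a' in the affected versions.
result = df.groupby('b').agg({'a': 'sum', 'c': 'sum'})
print(result)
```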

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 18.6.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.4
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@mroeschke mroeschke added the Bug label Jun 21, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rhshadrach
Member

This was resolved by #46560
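On pandas releases containing that fix, the originally failing multi-column selection keeps the Decimal column. A verification sketch (it uses the list-of-columns spelling, since the tuple indexing in the original report has since been removed):

```python
import decimal
import pandas as pd

df = pd.DataFrame({
    'a': [decimal.Decimal('4.56')] * 6,
    'b': list(range(3, 6)) * 2,
    'c': range(6),
})

# The Decimal column 'a' is no longer dropped when more than one
# column is selected from the groupby object.
result = df.groupby('b')[['a', 'c']].sum()
print(result)
```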
