BUG: resample ops includes sampled column #47079

rhshadrach · 2022-05-21T03:22:37Z

This is similar to an example in the DataFrame.resample docstring:

import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
# Resample on a column instead of the index; the result also contains that column
print(df.resample('M', on='week_starting').first())

# Gives
               price  volume week_starting
week_starting                             
2018-01-31        10      50    2018-01-07
2018-02-28        14      50    2018-02-04

I am not sure, but it seems to me that the on column is not intended to be used in the aggregation. Because of #46560, this currently throws warnings when .first is replaced by e.g. .sum or .mean, and there is no way to safely avoid them.

Marking this for 1.5 because of the introduced warnings.
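
One thing that may keep the on column out of the result is selecting the value columns explicitly before aggregating. The snippet below is only a sketch: it has not been checked against the development version discussed here, and it uses 'MS' (month start) rather than 'M' purely as an assumption so that the frequency alias is accepted by both older and newer pandas releases.

```python
import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')

# Select only the value columns, so 'week_starting' is used for binning
# but is never handed to the aggregation itself.
# 'MS' is an assumption made for portability; the issue itself uses 'M'.
result = df.resample('MS', on='week_starting')[['price', 'volume']].sum()
print(result)
```

With the selection in place, the aggregation only sees price and volume, so whether the resampler includes the on column by default no longer matters for this call.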


selasley · 2022-10-11T21:03:28Z

This change seems to cause a KeyError if the on column is used with agg:

import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
# Named aggregation that refers to the same column passed as on
print(df.resample('M', on='week_starting').agg(ws=('week_starting', 'size')))

pandas 1.4.4 outputs

                   ws
    week_starting    
    2018-01-31      4
    2018-02-28      4

pandas 1.5.0 outputs
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/resample.py:449, in Resampler._groupby_and_aggregate(self, how, *args, **kwargs)
    448 else:
--> 449     result = grouped.aggregate(how, *args, **kwargs)
    450 except DataError:
    451     # got TypeErrors on aggregation

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/groupby/generic.py:890, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    889 op = GroupByApply(self, func, args, kwargs)
--> 890 result = op.agg()
    891 if not is_dict_like(func) and result is not None:

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/apply.py:169, in Apply.agg(self)
    168 if is_dict_like(arg):
--> 169     return self.agg_dict_like()
    170 elif is_list_like(arg):
    171     # we require a list, but not a 'str'

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/apply.py:478, in Apply.agg_dict_like(self)
    476     selection = obj._selection
--> 478 arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
    480 if selected_obj.ndim == 1:
    481     # key only used for output

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/apply.py:596, in Apply.normalize_dictlike_arg(self, how, obj, func)
    595         cols_sorted = list(safe_sort(list(cols)))
--> 596         raise KeyError(f"Column(s) {cols_sorted} do not exist")
    598 aggregator_types = (list, tuple, dict)

KeyError: "Column(s) ['week_starting'] do not exist"

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In [2], line 7
      5 df = pd.DataFrame(d)
      6 df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
----> 7 print(df.resample('M', on='week_starting').agg(ws=('week_starting', 'size')))

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/resample.py:356, in Resampler.aggregate(self, func, *args, **kwargs)
    354 if result is None:
    355     how = func
--> 356     result = self._groupby_and_aggregate(how, *args, **kwargs)
    358 result = self._apply_loffset(result)
    359 return result

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/resample.py:461, in Resampler._groupby_and_aggregate(self, how, *args, **kwargs)
    452     result = grouped.apply(how, *args, **kwargs)
    453 except (AttributeError, KeyError):
    454     # we have a non-reducing function; try to evaluate
    455     # alternatively we want to evaluate only a column of the input
   (...)
    459     #  on Series, raising AttributeError or KeyError
    460     #  (depending on whether the column lookup uses getattr/__getitem__)
--> 461     result = grouped.apply(how, *args, **kwargs)
    463 except ValueError as err:
    464     if "Must produce aggregated value" in str(err):
    465         # raised in _aggregate_named
    466         # see test_apply_without_aggregation, test_apply_with_mutated_index

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1543, in GroupBy.apply(self, func, *args, **kwargs)
   1540         with np.errstate(all="ignore"):
   1541             return func(g, *args, **kwargs)
-> 1543 elif hasattr(nanops, "nan" + func):
   1544     # TODO: should we wrap this in to e.g. _is_builtin_func?
   1545     f = getattr(nanops, "nan" + func)
   1547 else:

TypeError: can only concatenate str (not "NoneType") to str

rhshadrach · 2022-10-12T00:06:58Z

Thanks for reporting this @selasley. For most ops, I don't think including the column passed to on in the aggregation is desired. Indeed, due to the upcoming change to numeric_only in #46560, many ops would now raise by default. One can work around this by adding a second column to use as the key, e.g.

import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
# Duplicate the datetime column so the copy can serve as the resample key
df['key'] = df['week_starting']
print(df.resample('M', on='key').agg(ws=('week_starting', 'size')))
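
For the specific case of counting rows per bin, there may be no need to reference the on column at all, since the resampler has its own size method. The following is a minimal sketch under that assumption; the output column name ws is taken from the report above, and 'MS' is used instead of 'M' only so that the alias is accepted across pandas versions.

```python
import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')

# size() counts the rows that fall into each bin and returns a Series,
# so no column (including the on column) has to be named in an agg spec.
counts = df.resample('MS', on='week_starting').size()

# Turn the Series into a one-column frame named like the original output.
result = counts.rename('ws').to_frame()
print(result)
```

This only covers size; aggregations that need the datetime values themselves (for example the earliest timestamp per bin) would still need the duplicated key column shown above.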

rhshadrach added Bug Resample resample method labels May 21, 2022

rhshadrach added this to the 1.5 milestone May 21, 2022

rhshadrach changed the title ~~BUG: resample ops include sampled column~~ BUG: resample ops includes sampled column May 21, 2022

This was referenced May 21, 2022

BUG: validate_docstrings has many warnings #44642

Closed

BUG: Resampler attempts to aggregate the on column #47107

Merged

jreback closed this as completed in #47107 May 27, 2022

zmoon mentioned this issue Dec 31, 2022

Regulatory calculations do not work with latest pandas version NOAA-CSL/MELODIES-MONET#153

Closed
