
BUG: resample ops includes sampled column #47079

Closed
rhshadrach opened this issue May 21, 2022 · 2 comments · Fixed by #47107

@rhshadrach
Member

This is similar to an example in the DataFrame.resample docstring

import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
print(df.resample('M', on='week_starting').first())

# Gives
               price  volume week_starting
week_starting                             
2018-01-31        10      50    2018-01-07
2018-02-28        14      50    2018-02-04

I am not sure, but it seems to me that the on column is not intended to be included in the aggregation. Because of #46560, this currently throws warnings when .first is replaced by e.g. .sum or .mean, and there is no way to safely avoid them.

Marking this for 1.5 because of the introduced warnings.
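
One way to keep the on column out of the result regardless of pandas version is to select the value columns explicitly before aggregating. This is a minimal sketch and not something proposed in this thread. It assumes pd.offsets.MonthEnd() as a stand-in for the 'M' alias used above, chosen only because the accepted alias strings differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10, 11, 9, 13, 14, 18, 17, 19],
    "volume": [50, 60, 40, 100, 50, 100, 40, 50],
})
df["week_starting"] = pd.date_range("01/01/2018", periods=8, freq="W")

# Assumption: an offset object replaces the 'M' string alias from the report.
month_end = pd.offsets.MonthEnd()

# Selecting the value columns means the on column is never aggregated.
monthly = df.resample(month_end, on="week_starting")[["price", "volume"]].sum()
print(monthly)
```

The explicit selection makes the intent visible in the call itself, so the result does not depend on whether a given pandas version drops the on column automatically.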

@rhshadrach rhshadrach added Bug Resample resample method labels May 21, 2022
@rhshadrach rhshadrach added this to the 1.5 milestone May 21, 2022
@rhshadrach rhshadrach changed the title BUG: resample ops include sampled column BUG: resample ops includes sampled column May 21, 2022
@selasley
Contributor

This change seems to cause a KeyError if the on column is used with agg:

import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
print(df.resample('M', on='week_starting').agg(ws=('week_starting', 'size')))

pandas 1.4.4 outputs

                   ws
    week_starting    
    2018-01-31      4
    2018-02-28      4

pandas 1.5.0 outputs
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/resample.py:449, in Resampler._groupby_and_aggregate(self, how, *args, **kwargs)
    448 else:
--> 449 result = grouped.aggregate(how, *args, **kwargs)
    450 except DataError:
    451 # got TypeErrors on aggregation

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/groupby/generic.py:890, in DataFrameGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    889 op = GroupByApply(self, func, args, kwargs)
--> 890 result = op.agg()
    891 if not is_dict_like(func) and result is not None:

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/apply.py:169, in Apply.agg(self)
    168 if is_dict_like(arg):
--> 169     return self.agg_dict_like()
    170 elif is_list_like(arg):
    171     # we require a list, but not a 'str'

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/apply.py:478, in Apply.agg_dict_like(self)
    476     selection = obj._selection
--> 478 arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
    480 if selected_obj.ndim == 1:
    481     # key only used for output

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/apply.py:596, in Apply.normalize_dictlike_arg(self, how, obj, func)
    595         cols_sorted = list(safe_sort(list(cols)))
--> 596         raise KeyError(f"Column(s) {cols_sorted} do not exist")
    598 aggregator_types = (list, tuple, dict)

KeyError: "Column(s) ['week_starting'] do not exist"

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In [2], line 7
      5 df = pd.DataFrame(d)
      6 df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
----> 7 print(df.resample('M', on='week_starting').agg(ws=('week_starting', 'size')))

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/resample.py:356, in Resampler.aggregate(self, func, *args, **kwargs)
    354 if result is None:
    355     how = func
--> 356     result = self._groupby_and_aggregate(how, *args, **kwargs)
    358 result = self._apply_loffset(result)
    359 return result

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/resample.py:461, in Resampler._groupby_and_aggregate(self, how, *args, **kwargs)
    452     result = grouped.apply(how, *args, **kwargs)
    453 except (AttributeError, KeyError):
    454     # we have a non-reducing function; try to evaluate
    455     # alternatively we want to evaluate only a column of the input
   (...)
    459     #  on Series, raising AttributeError or KeyError
    460     #  (depending on whether the column lookup uses getattr/__getitem__)
--> 461     result = grouped.apply(how, *args, **kwargs)
    463 except ValueError as err:
    464     if "Must produce aggregated value" in str(err):
    465         # raised in _aggregate_named
    466         # see test_apply_without_aggregation, test_apply_with_mutated_index

File ~/venvs/py310/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:1543, in GroupBy.apply(self, func, *args, **kwargs)
   1540         with np.errstate(all="ignore"):
   1541             return func(g, *args, **kwargs)
-> 1543 elif hasattr(nanops, "nan" + func):
   1544     # TODO: should we wrap this in to e.g. _is_builtin_func?
   1545     f = getattr(nanops, "nan" + func)
   1547 else:

TypeError: can only concatenate str (not "NoneType") to str

@rhshadrach
Member Author

Thanks for reporting this, @selasley. For most ops, I don't think including the field passed as on in the result is desired. Indeed, due to the upcoming change to numeric_only in #46560, many ops would now raise by default. One can work around this by adding a second field, e.g.:

import pandas as pd

d = {
    'price': [10, 11, 9, 13, 14, 18, 17, 19],
    'volume': [50, 60, 40, 100, 50, 100, 40, 50]
}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018', periods=8, freq='W')
df['key'] = df['week_starting']
print(df.resample('M', on='key').agg(ws=('week_starting', 'size')))
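
If the goal is only the number of rows per bin, as in the agg call reported above, Resampler.size() may be a simpler alternative because it does not refer to the on column by name. This is a hedged sketch rather than a fix suggested in the thread. pd.offsets.MonthEnd() is assumed here in place of the 'M' alias to avoid version differences in alias strings, and the name "ws" is reused from the report.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10, 11, 9, 13, 14, 18, 17, 19],
    "volume": [50, 60, 40, 100, 50, 100, 40, 50],
})
df["week_starting"] = pd.date_range("01/01/2018", periods=8, freq="W")

# Assumption: an offset object replaces the 'M' string alias from the report.
month_end = pd.offsets.MonthEnd()

# size() counts the rows in each bin without selecting any column.
counts = df.resample(month_end, on="week_starting").size()

# Rename and convert to a one-column frame to mirror the shape of agg(ws=...).
ws = counts.rename("ws").to_frame()
print(ws)
```

This covers only the row-count case. Other aggregations over the on column would still need the duplicated-key workaround shown above.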
