`numeric_only` compatibility with `pandas=1.5` #9471
Comments
I'm guessing that it is not easy / feasible to separate off internal vs non-internal use of pandas, as otherwise you'd be able to wrap internal uses with |
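For illustration, wrapping a single internal call might look like the sketch below. It assumes the standard-library `warnings` filters are acceptable for this purpose. `quiet_numeric_only` and `fake_pandas_sum` are hypothetical names, and the stand-in function only imitates a pandas call that emits a `FutureWarning` mentioning `numeric_only`. The hard part raised above, locating every internal call site, is not addressed here.

```python
import warnings


def fake_pandas_sum(values, numeric_only=None):
    # Hypothetical stand-in for a pandas reduction that warns when
    # numeric_only is left unspecified.
    if numeric_only is None:
        warnings.warn(
            "The default value of numeric_only is deprecated (stand-in message)",
            FutureWarning,
            stacklevel=2,
        )
    return sum(v for v in values if isinstance(v, (int, float)))


def quiet_numeric_only(func, *args, **kwargs):
    # Run func with numeric_only FutureWarnings filtered out; the previous
    # filter state is restored when the block exits.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message=".*numeric_only.*", category=FutureWarning
        )
        return func(*args, **kwargs)


# Record whatever still reaches the "user" while the wrapped call runs.
with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter("always")
    total = quiet_numeric_only(fake_pandas_sum, [1, 2, "x"])
```

Note that `warnings.catch_warnings` changes global state and is documented as not thread-safe, which may matter for threaded schedulers.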
@jrbourbeau - not sure if this is important, but all the deprecations involving |
Thanks for the heads up @rhshadrach! I'm seeing some test failures in our upstream build (xref #9736) that I think are related. I know you lead the |
Happy to assist in summarizing/understanding the pandas changes, and fixing pandas if something is off, but I don't have the bandwidth to make changes to dask. |
No worries @rhshadrach -- totally understand. @charlesbluca, is this something you have interest in / bandwidth to work on? |
I slightly modified the code from pandas-dev/pandas#46072, to compare pandas 2.0 vs 1.5 behavior with aggregations on

```python
import pandas as pd
import warnings
import itertools as it

if __name__ == "__main__":
    numeric = [1, 1]
    nonnumeric_noagg = [object, object]  # object column that cannot be aggregated
    nonnumeric_agg = ["2", "2"]  # string column that some aggregations accept

    for agg_func in ('sum', 'mean', 'std'):
        output = []
        print("=" * 60)
        print(f"Agg: {agg_func}")
        # Try every combination of column kinds ('x' means the column is present).
        for has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg in it.product(['x', ''], repeat=3):
            for numeric_only in [True, False, None]:
                row = [has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg, numeric_only, "", "", ""]
                df = pd.DataFrame({"A": [1, 1]})
                if has_numeric:
                    df["B"] = numeric
                if has_nonnumeric_agg:
                    df["C"] = nonnumeric_agg
                if has_nonnumeric_noagg:
                    df["D"] = nonnumeric_noagg
                warning_msg = ""
                try:
                    # Record warnings instead of printing them.
                    with warnings.catch_warnings(record=True) as w:
                        result = getattr(df.groupby("A"), agg_func)(numeric_only=numeric_only)
                    if len(w) > 0:
                        warning_msg = {str(x.message)[:20] for x in w}
                except TypeError:
                    row[-3] = "x"
                else:
                    row[-2] = result.columns.tolist()
                    row[-1] = warning_msg
                output.append(row)
        # One summary table per aggregation.
        out = (pd.DataFrame(
            output,
            columns=["has_num", "has_str", "has_obj", "numeric_only", "type_err", "cols", "warning"])
            .sort_values(by="numeric_only").reset_index(drop=True))
        print(out)
```

For example, the following output is with Pandas 1.5, for
And this is
So
In 2.0:
Not all aggs behave the same - have to look at the rest of the groupby aggs. |
I think this is a bug. When |
The upcoming `pandas=1.5` release contains several deprecations around `pandas` use of `numeric_only` (xref pandas-dev/pandas#46560). All of these changes are backwards compatible, so user code shouldn't break, but users will start getting lots of deprecation warnings from `dask`'s internal use of `pandas` (which is scattered throughout `dask.dataframe`). Ideally, `dask` would emit the same deprecation `pandas` does and users wouldn't see any deprecations due to `dask`'s internal use of `pandas`.

There have been a few attempts at adding `numeric_only` compatibility to `dask` with `pandas=1.5` (#9269, #9271, #9241) but, as @rhshadrach highlights in pandas-dev/pandas#46560 (comment), exactly what the current default value is for `numeric_only`, and when a deprecation warning is emitted, isn't super straightforward.

Opening a dedicated issue so we don't lose track of this.

cc @rhshadrach as he appears to be leading the `numeric_only` changes upstream in `pandas`

cc @mroeschke @jorisvandenbossche in case you have thoughts
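
To make the "emit the same deprecation once, hide the internal ones" goal concrete, here is a rough sketch using only the standard library. Everything in it is an assumption for illustration: `forward_numeric_only_warning`, `inner_reduction`, and `outer_sum` are hypothetical names, the stand-in function merely imitates an internal `pandas` call, and this is not how `dask` handles the problem.

```python
import functools
import warnings


def inner_reduction(values, numeric_only=None):
    # Hypothetical stand-in for an internal pandas call that warns when
    # numeric_only is left unspecified.
    if numeric_only is None:
        warnings.warn(
            "numeric_only default is deprecated (stand-in message)",
            FutureWarning,
            stacklevel=2,
        )
    return sum(v for v in values if isinstance(v, (int, float)))


def forward_numeric_only_warning(func):
    # Capture warnings raised while func runs, then re-emit them:
    # numeric_only FutureWarnings collapse into a single warning attributed
    # to the caller of func; every other warning is replayed unchanged.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            result = func(*args, **kwargs)
        forwarded = False
        for w in caught:
            is_numeric_only = issubclass(w.category, FutureWarning) and (
                "numeric_only" in str(w.message)
            )
            if is_numeric_only:
                if not forwarded:
                    warnings.warn(str(w.message), FutureWarning, stacklevel=2)
                    forwarded = True
            else:
                warnings.warn_explicit(w.message, w.category, w.filename, w.lineno)
        return result

    return wrapper


@forward_numeric_only_warning
def outer_sum(values):
    # Hypothetical public API that triggers the internal warning twice.
    return inner_reduction(values) + inner_reduction(values)


# The caller sees one warning even though two were raised internally.
with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter("always")
    total = outer_sum([1, 2, "x"])
```

The sketch sidesteps the harder question raised above, namely what the effective default of `numeric_only` is for each method, and `warnings.catch_warnings` is documented as not thread-safe, so a real implementation would need more care.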