numeric_only compatibility with pandas=1.5 #9471

Open
jrbourbeau opened this issue Sep 7, 2022 · 7 comments

@jrbourbeau
Member

jrbourbeau commented Sep 7, 2022

The upcoming pandas=1.5 release contains several deprecations around pandas' use of numeric_only (xref pandas-dev/pandas#46560). All of these changes are backwards compatible, so user code shouldn't break, but users will start getting lots of deprecation warnings from dask's internal use of pandas (which is scattered throughout dask.dataframe). Ideally, dask would emit the same deprecation pandas does and users wouldn't see any deprecations due to dask's internal use of pandas.

There have been a few attempts at adding numeric_only compatibility to dask with pandas=1.5 (#9269, #9271, #9241) but, as @rhshadrach highlights in pandas-dev/pandas#46560 (comment), exactly what the current default value of numeric_only is, and when a deprecation warning is emitted, isn't super straightforward.

Opening a dedicated issue so we don't lose track of this.

cc @rhshadrach as he appears to be leading the numeric_only changes upstream in pandas
cc @mroeschke @jorisvandenbossche in case you have thoughts

@rhshadrach

Ideally, dask would emit the same deprecation pandas does and users wouldn't see any deprecations due to dask's internal use of pandas.

I'm guessing that it is not easy / feasible to separate off internal vs non-internal use of pandas, as otherwise you'd be able to wrap internal uses with warnings.catch_warnings. Is that right?
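
A minimal sketch of that idea, using only the standard library. The names `upstream_call`, `_internal_use`, and `public_api` are hypothetical stand-ins, not dask or pandas functions. The internal call is wrapped in warnings.catch_warnings so the upstream FutureWarning is swallowed, while the user-facing function emits a single warning of its own:

```python
import warnings


def upstream_call():
    # Hypothetical stand-in for a pandas call that emits a deprecation warning.
    warnings.warn("numeric_only default is deprecated", FutureWarning, stacklevel=2)
    return 42


def _internal_use():
    # Library-internal call: silence the upstream FutureWarning for this block only.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        return upstream_call()


def public_api(numeric_only=None):
    # User-facing entry point: warn once ourselves when the default is relied on.
    if numeric_only is None:
        warnings.warn(
            "The default value of numeric_only will change; pass it explicitly.",
            FutureWarning,
            stacklevel=2,
        )
    return _internal_use()
```

The filter change is scoped to the `with` block, so warnings raised elsewhere in user code are unaffected. This only works where internal and user-facing calls can be told apart, which is the open question above.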

@jrbourbeau jrbourbeau self-assigned this Sep 12, 2022
@github-actions github-actions bot added the "needs attention" label Oct 17, 2022
@rhshadrach

@jrbourbeau - not sure if this is important, but all the deprecations involving numeric_only have been enforced on the main branch of pandas.

@jrbourbeau
Member Author

Thanks for the heads up @rhshadrach! I'm seeing some test failures in our upstream build (xref #9736) that I think are related. I know you led the numeric_only efforts on the pandas side -- do you happen to have interest / bandwidth to help make the corresponding changes on the dask side? (no obligation though)

@rhshadrach

Happy to assist in summarizing/understanding the pandas changes, and fixing pandas if something is off, but I don't have the bandwidth to make changes to dask.

@jrbourbeau
Member Author

No worries @rhshadrach -- totally understand

@charlesbluca is this something you have interest in / bandwidth to work on?

@j-bennet
Contributor

I slightly modified the code from pandas-dev/pandas#46072 to compare pandas 2.0 vs 1.5 behavior for aggregations on a DataFrameGroupBy object. Here is my version:

import pandas as pd
import warnings
import itertools as it


if __name__ == "__main__":
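    numeric = [1, 1]  # numeric column
    nonnumeric_noagg = [object, object]  # object column that cannot be aggregated
    nonnumeric_agg = ["2", "2"]  # string column that some aggregations accept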
    for agg_func in ('sum', 'mean', 'std'):
        output = []
        print("=" * 60)
        print(f"Agg: {agg_func}")
        for has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg in it.product(['x', ''], repeat=3):
            for numeric_only in [True, False, None]:
                row = [has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg, numeric_only, "", "", ""]
                df = pd.DataFrame({"A": [1, 1]})
                if has_numeric:
                    df["B"] = numeric
                if has_nonnumeric_agg:
                    df["C"] = nonnumeric_agg
                if has_nonnumeric_noagg:
                    df["D"] = nonnumeric_noagg
                warning_msg = ""
                try:
                    with warnings.catch_warnings(record=True) as w:
                        result = getattr(df.groupby("A"), agg_func)(numeric_only=numeric_only)
                        if len(w) > 0:
                            warning_msg = {str(x.message)[:20] for x in w}
                except TypeError:
                    row[-3] = "x"  # type_err column
                else:
                    row[-2] = result.columns.tolist()  # cols column
                    row[-1] = warning_msg  # warning column
                output.append(row)
        out = (pd.DataFrame(
                output,
                columns=["has_num", "has_str", "has_obj", "numeric_only", "type_err", "cols", "warning"])
               .sort_values(by="numeric_only").reset_index(drop=True))
        print(out)

For example, the following output is with Pandas 1.5, for std agg:

Agg: std
   has_num has_str has_obj numeric_only type_err    cols                 warning
0        x       x       x        False           [B, C]  {Dropping invalid col}
1        x       x                False           [B, C]
2        x               x        False              [B]  {Dropping invalid col}
3        x                        False              [B]
4                x       x        False              [C]  {Dropping invalid col}
5                x                False              [C]
6                        x        False        x
7                                 False               []
8        x       x       x         True              [B]
9        x       x                 True              [B]
10       x               x         True              [B]
11       x                         True              [B]
12               x       x         True        x
13               x                 True        x
14                       x         True        x
15                                 True               []
16       x       x       x         None           [B, C]  {Dropping invalid col}
17       x       x                 None           [B, C]
18       x               x         None              [B]  {Dropping invalid col}
19       x                         None              [B]
20               x       x         None              [C]  {Dropping invalid col}
21               x                 None              [C]
22                       x         None        x
23                                 None               []

And this is std with Pandas 2.0:

Agg: std
   has_num has_str has_obj numeric_only type_err    cols warning
0        x       x       x        False        x
1        x       x                False           [B, C]
2        x               x        False        x
3        x                        False              [B]
4                x       x        False        x
5                x                False              [C]
6                        x        False        x
7                                 False               []
8        x       x       x         True              [B]
9        x       x                 True              [B]
10       x               x         True              [B]
11       x                         True              [B]
12               x       x         True        x
13               x                 True        x
14                       x         True        x
15                                 True               []
16       x       x       x         None        x
17       x       x                 None           [B, C]
18       x               x         None        x
19       x                         None              [B]
20               x       x         None        x
21               x                 None              [C]
22                       x         None        x
23                                 None               []

So std behavior in 1.5 was as follows:

  • TypeError happens when all columns would be dropped by numeric_only
  • if numeric_only is False/None, non-aggregatable columns are dropped with a warning

In 2.0:

  • TypeError happens when all columns would be dropped by numeric_only
  • TypeError also happens when a non-aggregatable column is present
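
To check a single row of the tables above on whichever pandas version is installed, a smaller probe can be used. This is a sketch only: `probe` is a hypothetical helper, and catching ValueError alongside TypeError is an assumption in case the exception type differs between versions (the tables above only report TypeError).

```python
import warnings

import pandas as pd


def probe(df, agg_func, numeric_only):
    """Run one groupby aggregation; report result columns, warnings, and any error."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        try:
            result = getattr(df.groupby("A"), agg_func)(numeric_only=numeric_only)
        except (TypeError, ValueError) as exc:
            # Assumption: the exception type may vary between pandas versions.
            return {"cols": None, "warnings": [], "error": type(exc).__name__}
    messages = [str(w.message) for w in caught]
    return {"cols": result.columns.tolist(), "warnings": messages, "error": None}


# One numeric column (B) and one non-aggregatable object column (D).
df = pd.DataFrame({"A": [1, 1], "B": [1, 1], "D": [object, object]})

# numeric_only=True keeps only B, matching rows 8-11 of both tables.
print(probe(df, "std", numeric_only=True))
```

Calling `probe(df, "std", numeric_only=False)` exercises the case that differs between 1.5 and 2.0: a dropped column with a warning versus an error.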

Not all aggs behave the same, so the rest of the groupby aggs still need to be checked.

@rhshadrach

  • TypeError happens when all columns would be dropped by numeric_only

I think this is a bug. When as_index=True it should return an empty DataFrame with the index being the groups. When as_index=False the groups should be in the columns instead. I've opened pandas-dev/pandas#51080.
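
To make those two shapes concrete, here is a hand-built sketch of the results described above. The frames are constructed directly rather than produced by groupby, and the example frame and variable names are illustrative only:

```python
import pandas as pd

# A frame with a grouping column and no numeric value columns.
df = pd.DataFrame({"A": [1, 1], "C": ["2", "2"]})
groups = pd.Index(df["A"].unique(), name="A")

# as_index=True: no value columns, groups in the index.
expected_as_index_true = pd.DataFrame(index=groups)

# as_index=False: groups appear as a regular column instead.
expected_as_index_false = pd.DataFrame({"A": groups.to_numpy()})
```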

@github-actions github-actions bot removed the "needs attention" label May 27, 2024

3 participants