Proper support of nullable dtypes as the Categorical dtype #50711

Open
Dr-Irv opened this issue Jan 12, 2023 · 8 comments
Labels
Categorical (Categorical Data Type), Ice Cream Agreement (Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@Dr-Irv
Contributor

Dr-Irv commented Jan 12, 2023

Now that Categorical depends on ExtensionArray, it makes more sense to return and output pd.NA as a missing value instead of np.nan.

Propose that we announce in 2.0 release that this will change in a future release. Not clear if/how we create a deprecation message here.

Current behavior:

>>> import pandas as pd
>>> c = pd.Categorical(["a", "a", "b", "c", "c"], categories=["a", "b", "c"])
>>> c
['a', 'a', 'b', 'c', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> s = pd.Series(c)
>>> s
0    a
1    a
2    b
3    c
4    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.iloc[2] = pd.NA
>>> s.iloc[2]
nan

@Dr-Irv changed the title from "DEP: For v2.0, Deprecate using np.nan as the missing value type in pd.Categorical and use pd.NA instead" to "DEPR: For v2.0, Deprecate using np.nan as the missing value type in pd.Categorical and use pd.NA instead" on Jan 12, 2023

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 12, 2023

Not clear how to track this in #50578 since we should decide that we want to do this.

@jbrockmendel
Member

I'd be more inclined toward #29962, which would mean getting pd.NA in a targeted subset of cases.

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 12, 2023

> I'd be more inclined toward #29962, which would mean getting pd.NA in a targeted subset of cases.

The idea in #29962 is to make the NA value dependent on the underlying dtype of the categorical. But there is also a point made in a comment there (#29962 (comment)) that we should only use pd.NA independent of the underlying dtype. (I agree with this)

In either case, I think we need to make a decision and figure out how to do a deprecation notice. Or as you (@jbrockmendel ) suggested in another comment in that issue (#29962 (comment)), just bite the bullet and make the change now for 2.0.

@jbrockmendel
Member

> that we should only use pd.NA independent of the underlying dtype

-1. As long as we distinguish between pd.NA and nan etc (xref #32265), this is a semantic change. Besides which getting pd.NA is a PITA.

Another alternative would be #37930 which would let users specify. That would likely be the biggest breaking change.

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 12, 2023

> Besides which getting pd.NA is a PITA.

Can you explain why getting pd.NA is a PITA in this context?

And why would we need to distinguish between np.nan and pd.NA for a categorical?

@jorisvandenbossche
Member

> The idea in #29962 is to make the NA value dependent on the underlying dtype of the categorical. But there is also a point made in a comment there (#29962 (comment)) that we should only use pd.NA independent of the underlying dtype. (I agree with this)

I also agree with that (that we should move to using only pd.NA for all dtypes), but as long as we still have dtypes that don't use pd.NA (which are actually the default), I think the most logical thing to do is to let Categorical follow the dtype of its categories. If it represents numpy-dtype based categories, use np.nan (and maybe pd.NaT for datetime), and if it represents nullable data (a "nullable categorical dtype"), then we should use pd.NA as the missing value scalar.

That way, people who start using the nullable dtypes are assured that they keep nullable dtypes (and matching missing value scalars), while people who didn't opt in to nullable dtypes just keep the current behaviour.
(In theory, a user who just uses the defaults should currently never see pd.NA, as that is still fully opt-in.)

(For a long time, a first step was actually to support nullable categories at all, but I suppose that was "fixed" now that we can support EAs in the Index?)
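
As a rough illustration of the "follow the dtype of the categories" rule described above, here is a minimal sketch. The helper name `proposed_na_scalar` is hypothetical and not part of pandas; it only reads the categories' dtype and reports which scalar that rule would choose. It assumes a pandas version in which categories can be stored with a nullable dtype such as Int64.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def proposed_na_scalar(cat_dtype):
    """Hypothetical helper (not a pandas API): choose the missing-value
    scalar from the dtype of the categories, as suggested in this thread."""
    categories_dtype = cat_dtype.categories.dtype
    if isinstance(categories_dtype, ExtensionDtype):
        # Extension dtypes declare their own scalar, e.g. pd.NA for Int64.
        return categories_dtype.na_value
    if categories_dtype.kind in "mM":
        # NumPy datetime64 / timedelta64 categories.
        return pd.NaT
    # Any other NumPy-backed categories (float64, int64, object, ...).
    return np.nan


nullable_cat = pd.Series([1, 2, pd.NA], dtype="Int64").astype("category")
numpy_cat = pd.Series([1.5, 2.5, None]).astype("category")
datetime_cat = pd.Series(pd.to_datetime(["2023-01-12", None])).astype("category")

for series in (nullable_cat, numpy_cat, datetime_cat):
    print(series.cat.categories.dtype, "->", repr(proposed_na_scalar(series.dtype)))
```

This only describes the proposal; it does not change what `s.iloc[i]` returns for a missing entry today.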

@lithomas1 added the Missing-data, Categorical, Deprecate, and Needs Discussion labels on Feb 8, 2023

@Dr-Irv changed the title from "DEPR: For v2.0, Deprecate using np.nan as the missing value type in pd.Categorical and use pd.NA instead" to "Proper support of nullable dtypes as the Categorical dtype" on Feb 8, 2023

@Dr-Irv
Contributor Author

Dr-Irv commented Feb 8, 2023

I changed the title of the issue. Here is a summary of the discussion on Feb 8, 2023:
Joris: The way forward is to first properly support "nullable categorical dtypes" (a categorical dtype with nullable categories): comparison operations, the missing value scalar, convert_dtypes (and everywhere the use_nullable_dtypes parameter exists).
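
To make those areas concrete, the following sketch probes what an installed pandas currently does for each of them. The function name `probe_nullable_categorical` is hypothetical, and the sketch deliberately records results rather than asserting them, since the behaviour may differ between pandas versions.

```python
import pandas as pd


def probe_nullable_categorical():
    """Hypothetical probe: record current behaviour for a categorical
    built from nullable Int64 data, one entry per area listed above."""
    s = pd.Series([1, 2, pd.NA], dtype="Int64")
    s_cat = s.astype("category")
    return {
        # Dtype used to store the categories themselves.
        "categories_dtype": str(s_cat.cat.categories.dtype),
        # Comparison operations: result dtype of an equality check.
        "comparison_dtype": str((s_cat == 1).dtype),
        # Missing value scalar returned by scalar access.
        "missing_scalar": repr(s_cat.iloc[2]),
        # convert_dtypes: resulting dtype for the categorical column.
        "convert_dtypes_dtype": str(s_cat.convert_dtypes().dtype),
    }


for area, observed in probe_nullable_categorical().items():
    print(f"{area}: {observed}")
```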

@Dr-Irv removed the Deprecate and Needs Discussion labels on Feb 8, 2023

@jorisvandenbossche
Member

So while a Categorical can now store categories using a nullable dtype, there are still a variety of aspects that don't follow the expected behaviour for "nullable dtypes" (see comment above). Just as a quick illustration of the comparison case:

In [25]: s = pd.Series([1, 2, pd.NA], dtype="Int64")

In [26]: s_cat = s.astype("category")

In [27]: s == 1
Out[27]: 
0     True
1    False
2     <NA>
dtype: boolean

In [28]: s_cat == 1
Out[28]: 
0     True
1    False
2    False
dtype: bool

I expect a "nullable" categorical column to give the same result as for the non-categorical s == 1.
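
Until that is supported, the expected result can be approximated by hand. This is only a sketch of a user-side workaround, not how pandas would implement it; the helper name `nullable_eq` is hypothetical. It converts the comparison result to the nullable "boolean" dtype and then masks the positions where the categorical is missing.

```python
import pandas as pd


def nullable_eq(cat_series, value):
    """Hypothetical workaround: equality on a categorical Series that
    propagates missing values instead of returning False for them."""
    # Convert the plain comparison result to the nullable boolean dtype.
    result = (cat_series == value).astype("boolean")
    # Overwrite positions that were missing in the input with pd.NA.
    result[cat_series.isna()] = pd.NA
    return result


s = pd.Series([1, 2, pd.NA], dtype="Int64")
s_cat = s.astype("category")

print(nullable_eq(s_cat, 1))
```

For this input, the result has dtype boolean and is missing in the last position, which matches the non-categorical `s == 1` shown above.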

@jbrockmendel added the Ice Cream Agreement label on Oct 26, 2023

4 participants