Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix StringArray.astype for category dtype #40450

Merged
merged 16 commits into from Apr 2, 2021
Merged

Conversation

siboehm
Copy link
Contributor

@siboehm siboehm commented Mar 15, 2021

This was failing due to elif np.issubdtype(dtype, np.floating) here, which fails for the pandas dtypes.
StringArrays are now being cast to Categorical just as they were before #38530, when the casting still happened inside Block.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks!

pandas/tests/series/methods/test_astype.py Outdated Show resolved Hide resolved
@siboehm
Copy link
Contributor Author

siboehm commented Mar 16, 2021

This uncovered a bit of a larger issue, basically #37626, but for datetimes & period.

Example:

import pandas as pd
t = pd.Series(["1/1/2021", "2/1/2021", None], dtype="period[M]")
s = pd.Series(["1/1/2021", "2/1/2021", None], dtype="string").astype("period[M]")

t work, s raises because the None is converted to NA which isn't recognized as datetimelike by period.

Example

import pandas as pd
t = pd.Series(["1/1/2021", "2/1/2021"], dtype="object").astype("datetime64[ns]")
s = pd.Series(["1/1/2021", "2/1/2021"], dtype="string").astype("datetime64[ns]")

t work, s raises.

However if it's ok I'd open a new issue for it, since it'll be a larger revamp which will take me a few more days.
These "new" issues also aren't regressions like this Categorical issue, they're all already present in 1.2.3.

@jorisvandenbossche
Copy link
Member

Yes, that's a more general issue, and can be left out of scope for this PR.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a whatsnew note for 1.2.4

expected = Series(["A", np.NaN], dtype="category")
tm.assert_series_equal(result, expected)

s = Series(["1/1/2021", "2/1/2021"], dtype="string")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add an example for Timedelta, Datetime w/time zone and Interval (all the EA types)

Copy link
Contributor Author

@siboehm siboehm Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added test for all ExtensionArray dtypes. BooleanArray and IntervalArray I had to exclude since there's no way to to parse them back from a list of strings. TimedeltaArray xfails due to #40478. For PeriodArrayand DatetimeArray the NaT get converted to NA strings. But converting the NA strings back to NaT fails. I added XFails, unless expecting EA ⇒ StringArray ⇒ EA to roundtrip successfully is still up for debate.

@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Mar 16, 2021
@jorisvandenbossche
Copy link
Member

needs a whatsnew note for 1.2.4

This doesn't need a whatsnew, it's only a regression on master

@jreback
Copy link
Contributor

jreback commented Mar 16, 2021

needs a whatsnew note for 1.2.4

This doesn't need a whatsnew, it's only a regression on master

ok great

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Mar 16, 2021
request.node.add_marker(mark)
if NaT in data and dtype in ("period[M]", "datetime64[ns]"):
mark = pytest.mark.xfail(
reason="TODO StringArray.astype() None to dtype.na_value conversion"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an issue for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created one here: #40566 to track this

(["1/1/2021", "2/1/2021"], "period[M]"),
(["1/1/2021", "2/1/2021", NaT], "period[M]"),
(["1 Day", "59 Days", NaT], "timedelta64[ns]"),
# currently no way to parse BooleanArray, IntervalArray from a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BooleanArray can be parsed from string (see _from_sequence_of_strings, the general method)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had an implementation that used _from_sequence_of_strings instead of _from_sequence to in StringArray.astype(). That required bigger code changes. I'd like to merge this Regression PR and then implement to implement _from_sequence_of_strings as part of #40566

@siboehm
Copy link
Contributor Author

siboehm commented Mar 27, 2021

Why does Codecov say 0% coverage for the diff, when the added if-branch has multiple tests running over it?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine, can you add a whatsnew note (bug fixes on 1.3 / conversion or extension types section). ping on greenish (also merge master)

@siboehm
Copy link
Contributor Author

siboehm commented Mar 30, 2021

@jreback added a Whatsnew, should be good for merge.

@jreback jreback merged commit 5cb0916 into pandas-dev:master Apr 2, 2021
@jreback
Copy link
Contributor

jreback commented Apr 2, 2021

thanks @siboehm

vladu pushed a commit to vladu/pandas that referenced this pull request Apr 5, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Apr 17, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Casting string series to category raises TypeError in nightlies
4 participants