Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Fix interpolation on empty dataframe #35543

Merged
merged 8 commits into from Aug 17, 2020

Conversation

sanderland
Copy link
Contributor

@sanderland sanderland commented Aug 4, 2020

Interpolation on an empty dataframe broke in 1.1 due to a change in how 'all columns are objects' is checked (specifically all(empty set) is True, while before dtype count object = None was checked against size = 0).
This is a complex function and I'm not sure what the proper fix is, suggesting to keep the empty check out of the rest of the logic.

Example code that broke:

import pandas as pd
df = pd.DataFrame([1,2])
df[[]].interpolate(limit_area='inside')

Interpolation on an empty dataframe broke in 1.1 due to a change in how 'all columns are objects' is checked (specifically all(empty set) is True, while before dtype count object = None was checked against size = 0).
This is a complex function and I'm not sure what the proper fix is, suggesting to keep the empty check out of the rest of the logic.
@pep8speaks
Copy link

pep8speaks commented Aug 4, 2020

Hello @sanderland! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-07 19:37:15 UTC

@sanderland sanderland marked this pull request as ready for review August 4, 2020 13:34
@WillAyd
Copy link
Member

WillAyd commented Aug 4, 2020

Can you add a test case for this?

@WillAyd WillAyd added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Aug 4, 2020
@sanderland
Copy link
Contributor Author

Can you add a test case for this?

done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess this was a regression, can you add a note in 1.1.1

pandas/tests/frame/methods/test_interpolate.py Outdated Show resolved Hide resolved
@@ -6799,6 +6799,9 @@ def interpolate(

obj = self.T if should_transpose else self

if obj.empty:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer that this happens in the internal method itself (called on L6861)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error happens before this call (on line 6825)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were a few changes to interpolate in 1.1.0. will run git bisect to ascertain why this is now failing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a subtle change in #34752 where

if self.ndim == 2 and np.all(self.dtypes == np.dtype(object)):

was changed to

if obj.ndim == 2 and np.all(obj.dtypes == np.dtype(object)):

does reverting this change restore the old behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the regression is from #33084, but since the error message is about DataFrame columns, I don't think the above should have been changed either. @jbrockmendel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer that this happens in the internal method itself (called on L6861)

As @sanderland notes, the check for all object dtype indeed happens here before calling into the internal method, so the check for empty thus also needs to happen here

However, @sanderland I think the df.empty check is not fully correct. Because this was also give True if you have columns but no rows. And for example an empty dataframe with a datetime64[ns] column will give a timedelta64[ns] column as result. That's something we should keep.

So I think we should explicitly check for no columns / no rows (alternatively could also add the check to the offending if obj.ndim == 2 and np.all(obj.dtypes == np.dtype(object)):, eg .. and obj.shape[0])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanderland can you try @jorisvandenbossche suggestion here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, @sanderland I think the df.empty check is not fully correct. Because this was also give True if you have columns but no rows. And for example an empty dataframe with a datetime64[ns] column will give a timedelta64[ns] column as result. That's something we should keep.

I tried this in 1.0.x and it does not return a timedelta dtype, I dont see why interpolate would do this either. Could you suggest a test which fails with my approach but passes in 1.0.x?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you're fully correct. I might have been mixing up the review of two PRs, as I was also reviewing a regression for diff() and maybe was testing that method here as well .. ;) (since df.diff() for datetime64 gives timedelta64, also for empty dataframe)

Forget my comment!

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 6, 2020
@sanderland sanderland requested a review from jreback August 7, 2020 08:46
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Aug 7, 2020
sanderland and others added 3 commits August 7, 2020 21:36
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
@jorisvandenbossche jorisvandenbossche changed the title Fix interpolation on empty dataframe REGR: Fix interpolation on empty dataframe Aug 13, 2020
@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Aug 13, 2020
@jorisvandenbossche
Copy link
Member

@sanderland can you merge master once more to see if that fixes CI?

@simonjayhawkins
Copy link
Member

test failure was

=========================== short test summary info ===========================
FAILED pandas/tests/computation/test_eval.py::TestAlignment::test_basic_series_frame_alignment[numexpr-python]
= 1 failed, 72824 passed, 1746 skipped, 1055 xfailed, 58 warnings in 1102.57s (0:18:22) =

probably unrelated.

restarted (azure will patch against master)

@simonjayhawkins simonjayhawkins merged commit 0abfc7e into pandas-dev:master Aug 17, 2020
@simonjayhawkins
Copy link
Member

Thanks @sanderland

@simonjayhawkins
Copy link
Member

@meeseeksdev backport 1.1.x

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonjayhawkins in the future pls don't merge these so fast

this is not what we wnat here

eg this should actually return self.copy()

not to mention this is not the correct place for this

simonjayhawkins added a commit that referenced this pull request Aug 17, 2020
simonjayhawkins pushed a commit that referenced this pull request Aug 17, 2020
Co-authored-by: sanderland <48946947+sanderland@users.noreply.github.com>
@jorisvandenbossche
Copy link
Member

I think Simon's call to merge this was appropriate (all the existing review comments were addressed, it was reviewed by several core devs). Yes, you noticed an additional problem (the missing copy) after merge: no problem, that can always happen, and we can do a follow-up to fix this.

not to mention this is not the correct place for this

See my comment at #35543 (comment) (and @sanderland's own answer above that), your suggestion to move it down into the internals is not possible, so this function is a correct place to put the check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: interpolate gives a TypeError on empty dataframes
6 participants