Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: ensure DataFrame.select_dtypes() returns a copy #48176

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 20, 2022

This was caused by the performance improvement in #42611, because _mgr._get_data_subset doesn't return a copy in contrast to indexing columns with iloc. This just adds a copy option to _get_data_subset, to keep all other improvements from #42611 (the performance obviously decreases again because of the additional copy, but it's still faster than before #42611)

Copy link

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the fix @jorisvandenbossche !

@jreback
Copy link
Contributor

jreback commented Aug 20, 2022

the original patch was for performance

this preserves?

@jorisvandenbossche
Copy link
Member Author

@jreback see the explanation in the top post

This preserves part of the performance improvement. But the improvement of #42611 "too big" since it also removed a copy of the data that it shouldn't have.

@jreback
Copy link
Contributor

jreback commented Aug 20, 2022

can u show the asv results

@jorisvandenbossche
Copy link
Member Author

Using the same code snippet as in #42611, I got:

In [2]: %timeit self.time_select_dtype_string_exclude(dtype)
1.64 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- previous code before perf improvement
38.5 µs ± 279 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  <-- main
740 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- this PR

The slowdown compared to main is fully due to the copy. All the other improvements from #42611 still give a 2x improvement compared to before.

@jreback
Copy link
Contributor

jreback commented Aug 20, 2022

great ty

this actually somewhat answers the question of how CoW would impact (asked in another issue)

@jbrockmendel
Copy link
Member

if we do end up adding copy=True keywords where relevant, could do the same here eventually

@@ -4706,7 +4706,7 @@ def predicate(arr: ArrayLike) -> bool:

return True

mgr = self._mgr._get_data_subset(predicate)
mgr = self._mgr._get_data_subset(predicate, copy=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding a keyword in the Manager method could we just do a .copy after _get_data_subset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. I was originally thinking to add it here to ensure that the manager can be smarter to know what needs to be copied and what not. But since the predicate is based on the block dtype, this can never split any blocks, so I suppose there will never be such a case where the subset might already be a copy. So yes, doing a .copy() here is indeed much simpler.

There is one difference though: combine (used by _get_data_subset) doesn't do consolidation, while copy() does. So if your original dataframe is not fully consolidated, the current PR keeps more of the performance improvement of #42611. Of course, at the cost of an unconsolidated return value and a potential later consolidation.

Either way is fine for me. Doing a simpler copy() will also keep it closer to the previous behaviour since iloc also consolidates I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do the copy outside of internals

@simonjayhawkins
Copy link
Member

all green, I think all comments addressed. will merge later today if no objections.

@simonjayhawkins
Copy link
Member

will merge later today

b4 1.5.0rc0

@simonjayhawkins simonjayhawkins merged commit f6e9b1a into pandas-dev:main Aug 23, 2022
@lumberbot-app

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.select_dtypes returns reference to original DataFrame
5 participants