REGR: ensure DataFrame.select_dtypes() returns a copy #48176

jorisvandenbossche · 2022-08-20T08:00:56Z

closes BUG: DataFrame.select_dtypes returns reference to original DataFrame #48090
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This regression was caused by the performance improvement in #42611: _mgr._get_data_subset doesn't return a copy, in contrast to indexing columns with iloc. This PR just adds a copy option to _get_data_subset, which keeps all other improvements from #42611. The performance obviously decreases again because of the additional copy, but it's still faster than before #42611.
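For illustration, the behaviour this PR restores can be checked with public API only. This is a minimal sketch, not the test added in the PR; the frame contents and variable names are made up for the example. On the affected releases, the write below could reportedly propagate to the original frame (see #48090).

```python
import pandas as pd

# Small example frame (contents are an assumption for this sketch)
df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]})

# Select the numeric columns; the result should not alias df's data
subset = df.select_dtypes(include="number")

# Write into the result in place
subset.iloc[0, 0] = 100

# With the fix, the original frame keeps its original value
original_first = int(df.loc[0, "a"])
print(original_first)
```

If the original frame also showed 100 here, the result would be acting as a view on the parent data rather than as a copy.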

villebro

LGTM, thanks for the fix @jorisvandenbossche !

jreback · 2022-08-20T11:45:16Z

the original patch was for performance

this preserves?

jorisvandenbossche · 2022-08-20T11:53:43Z

@jreback see the explanation in the top post

This preserves part of the performance improvement. But the improvement of #42611 was "too big", since it also removed a copy of the data that it shouldn't have.

jreback · 2022-08-20T11:54:48Z

can u show the asv results

jorisvandenbossche · 2022-08-20T12:03:48Z

Using the same code snippet as in #42611, I got:

In [2]: %timeit self.time_select_dtype_string_exclude(dtype)
1.64 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- previous code before perf improvement
38.5 µs ± 279 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  <-- main
740 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  <-- this PR

The slowdown compared to main is fully due to the copy. All the other improvements from #42611 still give a 2x improvement compared to before.
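As a quick check of the arithmetic, the ratios can be computed from the three mean timings reported above. The variable names are only for this sketch, and the values are copied from the %timeit output converted to microseconds.

```python
# Mean timings from the %timeit output above, in microseconds
before_us = 1640.0  # 1.64 ms, code before the perf improvement
main_us = 38.5      # main (no copy)
pr_us = 740.0       # this PR (copy restored)

# How much faster this PR is than the code before #42611
speedup_vs_before = before_us / pr_us

# How much slower this PR is than main, i.e. the cost of the copy
slowdown_vs_main = pr_us / main_us

print(round(speedup_vs_before, 1), round(slowdown_vs_main, 1))
```

So the PR is roughly 2.2x faster than the pre-#42611 code and roughly 19x slower than main on this benchmark.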

jreback · 2022-08-20T12:09:30Z

great ty

this actually somewhat answers the question of how CoW would impact (asked in another issue)

jbrockmendel · 2022-08-20T19:41:51Z

if we do end up adding copy=True keywords where relevant, could do the same here eventually

jbrockmendel · 2022-08-20T19:42:42Z

pandas/core/frame.py

@@ -4706,7 +4706,7 @@ def predicate(arr: ArrayLike) -> bool:

            return True

-        mgr = self._mgr._get_data_subset(predicate)
+        mgr = self._mgr._get_data_subset(predicate, copy=True)


instead of adding a keyword in the Manager method could we just do a .copy after _get_data_subset?

jorisvandenbossche · 2022-08-21

Yes, good point. I originally added the keyword here so that the manager could be smarter about what needs to be copied and what does not. But since the predicate is based on the block dtype, this can never split any blocks, so I suppose there will never be a case where the subset is already a copy. So yes, doing a .copy() here is indeed much simpler.

There is one difference though: combine (used by _get_data_subset) doesn't do consolidation, while copy() does. So if your original dataframe is not fully consolidated, the current PR keeps more of the performance improvement of #42611. Of course, at the cost of an unconsolidated return value and a potential later consolidation.

Either way is fine for me. Doing a simpler copy() will also keep it closer to the previous behaviour since iloc also consolidates I think.

jbrockmendel · 2022-08-22

let's do the copy outside of internals
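The pattern agreed on here (take the subset first, then copy at the call site) can be sketched with public API. This is only an illustration of the idea: the helper select_columns_then_copy is hypothetical, and the actual change operates on the internal block manager rather than on DataFrame.loc.

```python
import pandas as pd


def select_columns_then_copy(df, predicate):
    """Hypothetical helper: keep columns whose dtype satisfies predicate."""
    # Evaluate the predicate once per column dtype
    keep = [predicate(dtype) for dtype in df.dtypes]
    # Take the subset, then make the copy explicitly at the call site
    return df.loc[:, keep].copy()


df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})
floats = select_columns_then_copy(df, pd.api.types.is_float_dtype)

# Writing to the result leaves the original frame untouched
floats.iloc[0, 0] = -1.0
print(df.loc[0, "b"])
```

Keeping the copy in the caller means the selection helper itself needs no copy keyword, which is the simplification suggested in the review.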


simonjayhawkins · 2022-08-23T10:14:45Z

all green, I think all comments addressed. will merge later today if no objections.

simonjayhawkins · 2022-08-23T10:16:15Z

will merge later today

b4 1.5.0rc0


REGR: ensure DataFrame.select_dtypes() returns a copy

4cb5f7d

jorisvandenbossche added Regression Copy / view semantics labels Aug 20, 2022

jorisvandenbossche added this to the 1.4.4 milestone Aug 20, 2022

jorisvandenbossche mentioned this pull request Aug 20, 2022

BUG: DataFrame.select_dtypes returns reference to original DataFrame #48090

Closed

3 tasks

jorisvandenbossche requested a review from jbrockmendel August 20, 2022 08:11

fix type issues

8a8f6cf

villebro approved these changes Aug 20, 2022


jbrockmendel reviewed Aug 20, 2022


jorisvandenbossche added 4 commits August 23, 2022 07:08

Merge remote-tracking branch 'upstream/main' into regr-select-dtypes-view

0ab19f1

simplify copy

49e86fa

Merge remote-tracking branch 'upstream/main' into regr-select-dtypes-view

b44350a

fixup tests

f1d0502

simonjayhawkins merged commit f6e9b1a into pandas-dev:main Aug 23, 2022


lumberbot-app bot added the Still Needs Manual Backport label Aug 23, 2022

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Aug 23, 2022

Backport PR pandas-dev#48176: REGR: ensure DataFrame.select_dtypes() returns a copy

3193388

simonjayhawkins mentioned this pull request Aug 23, 2022

Backport PR #48176 on branch 1.4.x (REGR: ensure DataFrame.select_dtypes() returns a copy) #48219

Merged

simonjayhawkins removed the Still Needs Manual Backport label Aug 23, 2022

jorisvandenbossche deleted the regr-select-dtypes-view branch August 23, 2022 22:09

simonjayhawkins added a commit that referenced this pull request Aug 24, 2022

Backport PR #48176 on branch 1.4.x (REGR: ensure DataFrame.select_dtypes() returns a copy) (#48219)

3e938c2

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

REGR: ensure DataFrame.select_dtypes() returns a copy (pandas-dev#48176)

4193380
