Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid set_index(..., inplace=True) where not necessary #9472

Merged
merged 4 commits into from Sep 15, 2022

Conversation

jrbourbeau
Copy link
Member

This PR handles the inplace= keyword being deprecated in pandas set_index(...) calls.

Note, there is still discussion happening over in the pandas repo about whether or not to revert this deprecation (xref pandas-dev/pandas#48417). We should wait to merge this PR until a decision has been made there. cc @jorisvandenbossche for visibility

@@ -97,7 +97,7 @@ def test_orc_roundtrip(tmpdir, index, columns):
}
)
if index:
data.set_index(index, inplace=True)
data = data.set_index(index)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that here, and in other tests, our use of inplace=True isn't actually needed. All these pandas DataFrames are small, so I've just removed the inplace option altogether.

@jrbourbeau jrbourbeau mentioned this pull request Sep 13, 2022
7 tasks
@mroeschke
Copy link
Contributor

We'll be discussing this on Wendesday, Sept 14 during the pandas dev call to decide whether to revert this deprecation.

@jrbourbeau
Copy link
Member Author

Thanks for the heads up @mroeschke! Would you mind pinging me with the results of that decision after tomorrow's call?

@jorisvandenbossche
Copy link
Member

Note that here, and in other tests, our use of inplace=True isn't actually needed

Even if we decide to not yet deprecate that, the above is certainly true, and I think it would be good to already clean up part of those. I think you can basically keep all the changes in test files in this PR.

@mroeschke
Copy link
Contributor

It was decided to revert the set_index deprecations, but agreed with @jorisvandenbossche on keeping the changes.

@jrbourbeau jrbourbeau self-assigned this Sep 14, 2022
@jrbourbeau
Copy link
Member Author

Great, thanks for up the update! I'll roll back the changes strictly related to the reverted deprecation

@jrbourbeau jrbourbeau changed the title [DNM] Inplace set_index pandas=1.5 compatibility Avoid set_index(..., inplace=True) where not necessary Sep 14, 2022
@jrbourbeau
Copy link
Member Author

@ncclementi @ian-r-rose is there a chance one of you could take a look at the changes here? Should be a relatively straightforward review with the reduced scope of this PR

Copy link
Member

@ncclementi ncclementi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, had only one comment for learning purposes.

@@ -49,6 +49,7 @@ def __call__(self, parts):
)
if self.index:
_df.set_index(self.index, inplace=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an informative question, why do we keep the inplace=True in here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm keeping inplace=True in non-testing code to try to minimize data copies. To be clear, I'm not 100% sure whether or not copies would still be avoided if inplace=False (maybe @jorisvandenbossche or @mroeschke have more insight here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw this is definitely a fair question to ask. In general, I think avoiding inplace operations make things easier to reason about. Let's handle as needed in a follow-up PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at set_index in pandas, a copy of the entire DataFrame is avoided if inplace=True

(Part of the motivation of the original deprecation is that a copy can always be avoid since the index is only being modified)

@jrbourbeau jrbourbeau merged commit 803c7fd into dask:main Sep 15, 2022
@jrbourbeau jrbourbeau deleted the set_index_inplace branch September 15, 2022 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants