Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: Indexes on empty frames/series should be RangeIndex #49637

Merged

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Nov 11, 2022

I've looked into #49572 and it seems that the fix is reasonable simple, see included code. In short, currently if the user hasn't supplied index or columns values, then the index/columns is a RangeIndex if the data has lenght > 0, while they're Index[object] if the length is 0. After this PR, a RangeIndex will be used in both cases. This will simplify type inference etc. testing etc.

However, this fix requires changes to a lot of tests (>500), of which I've fixed about 100 (not included in this version of the PR), so it's quite a lot of work to get this fixed up. Test fixing so far seems to be only a matter of ensuring that various empty dataframes/Series have RangeIndex rather than a Index[object], so relatively simple, but also tedious work.

I didn't get much response to #49572 from the core devs, so before I spend more time on this rather big item, could you guys chime in on if you agree that this a good thing to do? IMO having the same index dtypes for series/frame of length 0 and >0 will conceptually simplify pandas and decrease the amount of surprises that users will encounter.

I've included the code that fixes the issue in this PR, as mentioned, and ATM this PR fails, because I haven't fixed the tests, but I will get them fixed if I get positive response. If you can respond either here or in #49572, I'll of course read them both.

@topper-123
Copy link
Contributor Author

@pandas-dev/pandas-core any comment?

IMO, this simplify the construction logic nicely, so will be good to get in.

@MarcoGorelli MarcoGorelli self-requested a review November 13, 2022 08:34
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, just got a question about a comment (which may no longer be relevant, especially with a new major release on the horizon)

pandas/core/frame.py Show resolved Hide resolved
@bashtage
Copy link
Contributor

I think this is also a sensible change for 2.0.0.

@jreback
Copy link
Contributor

jreback commented Nov 13, 2022

this doesn't break any tests?

@topper-123
Copy link
Contributor Author

topper-123 commented Nov 13, 2022

@jreback , this break a lot of tests, as empty series/frames are in a lot of tests. So I wanted to know it if others approve of this general idea/change, before really going through the tests.

@mroeschke
Copy link
Member

I would be supportive of this change; IMO it would needs its own section in the 2.0 whatsnew

@mroeschke mroeschke added the Index Related to the Index class or subclasses label Nov 16, 2022
@jbrockmendel
Copy link
Member

closes #40077?

@topper-123 topper-123 force-pushed the zero-len_frames_series_rangeindex branch from f224dad to 4237e62 Compare November 27, 2022 03:27
@topper-123
Copy link
Contributor Author

Still need docs + some cleanup, otherwise is ready if passes the CI.

@topper-123
Copy link
Contributor Author

closes #40077?

Yes, this makes it so the index of the df is a RangeIndex, so the example in #40077 passes after this PR.

@topper-123
Copy link
Contributor Author

This is ready for review now.

@topper-123 topper-123 force-pushed the zero-len_frames_series_rangeindex branch from 9effc51 to 36f3630 Compare November 29, 2022 11:58
@topper-123
Copy link
Contributor Author

Updated.

@topper-123
Copy link
Contributor Author

Updated.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Just a clarification on the whatsnew note

@topper-123 topper-123 force-pushed the zero-len_frames_series_rangeindex branch from 3ccd284 to f6e5f0e Compare December 2, 2022 12:19
@topper-123
Copy link
Contributor Author

Rebased & updated.

@mroeschke mroeschke added this to the 2.0 milestone Dec 2, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Terji Petersen and others added 7 commits December 6, 2022 19:54
@topper-123 topper-123 force-pushed the zero-len_frames_series_rangeindex branch from ba9bc75 to 131dfef Compare December 6, 2022 19:55
@topper-123
Copy link
Contributor Author

Gentle ping. I've rebased (no code changes) just in case.

@mroeschke mroeschke merged commit e93ee07 into pandas-dev:main Dec 7, 2022
@mroeschke
Copy link
Member

Thanks @topper-123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses
Projects
None yet
6 participants