API: read_stata with index_col=None should return RangeIndex #49745

Merged
merged 3 commits into from Nov 18, 2022

@topper-123 topper-123 commented Nov 17, 2022

If read_stata was called with index_col=None, the constructed DataFrame was given an index built from np.arange, i.e. (before pandas 2.0) an Int64Index.

np.arange returns dtype np.int_, which is like np.intp except that it is always 32-bit on Windows. That makes it awkward in tests once indexes can hold all numpy numeric dtypes (as after #49560), so I'm looking into how arange is used in #49560. One place I found it is read_stata, where a range is the better choice: read_stata(index_col=None) then gives a RangeIndex instead of an Index[int_].
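
To illustrate the difference described above, here is a minimal sketch that builds an index from np.arange and from a plain range. It is not the read_stata code itself; the names `n`, `idx_arange` and `idx_range` are made up for this example.

```python
import numpy as np
import pandas as pd

n = 1000  # illustrative row count, not taken from the PR

# Built from np.arange: the labels are materialised as an integer array
# whose dtype follows np.int_ on the current platform.
idx_arange = pd.Index(np.arange(n))

# Built from range: pandas stores only start/stop/step as a RangeIndex.
idx_range = pd.Index(range(n))

print(type(idx_arange).__name__, idx_arange.dtype, idx_arange.nbytes)
print(type(idx_range).__name__, idx_range.dtype, idx_range.nbytes)

# Both indexes carry the same labels.
assert list(idx_arange) == list(idx_range)
```

The labels are identical either way; only the index class, and with np.arange possibly the platform-dependent dtype, differ.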

This is a slight change in API, so I have split it out into its own PR here, so that #49560, which is a large PR, can stay as focused as possible.
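
For anyone who wants to check the user-visible effect, here is a small round-trip sketch. It assumes an in-memory buffer is acceptable in place of a real .dta file, and the frame contents and variable names are invented for the example.

```python
import io

import pandas as pd

# A tiny frame to round-trip; the column names are arbitrary.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Write Stata data to an in-memory buffer, without storing the index.
buf = io.BytesIO()
df.to_stata(buf, write_index=False)
buf.seek(0)

# index_col is left at its default (None), so pandas builds the index itself.
result = pd.read_stata(buf)

print(type(result.index).__name__)
print(result)
```

With this change applied (pandas 2.0 or later), the first print should report RangeIndex; on older versions it would report Int64Index. The labels 0, 1, 2 are the same in both cases.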

@mroeschke mroeschke added the "IO Stata" (read_stata, to_stata) and "Index" (Related to the Index class or subclasses) labels Nov 17, 2022
@@ -340,6 +340,7 @@ Other API changes
- Passing strings that cannot be parsed as datetimes to :class:`Series` or :class:`DataFrame` with ``dtype="datetime64[ns]"`` will raise instead of silently ignoring the keyword and returning ``object`` dtype (:issue:`24435`)
- Passing a sequence containing a type that cannot be converted to :class:`Timedelta` to :func:`to_timedelta` or to the :class:`Series` or :class:`DataFrame` constructor with ``dtype="timedelta64[ns]"`` or to :class:`TimedeltaIndex` now raises ``TypeError`` instead of ``ValueError`` (:issue:`49525`)
- Changed behavior of :class:`Index` constructor with sequence containing at least one ``NaT`` and everything else either ``None`` or ``NaN`` to infer ``datetime64[ns]`` dtype instead of ``object``, matching :class:`Series` behavior (:issue:`49340`)
- If no parameter ``index_col`` is given to :func:`read_stata`, the index will be a :class:`RangeIndex` Previously the index would have been a :class:`Int64Index` (:issue:`49745`)
Member

I think the Performance improvement note can be used instead of this one

@@ -594,6 +595,7 @@ Performance improvements
- Memory improvement in :meth:`RangeIndex.sort_values` (:issue:`48801`)
- Performance improvement in :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` when ``by`` is a categorical type and ``sort=False`` (:issue:`48976`)
- Performance improvement in :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` when ``by`` is a categorical type and ``observed=False`` (:issue:`49596`)
- Performance improvement in :func:`read_stata` with parameter ``index_col`` set to ``None``(the default). Now the index will be a :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49745`)
Member

I think the docbuild is complaining about this line

@topper-123
Contributor Author

Updated.

@mroeschke mroeschke added this to the 2.0 milestone Nov 18, 2022
@mroeschke mroeschke merged commit c37dfc1 into pandas-dev:main Nov 18, 2022
@mroeschke
Member

Thanks @topper-123

@topper-123 topper-123 deleted the read_stata_index_col branch November 18, 2022 18:39
twoertwein pushed a commit to twoertwein/pandas that referenced this pull request Nov 20, 2022
…dev#49745)

* API: read_stata with index_col=None return RangeIndex

* fix comments

* fix comments II

Co-authored-by: Terji Petersen <terjipetersen@Terjis-MacBook-Air.local>
Co-authored-by: Terji Petersen <terjipetersen@Terjis-Air.fritz.box>
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
…dev#49745)

* API: read_stata with index_col=None return RangeIndex

* fix comments

* fix comments II

Co-authored-by: Terji Petersen <terjipetersen@Terjis-MacBook-Air.local>
Co-authored-by: Terji Petersen <terjipetersen@Terjis-Air.fritz.box>