BUG: Index on empty frame should be RangeIndex #52404

Closed
3 tasks done
hagenw opened this issue Apr 4, 2023 · 14 comments · Fixed by #52426
Labels
Bug, Constructors, DataFrame
Milestone
2.0.1

Comments

@hagenw

hagenw commented Apr 4, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({}).columns

Issue Description

The above code returns

Index([], dtype='object')

Expected Behavior

But as stated in https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#empty-dataframes-series-will-now-default-to-have-a-rangeindex it should return instead:

RangeIndex(start=0, stop=0, step=1)

which it does for

pd.DataFrame().columns
pd.DataFrame(None).columns
pd.DataFrame([]).columns
pd.DataFrame(()).columns
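
The constructor variants above can be compared side by side. This is a minimal sketch, assuming pandas is installed; the `empty_inputs` and `columns_are_range` names are illustrative and not part of the report. The result for `DataFrame({})` depends on the pandas version (this report observed an object-dtype `Index` on 2.0.0), so no expected output is shown.

```python
import pandas as pd

# Empty inputs discussed in this issue, keyed by a label for printing.
empty_inputs = {
    "DataFrame()": pd.DataFrame(),
    "DataFrame(None)": pd.DataFrame(None),
    "DataFrame([])": pd.DataFrame([]),
    "DataFrame(())": pd.DataFrame(()),
    "DataFrame({})": pd.DataFrame({}),
}

# Record whether each frame's columns object is a RangeIndex.
columns_are_range = {
    label: isinstance(frame.columns, pd.RangeIndex)
    for label, frame in empty_inputs.items()
}

for label, is_range in columns_are_range.items():
    print(f"{label:16} RangeIndex columns: {is_range}")
```

Running this on different pandas releases shows which constructors are affected without relying on the printed `repr` of each index.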

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-144-generic
Version : #161~18.04.1-Ubuntu SMP Fri Feb 10 15:55:22 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.5.1
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

hagenw added the Bug and Needs Triage labels Apr 4, 2023
@hagenw
Author

hagenw commented Apr 4, 2023

When I try to reproduce it on main I get:

>>> import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/audeering.local/hwierstorf/git/pandas/pandas/_libs/__init__.py", line 16, in <module>
    import pandas._libs.pandas_parser  # noqa # isort: skip # type: ignore[reportUnusedImport]
ModuleNotFoundError: No module named 'pandas._libs.pandas_parser'

@hagenw
Author

hagenw commented Apr 4, 2023

I now followed https://pandas.pydata.org/docs/dev/development/contributing_environment.html#option-2-using-pip and was able to reproduce the issue on the main branch as well:

>>> pd.__version__
'2.1.0.dev0+409.g5a1f280647'
>>> pd.DataFrame({}).columns
Index([], dtype='object')

DeaMariaLeon added the DataFrame and Constructors labels and removed the Needs Triage label Apr 4, 2023
@zmwaris1

zmwaris1 commented Apr 4, 2023

Hi, I would like to work on this issue. Can you assign this to me and share the details?

@MarcoGorelli
Member

thanks @hagenw for the report! it looks like this works for some initialisations but not others:

In [8]: import pandas as pd
   ...: pd.DataFrame().axes
Out[8]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]

In [9]: import pandas as pd
   ...: pd.DataFrame([]).axes
Out[9]: [RangeIndex(start=0, stop=0, step=1), RangeIndex(start=0, stop=0, step=1)]

In [10]: import pandas as pd
    ...: pd.DataFrame({}).axes
Out[10]: [RangeIndex(start=0, stop=0, step=1), Index([], dtype='object')]

cc @topper-123

phofl added this to the 2.0.1 milestone Apr 4, 2023
@topper-123
Contributor

topper-123 commented Apr 4, 2023

This was intentional on my part when I made #49572.

@mroeschke asked in a comment:

Why does an empty dict not produce RangeIndexes?

I argued there that, for a dict d, Series(d) is most closely equivalent to Series(d.values(), index=d.keys()), which for an empty dict is equivalent to Series([], index=[]), i.e. it has an index with dtype object.

Also notice that a non-empty dict can never give a RangeIndex.

But IDK, maybe this just trips people up and it would be better to have empty dicts to give a RangeIndex?
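
The equivalence argued above can be sketched as follows. This is an illustration only, assuming pandas is installed; the example dict and the variable names are made up, and `dtype=object` is passed explicitly for the empty case so the snippet does not depend on the default dtype of an empty Series.

```python
import pandas as pd

d = {"a": 1, "b": 2}

# Building from the dict directly vs. from its values and keys.
from_dict = pd.Series(d)
from_parts = pd.Series(list(d.values()), index=list(d.keys()))
print(from_dict.equals(from_parts))

# The dict keys become the index labels, so this is not a RangeIndex.
print(type(from_dict.index).__name__)

# The empty-dict analogue of the second form: explicit empty values and index.
empty_index = pd.Series([], index=[], dtype=object).index
print(type(empty_index).__name__, empty_index.dtype)
```

Under this reading, an empty dict is the limiting case of "keys become labels", which is why the original implementation kept a plain `Index` there.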

@MarcoGorelli
Member

thanks for explaining! I think your explanation makes sense, personally I think it'd be fine to keep as-is

@mroeschke
Member

Would special casing be necessary to make DataFrame({}) produce a RangeIndex on both axes? If not, it might be better to forgo semantics and align with user expectation of "empty"

@topper-123
Contributor

It's very easy to change, so it's more a question of what we want. I can follow the thought that this can be a bit surprising.

@mroeschke
Member

I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis

@phofl
Member

phofl commented Apr 4, 2023

agree with @mroeschke

This is confusing for most users.

@topper-123
Contributor

I've made a PR about this.

@jorisvandenbossche
Member

I would support returning a RangeIndex on both axes to match user expectation and an opportunity to avoid introducing an object dtype axis

On the other hand, I personally found it confusing to get a RangeIndex for columns, and I actually want to avoid introducing an int64 axis for the columns (if you otherwise always use string column names, using object dtype for an empty columns object is closer to what you want than an int64)

Anyway, I don't necessarily object to the change (consistency with other variants of initialization also has its value), but just wanted to point out that "confusing" / "user expectation" depends quite a bit on your use case (as usual ;)).
Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings, and which follows from using pandas' Index([])).
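
For code that always uses string column names and wants an object-dtype columns axis on an empty frame regardless of the default, the columns can be passed explicitly. This is a hedged sketch, assuming pandas is installed; it is one possible workaround and not something proposed in this thread, and the variable names are illustrative.

```python
import pandas as pd

# Explicitly request object-dtype columns instead of relying on the default.
object_columns = pd.Index([], dtype=object)
df = pd.DataFrame(columns=object_columns)

print(type(df.columns).__name__, df.columns.dtype)
```

Passing the columns explicitly sidesteps the question of what the constructor should infer for empty input.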

@jorisvandenbossche
Member

Pyarrow will keep returning object dtype for empty columns (which I think makes most sense for pyarrow since our column names are always strings,

Small correction: if the pyarrow Table came from a pandas DataFrame roundtrip originally, we actually store in the pandas metadata the dtype of the columns object, and use that information to correctly "restore" the column names. We don't know that it was a RangeIndex though, so if using this information, it comes back as an empty Index[int64]. When there is no pandas metadata, then we will use empty object dtype Index.

@topper-123
Contributor

I agreed with you initially, but when I had to explain it, it sounded more complex than I expected. I could personally live with both, as I think they each have their advantages, and I now like the explanation "an empty axis on empty data is always a RangeIndex"...

kou pushed a commit to apache/arrow that referenced this issue Apr 12, 2023
…nge in pandas 2.0.1 (#35031)

### Rationale for this change

Pandas changed the default dtype of the columns object for an empty DataFrame from object dtype to integer RangeIndex (see pandas-dev/pandas#52404). This updates our tests to pass with that change.

* Closes: #15070

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>