
What's new in 1.5.0 (September 19, 2022)

These are the changes in pandas 1.5.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements

pandas-stubs

The pandas-stubs library is now supported by the pandas development team, providing type stubs for the pandas API. Please visit https://github.com/pandas-dev/pandas-stubs for more information.

We thank VirtusLab and Microsoft for their initial, significant contributions to pandas-stubs.
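
As a hypothetical illustration (not part of the release notes), installing the stubs lets a static type checker such as mypy validate pandas-using code:

.. code-block:: python

    # After ``pip install pandas-stubs``, mypy checks the calls and
    # annotations in a snippet like this against the stubs.
    import pandas as pd

    def total(ser: pd.Series) -> float:
        # the checker verifies the argument and return types
        return float(ser.sum())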

Native PyArrow-backed ExtensionArray

With pyarrow installed, users can now create pandas objects that are backed by a pyarrow.ChunkedArray and pyarrow.DataType.

The dtype argument can accept a string of a pyarrow data type with [pyarrow] appended, e.g. "int64[pyarrow]", or, for pyarrow data types that take parameters, an :class:`ArrowDtype` initialized with a pyarrow.DataType.

.. ipython:: python

    import pyarrow as pa
    ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")
    ser_float

    list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))
    ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)
    ser_list

    ser_list.take([1, 0])
    ser_float * 5
    ser_float.mean()
    ser_float.dropna()

Most operations are supported and have been implemented using pyarrow compute functions. We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.

.. warning::

    This feature is experimental, and the API can change in a future release without warning.

DataFrame interchange protocol implementation

pandas now implements the DataFrame interchange API spec. See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html.

The protocol consists of two parts:

  • New method :meth:`DataFrame.__dataframe__` which produces the interchange object. It effectively "exports" the pandas dataframe as an interchange object so any other library which has the protocol implemented can "import" that dataframe without knowing anything about the producer except that it makes an interchange object.
  • New function :func:`pandas.api.interchange.from_dataframe` which can take an arbitrary interchange object from any conformant library and construct a pandas DataFrame out of it.
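
As a minimal sketch (not part of the release notes), a round trip through the protocol, using pandas as both producer and consumer, looks like this:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

    # "export" the DataFrame as an interchange object
    interchange_object = df.__dataframe__()

    # any conformant consumer -- here pandas itself -- can "import" it
    roundtripped = pd.api.interchange.from_dataframe(interchange_object)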

Styler

The most notable development is the new method :meth:`.Styler.concat`, which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts (:issue:`43875`, :issue:`46186`)

Additionally there is an alternative output method :meth:`.Styler.to_string`, which allows using the Styler's formatting methods to create, for example, CSVs (:issue:`44502`).

A new feature :meth:`.Styler.relabel_index` is also made available to provide full customisation of the display of index or column headers (:issue:`47864`).
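
For illustration (a hedged sketch, not from the release notes; the data and labels are arbitrary), a totals footer and relabelled headers can be produced like this:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"sales": [3, 1, 2]}, index=["a", "b", "c"])

    # Append a computed "sum" footer; the concatenated Styler must be
    # built from data with the same columns as the original.
    styled = df.style.concat(df.agg(["sum"]).style)

    # Customise how the index is displayed without changing the data
    relabelled = df.style.relabel_index(["first", "second", "third"])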

Minor feature improvements are:

Control of index with group_keys in :meth:`DataFrame.resample`

The argument group_keys has been added to the method :meth:`DataFrame.resample`. As with :meth:`DataFrame.groupby`, this argument controls whether each group is added to the index in the resample when :meth:`.Resampler.apply` is used.

.. warning::

    Not specifying the group_keys argument will retain the previous behavior and emit a warning if the result will change by specifying group_keys=False. In a future version of pandas, not specifying group_keys will default to the same behavior as group_keys=False.

.. ipython:: python

    df = pd.DataFrame(
        {'a': range(6)},
        index=pd.date_range("2021-01-01", periods=6, freq="8H")
    )
    df.resample("D", group_keys=True).apply(lambda x: x)
    df.resample("D", group_keys=False).apply(lambda x: x)

Previously, the resulting index would depend upon the values returned by apply, as seen in the following example.

.. code-block:: ipython

    In [1]: # pandas 1.3
    In [2]: df.resample("D").apply(lambda x: x)
    Out[2]:
                         a
    2021-01-01 00:00:00  0
    2021-01-01 08:00:00  1
    2021-01-01 16:00:00  2
    2021-01-02 00:00:00  3
    2021-01-02 08:00:00  4
    2021-01-02 16:00:00  5

    In [3]: df.resample("D").apply(lambda x: x.reset_index())
    Out[3]:
                               index  a
    2021-01-01 0 2021-01-01 00:00:00  0
               1 2021-01-01 08:00:00  1
               2 2021-01-01 16:00:00  2
    2021-01-02 0 2021-01-02 00:00:00  3
               1 2021-01-02 08:00:00  4
               2 2021-01-02 16:00:00  5

from_dummies

Added new function :func:`~pandas.from_dummies` to convert a dummy coded :class:`DataFrame` into a categorical :class:`DataFrame`.

.. ipython:: python

    import pandas as pd

    df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
                       "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
                       "col2_c": [0, 0, 1]})

    pd.from_dummies(df, sep="_")

Writing to ORC files

The new method :meth:`DataFrame.to_orc` allows writing to ORC files (:issue:`43864`).

This functionality depends on the pyarrow library. For more details, see :ref:`the IO docs on ORC <io.orc>`.

.. code-block:: python

    df = pd.DataFrame(data={"col1": [1, 2], "col2": [3, 4]})
    df.to_orc("./out.orc")
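
The file written above can be read back with :func:`read_orc` (a minimal sketch; like writing, reading ORC requires pyarrow):

.. code-block:: python

    result = pd.read_orc("./out.orc")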

Reading directly from TAR archives

I/O methods like :func:`read_csv` or :meth:`DataFrame.to_json` now allow reading and writing directly on TAR archives (:issue:`44787`).

df = pd.read_csv("./movement.tar.gz")
# ...
df.to_csv("./out.tar.gz")

This supports .tar, .tar.gz, .tar.bz2, and .tar.xz archives. The compression method is inferred from the filename. If the compression method cannot be inferred, use the compression argument:

.. code-block:: python

    df = pd.read_csv(some_file_obj, compression={"method": "tar", "mode": "r:gz"})  # noqa: F821

(mode being one of tarfile.open's modes: https://docs.python.org/3/library/tarfile.html#tarfile.open)

read_xml now supports dtype, converters, and parse_dates

Similar to other IO methods, :func:`pandas.read_xml` now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (:issue:`43567`).

.. ipython:: python

    xml_dates = """<?xml version='1.0' encoding='utf-8'?>
    <data>
      <row>
        <shape>square</shape>
        <degrees>00360</degrees>
        <sides>4.0</sides>
        <date>2020-01-01</date>
       </row>
      <row>
        <shape>circle</shape>
        <degrees>00360</degrees>
        <sides/>
        <date>2021-01-01</date>
      </row>
      <row>
        <shape>triangle</shape>
        <degrees>00180</degrees>
        <sides>3.0</sides>
        <date>2022-01-01</date>
      </row>
    </data>"""

    df = pd.read_xml(
        xml_dates,
        dtype={'sides': 'Int64'},
        converters={'degrees': str},
        parse_dates=['date']
    )
    df
    df.dtypes


read_xml now supports large XML using iterparse

For very large XML files, ranging from hundreds of megabytes to gigabytes, :func:`pandas.read_xml` now supports parsing such sizeable files using lxml's iterparse and etree's iterparse, which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes without holding the entire tree in memory (:issue:`45442`).

.. code-block:: ipython

    In [1]: df = pd.read_xml(
       ...:     "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
       ...:     iterparse={"page": ["title", "ns", "id"]}
       ...: )

    In [2]: df
    Out[2]:
                                                         title   ns        id
    0                                       Gettysburg Address    0     21450
    1                                                Main Page    0     42950
    2                            Declaration by United Nations    0      8435
    3             Constitution of the United States of America    0      8435
    4                     Declaration of Independence (Israel)    0     17858
    ...                                                    ...  ...       ...
    3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
    3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
    3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
    3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
    3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

    [3578765 rows x 3 columns]

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Using dropna=True with groupby transforms

A transform is an operation whose result has the same size as its input. When the result is a :class:`DataFrame` or :class:`Series`, it is also required that the index of the result matches that of the input. In pandas 1.4, using :meth:`.DataFrameGroupBy.transform` or :meth:`.SeriesGroupBy.transform` with null values in the groups and dropna=True gave incorrect results. As demonstrated by the examples below, the incorrect results either contained incorrect values or did not have the same index as the input.

.. ipython:: python

    df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

Old behavior:

.. code-block:: ipython

    In [3]: # Value in the last row should be np.nan
       ...: df.groupby('a', dropna=True).transform('sum')
    Out[3]:
       b
    0  5
    1  5
    2  5

    In [4]: # Should have one additional row with the value np.nan
       ...: df.groupby('a', dropna=True).transform(lambda x: x.sum())
    Out[4]:
       b
    0  5
    1  5

    In [5]: # The value in the last row is np.nan interpreted as an integer
       ...: df.groupby('a', dropna=True).transform('ffill')
    Out[5]:
                         b
    0                    2
    1                    3
    2 -9223372036854775808

    In [6]: # Should have one additional row with the value np.nan
       ...: df.groupby('a', dropna=True).transform(lambda x: x)
    Out[6]:
       b
    0  2
    1  3

New behavior:

.. ipython:: python

    df.groupby('a', dropna=True).transform('sum')
    df.groupby('a', dropna=True).transform(lambda x: x.sum())
    df.groupby('a', dropna=True).transform('ffill')
    df.groupby('a', dropna=True).transform(lambda x: x)

Serializing tz-naive Timestamps with to_json() using iso_dates=True

:meth:`DataFrame.to_json`, :meth:`Series.to_json`, and :meth:`Index.to_json` would incorrectly localize DatetimeArrays/DatetimeIndexes with tz-naive Timestamps to UTC. (:issue:`38760`)

Note that this patch does not fix the localization of tz-aware Timestamps to UTC upon serialization. (Related issue :issue:`12997`)

Old Behavior

.. ipython:: python

    index = pd.date_range(
        start='2020-12-28 00:00:00',
        end='2020-12-28 02:00:00',
        freq='1H',
    )
    a = pd.Series(
        data=range(3),
        index=index,
    )

.. code-block:: ipython

    In [4]: a.to_json(date_format='iso')
    Out[4]: '{"2020-12-28T00:00:00.000Z":0,"2020-12-28T01:00:00.000Z":1,"2020-12-28T02:00:00.000Z":2}'

    In [5]: pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index
    Out[5]: array([False, False, False])

New Behavior

.. ipython:: python

    a.to_json(date_format='iso')
    # Roundtripping now works
    pd.read_json(a.to_json(date_format='iso'), typ="series").index == a.index

DataFrameGroupBy.value_counts with non-grouping categorical columns and observed=True

Calling :meth:`.DataFrameGroupBy.value_counts` with observed=True would incorrectly drop non-observed categories of non-grouping columns (:issue:`46357`).

.. code-block:: ipython

    In [6]: df = pd.DataFrame(["a", "b", "c"], dtype="category").iloc[0:2]

    In [7]: df
    Out[7]:
       0
    0  a
    1  b

Old Behavior

.. code-block:: ipython

    In [8]: df.groupby(level=0, observed=True).value_counts()
    Out[8]:
    0  a    1
    1  b    1
    dtype: int64

New Behavior

.. code-block:: ipython

    In [9]: df.groupby(level=0, observed=True).value_counts()
    Out[9]:
    0  a    1
    1  a    0
       b    1
    0  b    0
       c    0
    1  c    0
    dtype: int64

Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

================ =============== ======== =======
Package          Minimum Version Required Changed
================ =============== ======== =======
numpy            1.20.3          X        X
mypy (dev)       0.971                    X
beautifulsoup4   4.9.3                    X
blosc            1.21.0                   X
bottleneck       1.3.2                    X
fsspec           2021.07.0                X
hypothesis       6.13.0                   X
gcsfs            2021.07.0                X
jinja2           3.0.0                    X
lxml             4.6.3                    X
numba            0.53.1                   X
numexpr          2.7.3                    X
openpyxl         3.0.7                    X
pandas-gbq       0.15.0                   X
psycopg2         2.8.6                    X
pymysql          1.0.2                    X
pyreadstat       1.1.2                    X
pyxlsb           1.0.8                    X
s3fs             2021.08.0                X
scipy            1.7.1                    X
sqlalchemy       1.4.16                   X
tabulate         0.8.9                    X
xarray           0.19.0                   X
xlsxwriter       1.4.3                    X
================ =============== ======== =======

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

================ =============== =======
Package          Minimum Version Changed
================ =============== =======
beautifulsoup4   4.9.3           X
blosc            1.21.0          X
bottleneck       1.3.2           X
brotlipy         0.7.0
fastparquet      0.4.0
fsspec           2021.08.0       X
html5lib         1.1
hypothesis       6.13.0          X
gcsfs            2021.08.0       X
jinja2           3.0.0           X
lxml             4.6.3           X
matplotlib       3.3.2
numba            0.53.1          X
numexpr          2.7.3           X
odfpy            1.4.1
openpyxl         3.0.7           X
pandas-gbq       0.15.0          X
psycopg2         2.8.6           X
pyarrow          1.0.1
pymysql          1.0.2           X
pyreadstat       1.1.2           X
pytables         3.6.1
python-snappy    0.6.0
pyxlsb           1.0.8           X
s3fs             2021.08.0       X
scipy            1.7.1           X
sqlalchemy       1.4.16          X
tabulate         0.8.9           X
tzdata           2022a
xarray           0.19.0          X
xlrd             2.0.1
xlsxwriter       1.4.3           X
xlwt             1.3.0
zstandard        0.15.2
================ =============== =======

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes

Deprecations

.. warning::

    In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation, such as making the standard library zoneinfo the default timezone implementation instead of pytz, having the :class:`Index` support all data types instead of having multiple subclasses (:class:`CategoricalIndex`, :class:`Int64Index`, etc.), and more. The changes under consideration are logged in this GitHub issue, and any feedback or concerns are welcome.

Label-based integer slicing on a Series with an Int64Index or RangeIndex

In a future version, integer slicing on a :class:`Series` with an :class:`Int64Index` or :class:`RangeIndex` will be treated as label-based, not positional. This will make the behavior consistent with other :meth:`Series.__getitem__` and :meth:`Series.__setitem__` behaviors (:issue:`45162`).

For example:

.. ipython:: python

   ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

In the old behavior, ser[2:4] treats the slice as positional:

Old behavior:

.. code-block:: ipython

    In [3]: ser[2:4]
    Out[3]:
    5    3
    7    4
    dtype: int64

In a future version, this will be treated as label-based:

Future behavior:

.. code-block:: ipython

    In [4]: ser.loc[2:4]
    Out[4]:
    2    1
    3    2
    dtype: int64

To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].
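
With the ser defined above, the two spellings make the intent explicit (a small illustrative sketch):

.. code-block:: python

    ser.iloc[2:4]  # positional: the 3rd and 4th elements (values 3 and 4)
    ser.loc[2:4]   # label-based: labels 2 through 4 (values 1 and 2)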

Slicing on a :class:`DataFrame` will not be affected.

ExcelWriter attributes

All attributes of :class:`ExcelWriter` were previously documented as not public. However, some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use, e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets, and vice versa. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used (:issue:`45572`).

The following attributes are now public and considered safe to access.

  • book
  • check_extension
  • close
  • date_format
  • datetime_format
  • engine
  • if_sheet_exists
  • sheets
  • supported_extensions

The following attributes have been deprecated; they now raise a FutureWarning when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe and can lead to unexpected results.

  • cur_sheet
  • handles
  • path
  • save
  • write_cells

See the documentation of :class:`ExcelWriter` for further details.
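
As an illustrative sketch (the file name is arbitrary, and an Excel engine such as openpyxl must be installed), the public attributes can be accessed while writing:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2]})

    with pd.ExcelWriter("out.xlsx") as writer:
        df.to_excel(writer, sheet_name="data")
        workbook = writer.book   # the engine's workbook object, now public
        sheets = writer.sheets   # mapping of sheet name -> sheet object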

Using group_keys with transformers in :meth:`.GroupBy.apply`

In previous versions of pandas, if it was inferred that the function passed to :meth:`.GroupBy.apply` was a transformer (i.e. the resulting index was equal to the input index), the group_keys argument of :meth:`DataFrame.groupby` and :meth:`Series.groupby` was ignored and the group keys would never be added to the index of the result. In the future, the group keys will be added to the index when the user specifies group_keys=True.

As group_keys=True is the default value of the group_keys argument in :meth:`DataFrame.groupby` and :meth:`Series.groupby`, not specifying group_keys with a transformer will raise a FutureWarning. This can be silenced and the previous behavior retained by specifying group_keys=False.
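
A small illustrative sketch (not from the release notes) showing both options with an identity transformer:

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

    # Opt in to the future behavior: group keys are added to the result's index
    df.groupby("a", group_keys=True).apply(lambda x: x)

    # Retain the previous behavior and silence the FutureWarning
    df.groupby("a", group_keys=False).apply(lambda x: x)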

Inplace operation when setting values with loc and iloc

Most of the time setting values with :meth:`DataFrame.iloc` attempts to set values inplace, only falling back to inserting a new array if necessary. There are some cases where this rule is not followed, for example when setting an entire column from an array with different dtype:

.. ipython:: python

   df = pd.DataFrame({'price': [11.1, 12.2]}, index=['book1', 'book2'])
   original_prices = df['price']
   new_prices = np.array([98, 99])

Old behavior:

.. code-block:: ipython

    In [3]: df.iloc[:, 0] = new_prices

    In [4]: df.iloc[:, 0]
    Out[4]:
    book1    98
    book2    99
    Name: price, dtype: int64

    In [5]: original_prices
    Out[5]:
    book1    11.1
    book2    12.2
    Name: price, dtype: float64

This behavior is deprecated. In a future version, setting an entire column with iloc will attempt to operate inplace.

Future behavior:

.. code-block:: ipython

    In [3]: df.iloc[:, 0] = new_prices

    In [4]: df.iloc[:, 0]
    Out[4]:
    book1    98.0
    book2    99.0
    Name: price, dtype: float64

    In [5]: original_prices
    Out[5]:
    book1    98.0
    book2    99.0
    Name: price, dtype: float64

To get the old behavior, use :meth:`DataFrame.__setitem__` directly:

.. code-block:: ipython

    In [3]: df[df.columns[0]] = new_prices

    In [4]: df.iloc[:, 0]
    Out[4]:
    book1    98
    book2    99
    Name: price, dtype: int64

    In [5]: original_prices
    Out[5]:
    book1    11.1
    book2    12.2
    Name: price, dtype: float64

To get the old behavior when df.columns is not unique and you want to change a single column by index, you can use :meth:`DataFrame.isetitem`, which has been added in pandas 1.5:

.. code-block:: ipython

    In [3]: df_with_duplicated_cols = pd.concat([df, df], axis='columns')

    In [4]: df_with_duplicated_cols.isetitem(0, new_prices)

    In [5]: df_with_duplicated_cols.iloc[:, 0]
    Out[5]:
    book1    98
    book2    99
    Name: price, dtype: int64

    In [6]: original_prices
    Out[6]:
    book1    11.1
    book2    12.2
    Name: 0, dtype: float64

numeric_only default value

Across the :class:`DataFrame`, :class:`.DataFrameGroupBy`, and :class:`.Resampler` operations such as min, sum, and idxmax, the default value of the numeric_only argument, if it exists at all, was inconsistent. Furthermore, operations with the default value None can lead to surprising results. (:issue:`46560`)

.. code-block:: ipython

    In [1]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

    In [2]: # Reading the next line without knowing the contents of df, one would
       ...: # expect the result to contain the products for both columns a and b.
       ...: df[["a", "b"]].prod()
    Out[2]:
    a    2
    dtype: int64

To avoid this behavior, specifying the value numeric_only=None has been deprecated and will be removed in a future version of pandas. In the future, all operations with a numeric_only argument will default to False. Users should either call the operation only with columns that can be operated on, or specify numeric_only=True to operate only on Boolean, integer, and float columns.
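
Both supported ways of getting the future behavior, with the df defined above (an illustrative sketch, not from the release notes):

.. code-block:: python

    df[["a"]].prod()            # operate only on columns the operation supports
    df.prod(numeric_only=True)  # or restrict to Boolean, integer, and float columns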

In order to support the transition to the new behavior, the following methods have gained the numeric_only argument.

Other Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

  • Bug in :func:`astype_nansafe` where astype("timedelta64[ns]") would fail when np.nan was included (:issue:`45798`)
  • Bug in constructing a :class:`Timedelta` with a np.timedelta64 object and a unit sometimes silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta (:issue:`46827`)
  • Bug in constructing a :class:`Timedelta` from a large integer or float with unit="W" silently overflowing and returning incorrect results instead of raising OutOfBoundsTimedelta (:issue:`47268`)
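
For the last bullet above, a hedged sketch of the now-raising construction (the exact value is arbitrary, chosen to be far beyond the representable range):

.. code-block:: python

    import pandas as pd

    try:
        pd.Timedelta(10**10, unit="W")  # silently overflowed before 1.5
    except pd.errors.OutOfBoundsTimedelta:
        pass  # now raises instead of returning an incorrect result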

Time Zones

Numeric

  • Bug in operations with array-likes with dtype="boolean" and :attr:`NA` incorrectly altering the array in-place (:issue:`45421`)
  • Bug in arithmetic operations with nullable types without :attr:`NA` values not matching the same operation with non-nullable types (:issue:`48223`)
  • Bug in floordiv when dividing by IntegerDtype 0 would return 0 instead of inf (:issue:`48223`)
  • Bug in division, pow and mod operations on array-likes with dtype="boolean" not behaving like their np.bool_ counterparts (:issue:`46063`)
  • Bug in multiplying a :class:`Series` with IntegerDtype or FloatingDtype by an array-like with timedelta64[ns] dtype incorrectly raising (:issue:`45622`)
  • Bug in :meth:`mean` where the optional dependency bottleneck caused precision loss linear in the length of the array; bottleneck has been disabled for :meth:`mean`, improving the loss to log-linear, but this may result in a performance decrease (:issue:`42878`)
  • Bug in :func:`factorize` that would convert the value None to np.nan (:issue:`46601`)

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Styler

Metadata

Other

Contributors