What's new in 2.1.0 (Aug 30, 2023)

These are the changes in pandas 2.1.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

Enhancements

PyArrow will become a required dependency with pandas 3.0

PyArrow will become a required dependency of pandas starting with pandas 3.0. This decision was made based on PDEP 10.

This will enable more changes that are hugely beneficial to pandas users, including but not limited to:

  • Inferring strings as PyArrow backed strings by default, enabling a significant reduction of the memory footprint and huge performance improvements.
  • Inferring more complex dtypes with PyArrow by default, like Decimal, lists, bytes, structured data and more.
  • Better interoperability with other libraries that depend on Apache Arrow.

We are collecting feedback on this decision here.

Avoid NumPy object dtype for strings by default

Previously, all strings were stored in columns with NumPy object dtype by default. This release introduces an option future.infer_string that infers all strings as PyArrow backed strings with dtype "string[pyarrow_numpy]" instead. This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator. Setting the option will also infer the dtype "string" as a :class:`StringDtype` with storage set to "pyarrow_numpy", ignoring the value behind the option mode.string_storage.

This option only works if PyArrow is installed. PyArrow backed strings have a significantly reduced memory footprint and provide a big performance improvement compared to NumPy object (:issue:`54430`).

The option can be enabled with:

.. code-block:: python

    pd.options.future.infer_string = True
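
A minimal sketch of the resulting inference, assuming PyArrow is installed:

.. code-block:: python

    import pandas as pd

    pd.options.future.infer_string = True
    ser = pd.Series(["a", "b", "c"])
    # With the option enabled, string data is inferred as
    # "string[pyarrow_numpy]" instead of NumPy object dtype.
    ser.dtype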

This behavior will become the default with pandas 3.0.

DataFrame reductions preserve extension dtypes

In previous versions of pandas, the results of DataFrame reductions (:meth:`DataFrame.sum`, :meth:`DataFrame.mean`, etc.) had NumPy dtypes, even when the DataFrames were of extension dtypes. pandas can now keep the dtypes when doing reductions over DataFrame columns with a common dtype (:issue:`52788`).

Old Behavior

.. code-block:: ipython

    In [1]: df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")
    In [2]: df.sum()
    Out[2]:
    a    5
    b    9
    dtype: int64
    In [3]: df = df.astype("int64[pyarrow]")
    In [4]: df.sum()
    Out[4]:
    a    5
    b    9
    dtype: int64

New Behavior

.. ipython:: python

    df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")
    df.sum()
    df = df.astype("int64[pyarrow]")
    df.sum()

Notice that the dtype is now a masked dtype and PyArrow dtype, respectively, while previously it was a NumPy integer dtype.

To allow DataFrame reductions to preserve extension dtypes, :meth:`.ExtensionArray._reduce` has gotten a new keyword parameter keepdims. Calling :meth:`.ExtensionArray._reduce` with keepdims=True should return an array of length 1 along the reduction axis. In order to maintain backward compatibility, the parameter is not required, but it will become required in the future. If the parameter is not found in the signature, DataFrame reductions cannot preserve extension dtypes. Also, if the parameter is not found, a FutureWarning will be emitted and type checkers like mypy may complain about the signature not being compatible with :meth:`.ExtensionArray._reduce`.
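
As a minimal sketch of the contract, calling the private :meth:`.ExtensionArray._reduce` directly on a built-in masked array (for illustration only; third-party authors would instead override the method in their own subclass):

.. code-block:: python

    import pandas as pd

    arr = pd.array([1, 2, 3], dtype="Int64")
    # keepdims=False returns a scalar, as before.
    arr._reduce("sum", skipna=True, keepdims=False)
    # keepdims=True returns a length-1 array of the same dtype, which is
    # what lets DataFrame reductions preserve the extension dtype.
    arr._reduce("sum", skipna=True, keepdims=True)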

Copy-on-Write improvements

  • :meth:`Series.transform` not respecting Copy-on-Write when func modifies :class:`Series` inplace (:issue:`53747`)
  • Calling :meth:`Index.values` will now return a read-only NumPy array (:issue:`53704`)
  • Setting a :class:`Series` into a :class:`DataFrame` now creates a lazy copy instead of a deep copy (:issue:`53142`)
  • The :class:`DataFrame` constructor, when constructing a DataFrame from a dictionary of Index objects and specifying copy=False, will now use a lazy copy of those Index objects for the columns of the DataFrame (:issue:`52947`)
  • A shallow copy of a Series or DataFrame (df.copy(deep=False)) will now also return a shallow copy of the rows/columns :class:`Index` objects instead of only a shallow copy of the data, i.e. the index of the result is no longer identical (df.copy(deep=False).index is df.index is no longer True) (:issue:`53721`)
  • :meth:`DataFrame.head` and :meth:`DataFrame.tail` will now return deep copies (:issue:`54011`)
  • Add lazy copy mechanism to :meth:`DataFrame.eval` (:issue:`53746`)
  • Trying to operate inplace on a temporary column selection (for example, df["a"].fillna(100, inplace=True)) will now always raise a warning when Copy-on-Write is enabled. In this mode, operating inplace like this will never work, since the selection behaves as a temporary copy. A short sketch follows the list below. This holds true for:
    • DataFrame.update / Series.update
    • DataFrame.fillna / Series.fillna
    • DataFrame.replace / Series.replace
    • DataFrame.clip / Series.clip
    • DataFrame.where / Series.where
    • DataFrame.mask / Series.mask
    • DataFrame.interpolate / Series.interpolate
    • DataFrame.ffill / Series.ffill
    • DataFrame.bfill / Series.bfill
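
A minimal sketch of the warning, assuming Copy-on-Write has been enabled via the mode.copy_on_write option:

.. code-block:: python

    import pandas as pd

    pd.set_option("mode.copy_on_write", True)
    df = pd.DataFrame({"a": [1.0, None], "b": [3, 4]})
    # Under Copy-on-Write, df["a"] is a temporary copy, so this inplace
    # call emits a warning and df itself is never modified.
    df["a"].fillna(100, inplace=True)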

New :meth:`DataFrame.map` method and support for ExtensionArrays

The :meth:`DataFrame.map` method has been added and :meth:`DataFrame.applymap` has been deprecated. :meth:`DataFrame.map` has the same functionality as :meth:`DataFrame.applymap`, but the new name better communicates that this is the :class:`DataFrame` version of :meth:`Series.map` (:issue:`52353`).

When given a callable, :meth:`Series.map` applies the callable to all elements of the :class:`Series`. Similarly, :meth:`DataFrame.map` applies the callable to all elements of the :class:`DataFrame`, while :meth:`Index.map` applies the callable to all elements of the :class:`Index`.

Frequently, it is not desirable to apply the callable to nan-like values of the array, and to avoid doing that, the map method could be called with na_action="ignore", i.e. ser.map(func, na_action="ignore"). However, na_action="ignore" was not implemented for many :class:`.ExtensionArray` and Index types, and na_action="ignore" did not work correctly for any :class:`.ExtensionArray` subclass except the nullable numeric ones (i.e. with dtype :class:`Int64` etc.).

na_action="ignore" now works for all array types (:issue:`52219`, :issue:`51645`, :issue:`51809`, :issue:`51936`, :issue:`52033`, :issue:`52096`).

Previous behavior:

.. code-block:: ipython

    In [1]: ser = pd.Series(["a", "b", np.nan], dtype="category")
    In [2]: ser.map(str.upper, na_action="ignore")
    NotImplementedError
    In [3]: df = pd.DataFrame(ser)
    In [4]: df.applymap(str.upper, na_action="ignore")  # worked for DataFrame
         0
    0    A
    1    B
    2  NaN
    In [5]: idx = pd.Index(ser)
    In [6]: idx.map(str.upper, na_action="ignore")
    TypeError: CategoricalIndex.map() got an unexpected keyword argument 'na_action'

New behavior:

.. ipython:: python

    ser = pd.Series(["a", "b", np.nan], dtype="category")
    ser.map(str.upper, na_action="ignore")
    df = pd.DataFrame(ser)
    df.map(str.upper, na_action="ignore")
    idx = pd.Index(ser)
    idx.map(str.upper, na_action="ignore")

Also, note that :meth:`Categorical.map` implicitly has had its na_action set to "ignore" by default. This has been deprecated and the default for :meth:`Categorical.map` will change to na_action=None, consistent with all the other array types.
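
A minimal sketch: passing na_action explicitly avoids the deprecation warning about the changing default:

.. code-block:: python

    import pandas as pd

    cat = pd.Categorical(["a", "b", None])
    # Passing na_action explicitly avoids relying on the deprecated
    # implicit default of "ignore".
    cat.map(str.upper, na_action="ignore")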

New implementation of :meth:`DataFrame.stack`

pandas has reimplemented :meth:`DataFrame.stack`. To use the new implementation, pass the argument future_stack=True. This will become the only option in pandas 3.0.

The previous implementation had two main behavioral downsides.

  1. The previous implementation would unnecessarily introduce NA values into the result. The user could have NA values automatically removed by passing dropna=True (the default), but doing this could also remove NA values from the result that existed in the input. See the examples below.
  2. The previous implementation with sort=True (the default) would sometimes sort part of the resulting index, and sometimes not. If the input's columns are not a :class:`MultiIndex`, then the resulting index would never be sorted. If the columns are a :class:`MultiIndex`, then in most cases the level(s) in the resulting index that come from stacking the column level(s) would be sorted. In rare cases such level(s) would be sorted in a non-standard order, depending on how the columns were created.

The new implementation (future_stack=True) will no longer unnecessarily introduce NA values when stacking multiple levels and will never sort. As such, the arguments dropna and sort are not utilized and must remain unspecified when using future_stack=True. These arguments will be removed in the next major release.

.. ipython:: python

    columns = pd.MultiIndex.from_tuples([("B", "d"), ("A", "c")])
    df = pd.DataFrame([[0, 2], [1, 3]], index=["z", "y"], columns=columns)
    df

In the previous version (future_stack=False), the default of dropna=True would remove unnecessarily introduced NA values but still coerce the dtype to float64 in the process. In the new version, no NAs are introduced and so there is no coercion of the dtype.

.. ipython:: python
    :okwarning:

    df.stack([0, 1], future_stack=False, dropna=True)
    df.stack([0, 1], future_stack=True)

If the input contains NA values, the previous version would drop those as well with dropna=True or introduce new NA values with dropna=False. The new version persists all values from the input.

.. ipython:: python
    :okwarning:

    df = pd.DataFrame([[0, 2], [np.nan, np.nan]], columns=columns)
    df
    df.stack([0, 1], future_stack=False, dropna=True)
    df.stack([0, 1], future_stack=False, dropna=False)
    df.stack([0, 1], future_stack=True)

Other enhancements

Backwards incompatible API changes

Increased minimum version for Python

pandas 2.1.0 supports Python 3.9 and higher.

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

==================== =============== ======== =======
Package              Minimum Version Required Changed
==================== =============== ======== =======
numpy                1.22.4          X        X
mypy (dev)           1.4.1                    X
beautifulsoup4       4.11.1                   X
bottleneck           1.3.4                    X
dataframe-api-compat 0.1.7                    X
fastparquet          0.8.1                    X
fsspec               2022.05.0                X
hypothesis           6.46.1                   X
gcsfs                2022.05.0                X
jinja2               3.1.2                    X
lxml                 4.8.0                    X
numba                0.55.2                   X
numexpr              2.8.0                    X
openpyxl             3.0.10                   X
pandas-gbq           0.17.5                   X
psycopg2             2.9.3                    X
pyreadstat           1.1.5                    X
pyqt5                5.15.6                   X
pytables             3.7.0                    X
pytest               7.3.2                    X
python-snappy        0.6.1                    X
pyxlsb               1.0.9                    X
s3fs                 2022.05.0                X
scipy                1.8.1                    X
sqlalchemy           1.4.36                   X
tabulate             0.8.10                   X
xarray               2022.03.0                X
xlsxwriter           3.0.3                    X
zstandard            0.17.0                   X
==================== =============== ======== =======

For optional libraries, the general recommendation is to use the latest version.

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.
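
To check which versions are installed in a given environment, :func:`pandas.show_versions` prints the pandas version along with those of its detected required and optional dependencies:

.. code-block:: python

    import pandas as pd

    # Prints the versions of pandas and of installed dependencies,
    # which can be compared against the table above.
    pd.show_versions()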

Other API changes

Deprecations

Deprecated silent upcasting in setitem-like Series operations

PDEP-6: https://pandas.pydata.org/pdeps/0006-ban-upcasting.html

Setitem-like operations on Series (or DataFrame columns) which silently upcast the dtype are deprecated and show a warning. Examples of affected operations are:

  • ser.fillna('foo', inplace=True)
  • ser.where(ser.isna(), 'foo', inplace=True)
  • ser.iloc[indexer] = 'foo'
  • ser.loc[indexer] = 'foo'
  • df.iloc[indexer, 0] = 'foo'
  • df.loc[indexer, 'a'] = 'foo'
  • ser[indexer] = 'foo'

where ser is a :class:`Series`, df is a :class:`DataFrame`, and indexer could be a slice, a mask, a single value, a list or array of values, or any other allowed indexer.

In a future version, these will raise an error and you should cast to a common dtype first.

Previous behavior:

.. code-block:: ipython

    In [1]: ser = pd.Series([1, 2, 3])

    In [2]: ser
    Out[2]:
    0    1
    1    2
    2    3
    dtype: int64

    In [3]: ser[0] = 'not an int64'

    In [4]: ser
    Out[4]:
    0    not an int64
    1               2
    2               3
    dtype: object

New behavior:

.. code-block:: ipython

    In [1]: ser = pd.Series([1, 2, 3])

    In [2]: ser
    Out[2]:
    0    1
    1    2
    2    3
    dtype: int64

    In [3]: ser[0] = 'not an int64'
    FutureWarning:
      Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas.
      Value 'not an int64' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.

    In [4]: ser
    Out[4]:
    0    not an int64
    1               2
    2               3
    dtype: object

To retain the current behaviour, in the case above, you could cast ser to object dtype first:

.. ipython:: python

  ser = pd.Series([1, 2, 3])
  ser = ser.astype('object')
  ser[0] = 'not an int64'
  ser

Depending on the use-case, it might be more appropriate to cast to a different dtype. In the following, for example, we cast to float64:

.. ipython:: python

  ser = pd.Series([1, 2, 3])
  ser = ser.astype('float64')
  ser[0] = 1.1
  ser

For further reading, please see https://pandas.pydata.org/pdeps/0006-ban-upcasting.html.

Deprecated parsing datetimes with mixed time zones

Parsing datetimes with mixed time zones is deprecated and shows a warning unless the user passes utc=True to :func:`to_datetime` (:issue:`50887`).

Previous behavior:

.. code-block:: ipython

    In [7]: data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]

    In [8]: pd.to_datetime(data, utc=False)
    Out[8]:
    Index([2020-01-01 00:00:00+06:00, 2020-01-01 00:00:00+01:00], dtype='object')

New behavior:

.. code-block:: ipython

    In [9]: pd.to_datetime(data, utc=False)
    FutureWarning:
      In a future version of pandas, parsing datetimes with mixed time zones will raise
      an error unless `utc=True`. Please specify `utc=True` to opt in to the new behaviour
      and silence this warning. To create a `Series` with mixed offsets and `object` dtype,
      please use `apply` and `datetime.datetime.strptime`.
    Index([2020-01-01 00:00:00+06:00, 2020-01-01 00:00:00+01:00], dtype='object')

In order to silence this warning and avoid an error in a future version of pandas, please specify utc=True:

.. ipython:: python

    data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]
    pd.to_datetime(data, utc=True)

To create a Series with mixed offsets and object dtype, please use apply and datetime.datetime.strptime:

.. ipython:: python

    import datetime as dt

    data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]
    pd.Series(data).apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S%z'))

Other Deprecations

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

I/O

Period

Plotting

Groupby/resample/rolling

Reshaping

Sparse

ExtensionArray

Styler

Metadata

Other

Contributors

.. contributors:: v2.0.3..v2.1.0|HEAD