v1.5.0.rst

What's new in 1.5.0 (??)

These are the changes in pandas 1.5.0. See release for a full changelog including other versions of pandas.

{{ header }}

Enhancements

Styler

  • New method .Styler.to_string for alternative customisable output methods (44502)
  • Added the ability to render border and border-{side} CSS properties in Excel (42276)
  • Added a new method .Styler.concat which allows adding customised footer rows to visualise additional calculations on the data, e.g. totals and counts (43875, 46186)
  • .Styler.highlight_null now accepts color consistently with other builtin methods and deprecates null_color although this remains backwards compatible (45907)

enhancement2

Other enhancements

  • MultiIndex.to_frame now supports the argument allow_duplicates and raises on duplicate labels if it is missing or False (45245)
  • StringArray now accepts array-likes containing nan-likes (None, np.nan) for the values parameter in its constructor in addition to strings and pandas.NA. (40839)
  • Improved the rendering of categories in CategoricalIndex (45218)
  • to_numeric now preserves float64 arrays when downcasting would generate values not representable in float32 (43693)
  • Series.reset_index and DataFrame.reset_index now support the argument allow_duplicates (44410)
  • .GroupBy.min and .GroupBy.max now support Numba execution with the engine keyword (45428)
  • read_csv now supports defaultdict as a dtype parameter (41574)
  • DataFrame.rolling and Series.rolling now support a step parameter with fixed-length windows (15354)
  • Implemented a bool-dtype Index; passing a bool-dtype array-like to pd.Index will now retain bool dtype instead of casting to object (45061)
  • Implemented a complex-dtype Index; passing a complex-dtype array-like to pd.Index will now retain complex dtype instead of casting to object (45845)
  • Improved error message in ~pandas.core.window.Rolling when window is a frequency and NaT is in the rolling axis (46087)
  • Series and DataFrame with IntegerDtype now support bitwise operations (34463)
  • Add milliseconds field support for ~pandas.DateOffset (43371)
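
A few of the enhancements listed above can be sketched together as follows. This is a minimal illustration that assumes pandas 1.5 or later; the variable names and the inline CSV text are made up for the example.

```python
from collections import defaultdict
from io import StringIO

import pandas as pd

# A bool-dtype array-like now keeps bool dtype in an Index (45061).
bool_index = pd.Index([True, False, True])

# Fixed-length rolling window evaluated at every second row (15354).
stepped = pd.Series([1, 2, 3, 4, 5]).rolling(window=2, step=2).sum()

# defaultdict as dtype: column "a" is int64, any other column
# falls back to the default factory, here float64 (41574).
dtypes = defaultdict(lambda: "float64", a="int64")
frame = pd.read_csv(StringIO("a,b,c\n1,2,3\n4,5,6\n"), dtype=dtypes)

# DateOffset with the new milliseconds field (43371).
shifted = pd.Timestamp("2022-01-01") + pd.DateOffset(milliseconds=500)

print(bool_index.dtype)
print(stepped)
print(frame.dtypes)
print(shifted)
```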

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Styler

  • Fixed bug in CSSToExcelConverter leading to TypeError when border color provided without border style for xlsxwriter engine (42276)

Using dropna=True with groupby transforms

A transform is an operation whose result has the same size as its input. When the result is a DataFrame or Series, it is also required that the index of the result matches that of the input. In pandas 1.4, using .DataFrameGroupBy.transform or .SeriesGroupBy.transform with null values in the groups and dropna=True gave incorrect results. As the examples below demonstrate, the results either contained incorrect values or did not have the same index as the input.

python

df = pd.DataFrame({'a': [1, 1, np.nan], 'b': [2, 3, 4]})

Old behavior:

In [3]: # Value in the last row should be np.nan
        df.groupby('a', dropna=True).transform('sum')
Out[3]:
   b
0  5
1  5
2  5

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x.sum())
Out[3]:
   b
0  5
1  5

In [3]: # The value in the last row is np.nan interpreted as an integer
        df.groupby('a', dropna=True).transform('ffill')
Out[3]:
                     b
0                    2
1                    3
2 -9223372036854775808

In [3]: # Should have one additional row with the value np.nan
        df.groupby('a', dropna=True).transform(lambda x: x)
Out[3]:
   b
0  2
1  3

New behavior:

python

df.groupby('a', dropna=True).transform('sum')
df.groupby('a', dropna=True).transform(lambda x: x.sum())
df.groupby('a', dropna=True).transform('ffill')
df.groupby('a', dropna=True).transform(lambda x: x)
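
The two properties described in this section, an index that matches the input and a missing value for rows whose group key is null, can be checked with a short sketch. It assumes pandas 1.5 or later; the names by_name and by_callable are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, np.nan], "b": [2, 3, 4]})

# Transform by name and by an equivalent user-defined function.
by_name = df.groupby("a", dropna=True).transform("sum")
by_callable = df.groupby("a", dropna=True).transform(lambda x: x.sum())

for result in (by_name, by_callable):
    # The result keeps the index of the input ...
    print(result.index.equals(df.index))
    # ... and the row whose group key is null holds a missing value.
    print(result["b"].isna().tolist())
```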

Styler

  • Fix showing "None" as ylabel in Series.plot when not setting ylabel (46129)

notable_bug_fix2

Backwards incompatible API changes

read_xml now supports dtype, converters, and parse_dates

Similar to other IO methods, pandas.read_xml now supports assigning specific dtypes to columns, applying converter methods, and parsing dates (43567).

python

xml_dates = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <shape>square</shape>
    <degrees>00360</degrees>
    <sides>4.0</sides>
    <date>2020-01-01</date>
  </row>
  <row>
    <shape>circle</shape>
    <degrees>00360</degrees>
    <sides/>
    <date>2021-01-01</date>
  </row>
  <row>
    <shape>triangle</shape>
    <degrees>00180</degrees>
    <sides>3.0</sides>
    <date>2022-01-01</date>
  </row>
</data>"""

df = pd.read_xml(
    xml_dates,
    dtype={'sides': 'Int64'},
    converters={'degrees': str},
    parse_dates=['date']
)
df
df.dtypes

read_xml now supports large XML using iterparse

For very large XML files, ranging from hundreds of megabytes to gigabytes, pandas.read_xml now supports parsing with lxml's iterparse and etree's iterparse, which are memory-efficient methods to iterate through XML trees and extract specific elements and attributes without holding the entire tree in memory (45442).

In [1]: df = pd.read_xml(
...      "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
...      iterparse={"page": ["title", "ns", "id"]}
...  )

In [2]: df
Out[2]:
                                                     title   ns        id
0                                       Gettysburg Address    0     21450
1                                                Main Page    0     42950
2                            Declaration by United Nations    0      8435
3             Constitution of the United States of America    0      8435
4                     Declaration of Independence (Israel)    0     17858
...                                                    ...  ...       ...
3578760               Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761               Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762               Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763      The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764  Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]
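
The same call can be tried on a small file. The sketch below writes a tiny XML document to a temporary directory, since iterparse is described above as working on files on disk, and reads it back with the standard-library etree parser. The file name shapes.xml and its contents are invented for illustration, and pandas 1.5 or later is assumed.

```python
import os
import tempfile

import pandas as pd

xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <row><shape>square</shape><degrees>360</degrees><sides>4.0</sides></row>
  <row><shape>circle</shape><degrees>360</degrees><sides/></row>
  <row><shape>triangle</shape><degrees>180</degrees><sides>3.0</sides></row>
</data>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the document to a local file and pass its path as a string.
    xml_path = os.path.join(tmp_dir, "shapes.xml")
    with open(xml_path, "w", encoding="utf-8") as handle:
        handle.write(xml_text)

    # The dict key is the repeating row element; the list names the
    # child elements to extract from each row.
    shapes = pd.read_xml(
        xml_path,
        parser="etree",
        iterparse={"row": ["shape", "degrees", "sides"]},
    )

print(shapes)
```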

api_breaking_change2

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package      Minimum Version   Required   Changed
mypy (dev)   0.941                        X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package Minimum Version Changed

X

See install.dependencies and install.optional_dependencies for more.

Other API changes

  • BigQuery I/O methods read_gbq and DataFrame.to_gbq default to auth_local_webserver = True. Google has deprecated the auth_local_webserver = False "out of band" (copy-paste) flow. The auth_local_webserver = False option is planned to stop working in October 2022. (46312)

Deprecations

In a future version, integer slicing on a Series with an Int64Index or RangeIndex will be treated as label-based, not positional. This will make the behavior consistent with other Series.__getitem__ and Series.__setitem__ behaviors (45162).

For example:

python

ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

In the old behavior, ser[2:4] treats the slice as positional:

Old behavior:

In [3]: ser[2:4]
Out[3]:
5    3
7    4
dtype: int64

In a future version, this will be treated as label-based:

Future behavior:

In [4]: ser.loc[2:4]
Out[4]:
2    1
3    2
dtype: int64

To retain the old behavior, use series.iloc[i:j]. To get the future behavior, use series.loc[i:j].

Slicing on a DataFrame will not be affected.
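
Spelling out the intent with iloc or loc avoids the ambiguity, whichever pandas version is installed. A minimal sketch using the same Series as above (the names positional and by_label are illustrative):

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=[2, 3, 5, 7, 11])

# Positional: the rows at positions 2 and 3.
positional = ser.iloc[2:4]

# Label-based: the labels from 2 through 4, both ends inclusive.
by_label = ser.loc[2:4]

print(positional)
print(by_label)
```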

ExcelWriter attributes

All attributes of ExcelWriter were previously documented as not public. However, some third party Excel engines documented accessing ExcelWriter.book or ExcelWriter.sheets, and users were utilizing these and possibly other attributes. Previously these attributes were not safe to use; e.g. modifications to ExcelWriter.book would not update ExcelWriter.sheets and vice versa. In order to support this, pandas has made some attributes public and improved their implementations so that they may now be safely used. (45572)

The following attributes are now public and considered safe to access.

  • book
  • check_extension
  • close
  • date_format
  • datetime_format
  • engine
  • if_sheet_exists
  • sheets
  • supported_extensions

The following attributes have been deprecated. They now raise a FutureWarning when accessed and will be removed in a future version. Users should be aware that their usage is considered unsafe, and can lead to unexpected results.

  • cur_sheet
  • handles
  • path
  • save
  • write_cells

See the documentation of ExcelWriter for further details.

Other Deprecations

  • Deprecated the keyword line_terminator in DataFrame.to_csv and Series.to_csv, use lineterminator instead; this is for consistency with read_csv and the standard library 'csv' module (9568)
  • Deprecated behavior of SparseArray.astype, Series.astype, and DataFrame.astype with SparseDtype when passing a non-sparse dtype. In a future version, this will cast to that non-sparse dtype instead of wrapping it in a SparseDtype (34457)
  • Deprecated behavior of DatetimeIndex.intersection and DatetimeIndex.symmetric_difference (union behavior was already deprecated in version 1.3.0) with mixed time zones; in a future version both will be cast to UTC instead of object dtype (39328, 45357)
  • Deprecated DataFrame.iteritems, Series.iteritems, HDFStore.iteritems in favor of DataFrame.items, Series.items, HDFStore.items (45321)
  • Deprecated Series.is_monotonic and Index.is_monotonic in favor of Series.is_monotonic_increasing and Index.is_monotonic_increasing (45422, 21335)
  • Deprecated behavior of DatetimeIndex.astype, TimedeltaIndex.astype, PeriodIndex.astype when converting to an integer dtype other than int64. In a future version, these will convert to exactly the specified dtype (instead of always int64) and will raise if the conversion overflows (45034)
  • Deprecated the __array_wrap__ method of DataFrame and Series, rely on standard numpy ufuncs instead (45451)
  • Deprecated treating float-dtype data as wall-times when passed with a timezone to Series or DatetimeIndex (45573)
  • Deprecated the behavior of Series.fillna and DataFrame.fillna with timedelta64[ns] dtype and incompatible fill value; in a future version this will cast to a common dtype (usually object) instead of raising, matching the behavior of other dtypes (45746)
  • Deprecated the warn parameter in infer_freq (45947)
  • Deprecated allowing non-keyword arguments in ExtensionArray.argsort (46134)
  • Deprecated treating all-bool object-dtype columns as bool-like in DataFrame.any and DataFrame.all with bool_only=True, explicitly cast to bool instead (46188)
  • Deprecated the behavior of DataFrame.quantile: in a future version the numeric_only argument will default to False, so datetime/timedelta columns will be included in the result (7308)
  • Deprecated Timedelta.freq and Timedelta.is_populated (46430)
  • Deprecated Timedelta.delta (46476)
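
Three of the replacements named in the list above, shown side by side. This is a sketch assuming pandas 1.5 or later, with a made-up two-column frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# lineterminator replaces the deprecated line_terminator keyword (9568).
csv_text = df.to_csv(index=False, lineterminator="\r\n")

# items replaces the deprecated iteritems (45321).
column_names = [name for name, _column in df.items()]

# is_monotonic_increasing replaces the deprecated is_monotonic (45422).
increasing = df["a"].is_monotonic_increasing

print(repr(csv_text))
print(column_names)
print(increasing)
```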

Performance improvements

  • Performance improvement in DataFrame.corrwith for column-wise (axis=0) Pearson and Spearman correlation when other is a Series (46174)
  • Performance improvement in .GroupBy.transform for some user-defined DataFrame -> Series functions (45387)
  • Performance improvement in DataFrame.duplicated when subset consists of only one column (45236)
  • Performance improvement in .GroupBy.diff (16706)
  • Performance improvement in .GroupBy.transform when broadcasting values for user-defined functions (45708)
  • Performance improvement in .GroupBy.transform for user-defined functions when only a single group exists (44977)
  • Performance improvement in DataFrame.loc and Series.loc for tuple-based indexing of a MultiIndex (45681, 46040, 46330)
  • Performance improvement in MultiIndex.values when the MultiIndex contains levels of type DatetimeIndex, TimedeltaIndex or ExtensionDtypes (46288)
  • Performance improvement in merge when left and/or right are empty (45838)
  • Performance improvement in DataFrame.join when left and/or right are empty (46015)
  • Performance improvement in DataFrame.reindex and Series.reindex when target is a MultiIndex (46235)
  • Performance improvement when setting values in a pyarrow backed string array (46400)
  • Performance improvement in factorize (46109)
  • Performance improvement in DataFrame and Series constructors for extension dtype scalars (45854)

Bug fixes

Categorical

  • Bug in Categorical.view not accepting integer dtypes (25464)
  • Bug in CategoricalIndex.union when the index's categories are integer-dtype and the index contains NaN values incorrectly raising instead of casting to float64 (45362)

Datetimelike

  • Bug in DataFrame.quantile with datetime-like dtypes and no rows incorrectly returning float64 dtype instead of retaining datetime-like dtype (41544)
  • Bug in to_datetime with sequences of np.str_ objects incorrectly raising (32264)
  • Bug in Timestamp construction when passing datetime components as positional arguments and tzinfo as a keyword argument incorrectly raising (31929)
  • Bug in Index.astype when casting from object dtype to timedelta64[ns] dtype incorrectly casting np.datetime64("NaT") values to np.timedelta64("NaT") instead of raising (45722)
  • Bug in SeriesGroupBy.value_counts index when passing categorical column (44324)
  • Bug in DatetimeIndex.tz_localize localizing to UTC failing to make a copy of the underlying data (46460)

Timedelta

  • Bug in astype_nansafe where astype("timedelta64[ns]") failed when np.nan was included (45798)

Time Zones

Numeric

  • Bug in operations with array-likes with dtype="boolean" and NA incorrectly altering the array in-place (45421)
  • Bug in division, pow and mod operations on array-likes with dtype="boolean" not being like their np.bool_ counterparts (46063)
  • Bug in multiplying a Series with IntegerDtype or FloatingDtype by an array-like with timedelta64[ns] dtype incorrectly raising (45622)

Conversion

  • Bug in DataFrame.astype not preserving subclasses (40810)
  • Bug in constructing a Series from a float-containing list or a floating-dtype ndarray-like (e.g. dask.Array) and an integer dtype raising instead of casting like we would with an np.ndarray (40110)
  • Bug in Float64Index.astype to unsigned integer dtype incorrectly casting to np.int64 dtype (45309)
  • Bug in Series.astype and DataFrame.astype from floating dtype to unsigned integer dtype failing to raise in the presence of negative values (45151)
  • Bug in array with FloatingDtype and values containing float-castable strings incorrectly raising (45424)
  • Bug when comparing string and datetime64[ns] objects causing an OverflowError exception (45506)

Strings

  • Bug in str.startswith and str.endswith when passing another Series as the parameter _pat; this now raises TypeError (3485)

Interval

  • Bug in IntervalArray.__setitem__ when setting np.nan into an integer-backed array raising ValueError instead of TypeError (45484)

Indexing

  • Bug in loc.__getitem__ with a list of keys causing an internal inconsistency that could lead to a disconnect between frame.at[x, y] vs frame[y].loc[x] (22372)
  • Bug in DataFrame.iloc where indexing a single row on a DataFrame with a single ExtensionDtype column gave a copy instead of a view on the underlying data (45241)
  • Bug in Series.align not creating a MultiIndex with the union of levels when the intersections of both MultiIndexes are identical (45224)
  • Bug in setting a NA value (None or np.nan) into a Series with int-based IntervalDtype incorrectly casting to object dtype instead of a float-based IntervalDtype (45568)
  • Bug in indexing setting values into an ExtensionDtype column with df.iloc[:, i] = values with values having the same dtype as df.iloc[:, i] incorrectly inserting a new array instead of setting in-place (33457)
  • Bug in Series.__setitem__ with a non-integer Index when using an integer key to set a value that cannot be set inplace where a ValueError was raised instead of casting to a common dtype (45070)
  • Bug in Series.__setitem__ when setting incompatible values into a PeriodDtype or IntervalDtype Series raising when indexing with a boolean mask but coercing when indexing with otherwise-equivalent indexers; these now consistently coerce, along with Series.mask and Series.where (45768)
  • Bug in DataFrame.where with multiple columns with datetime-like dtypes failing to downcast results consistent with other dtypes (45837)
  • Bug in Series.loc.__setitem__ and Series.loc.__getitem__ not raising when using multiple keys without using a MultiIndex (13831)
  • Bug in Index.reindex raising AssertionError when level was specified but no MultiIndex was given; level is ignored now (35132)
  • Bug when setting a value too large for a Series dtype failing to coerce to a common type (26049, 32878)
  • Bug in loc.__setitem__ treating range keys as positional instead of label-based (45479)
  • Bug in Series.__setitem__ when setting boolean dtype values containing NA incorrectly raising instead of casting to boolean dtype (45462)
  • Bug in Series.__setitem__ where setting NA into a numeric-dtype Series would incorrectly upcast to object-dtype rather than treating the value as np.nan (44199)
  • Bug in Series.__setitem__ with datetime64[ns] dtype, an all-False boolean mask, and an incompatible value incorrectly casting to object instead of retaining datetime64[ns] dtype (45967)
  • Bug in Index.__getitem__ raising ValueError when indexer is from boolean dtype with NA (45806)
  • Bug in Series.mask with inplace=True or setting values with a boolean mask with small integer dtypes incorrectly raising (45750)
  • Bug in DataFrame.mask with inplace=True and ExtensionDtype columns incorrectly raising (45577)
  • Bug in getting a column from a DataFrame with an object-dtype row index with datetime-like values: the resulting Series now preserves the exact object-dtype Index from the parent DataFrame (42950)
  • Bug in DataFrame.__getattribute__ raising AttributeError if columns have "string" dtype (46185)
  • Bug in indexing on a DatetimeIndex with a np.str_ key incorrectly raising (45580)
  • Bug in CategoricalIndex.get_indexer when index contains NaN values, resulting in elements that are in target but not present in the index to be mapped to the index of the NaN element, instead of -1 (45361)
  • Bug in setting large integer values into Series with float32 or float16 dtype incorrectly altering these values instead of coercing to float64 dtype (45844)
  • Bug in Series.asof and DataFrame.asof incorrectly casting bool-dtype results to float64 dtype (16063)

Missing

  • Bug in Series.fillna and DataFrame.fillna with downcast keyword not being respected in some cases where there are no NA values present (45423)
  • Bug in Series.fillna and DataFrame.fillna with IntervalDtype and incompatible value raising instead of casting to a common (usually object) dtype (45796)
  • Bug in DataFrame.interpolate with object-dtype column not returning a copy with inplace=False (45791)

MultiIndex

  • Bug in DataFrame.loc returning empty result when slicing a MultiIndex with a negative step size and non-null start/stop values (46156)
  • Bug in DataFrame.loc raising when slicing a MultiIndex with a negative step size other than -1 (46156)
  • Bug in DataFrame.loc raising when slicing a MultiIndex with a negative step size and slicing a non-int labeled index level (46156)
  • Bug in Series.to_numpy where multiindexed Series could not be converted to numpy arrays when an na_value was supplied (45774)
  • Bug in MultiIndex.equals not being commutative when only one side has extension array dtype (46026)

I/O

  • Bug in DataFrame.to_stata where no error is raised if the DataFrame contains -np.inf (45350)
  • Bug in read_excel resulting in an infinite loop with certain skiprows callables (45585)
  • Bug in DataFrame.info where a new line at the end of the output is omitted when called on an empty DataFrame (45494)
  • Bug in read_csv not recognizing line break for on_bad_lines="warn" for engine="c" (41710)
  • Bug in DataFrame.to_csv not respecting float_format for Float64 dtype (45991)
  • Bug in read_csv not respecting a specified converter to index columns in all cases (40589)
  • Bug in read_parquet when engine="pyarrow" which caused partial write to disk when column of unsupported datatype was passed (44914)
  • Bug in DataFrame.to_excel and ExcelWriter raising when writing an empty DataFrame to a .ods file (45793)
  • Bug in Parquet roundtrip for Interval dtype with datetime64[ns] subtype (45881)
  • Bug in read_excel when reading a .ods file with newlines between xml elements (45598)

Period

  • Bug in subtraction of Period from PeriodArray returning wrong results (45999)
  • Bug in Period.strftime and PeriodIndex.strftime, directives %l and %u were giving wrong results (46252)

Plotting

  • Bug in DataFrame.plot.barh that prevented labeling the x-axis and xlabel updating the y-axis label (45144)
  • Bug in DataFrame.plot.box that prevented labeling the x-axis (45463)
  • Bug in DataFrame.boxplot that prevented passing in xlabel and ylabel (45463)
  • Bug in DataFrame.boxplot that prevented specifying vert=False (36918)
  • Bug in DataFrame.plot.scatter that prevented specifying norm (45809)

Groupby/resample/rolling

  • Bug in DataFrame.resample ignoring closed="right" on TimedeltaIndex (45414)
  • Bug in .DataFrameGroupBy.transform failing when func="size" and the input DataFrame has multiple columns (27469)
  • Bug in .DataFrameGroupBy.size and .DataFrameGroupBy.transform with func="size" produced incorrect results when axis=1 (45715)
  • Bug in .ExponentialMovingWindow.mean with axis=1 and engine='numba' when the DataFrame has more columns than rows (46086)
  • Bug when using engine="numba" would return the same jitted function when modifying engine_kwargs (46086)
  • Bug in .DataFrameGroupBy.transform failing when axis=1 and func is "first" or "last" (45986)
  • Bug in DataFrameGroupBy.cumsum with skipna=False giving incorrect results (46216)
  • Bug in .GroupBy.cumsum with timedelta64[ns] dtype failing to recognize NaT as a null value (46216)
  • Bug in GroupBy.cummin and GroupBy.cummax with nullable dtypes incorrectly altering the original data in place (46220)
  • Bug in GroupBy.cummax with int64 dtype with leading value being the smallest possible int64 (46382)
  • Bug in GroupBy.max with empty groups and uint64 dtype incorrectly raising RuntimeError (46408)
  • Bug in .GroupBy.apply failing when func was a string and args or kwargs were supplied (46479)

Reshaping

  • Bug in concat between a Series with integer dtype and another with CategoricalDtype with integer categories and containing NaN values casting to object dtype instead of float64 (45359)
  • Bug in get_dummies that selected object and categorical dtypes but not string (44965)
  • Bug in DataFrame.align when aligning a MultiIndex to a Series with another MultiIndex (46001)
  • Bug in concatenation with IntegerDtype or FloatingDtype arrays where the resulting dtype did not mirror the behavior of the non-nullable dtypes (46379)

Sparse

  • Bug in Series.where and DataFrame.where with SparseDtype failing to retain the array's fill_value (45691)

ExtensionArray

  • Bug in IntegerArray.searchsorted and FloatingArray.searchsorted returning inconsistent results when acting on np.nan (45255)

Styler

  • Bug when attempting to apply styling functions to an empty DataFrame subset (45313)

Other

Contributors