What's new in 2.0.0 (??)

These are the changes in pandas 2.0.0. See the release notes for a full changelog, including other versions of pandas.

Enhancements

enhancement1

enhancement2

Other enhancements

read_sas now supports using encoding='infer' to correctly read and use the encoding specified by the SAS file. (48048)
DataFrameGroupBy.quantile and SeriesGroupBy.quantile now preserve nullable dtypes instead of casting to numpy dtypes (37493)
Series.add_suffix, DataFrame.add_suffix, Series.add_prefix and DataFrame.add_prefix support an axis argument. If axis is set, it overrides the default choice of which axis to consider (47819)
assert_frame_equal now shows the first element where the DataFrames differ, analogously to pytest's output (47910)
Added new argument use_nullable_dtypes to read_csv to enable automatic conversion to nullable dtypes (36712)
Added index parameter to DataFrame.to_dict (46398)
Added metadata propagation for binary operators on DataFrame (28283)
CategoricalConversionWarning, InvalidComparison, InvalidVersion, LossySetitemError, and NoBufferPresent are now exposed in pandas.errors (27656)
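As a rough illustration of two of the entries above, the following sketch exercises the axis argument of add_prefix and the index parameter of DataFrame.to_dict. The frame and its labels are made up for the example, and it assumes pandas 2.0.0 or later is installed. To my understanding, index=False is only accepted together with orient="split" or orient="tight".

```python
import pandas as pd

# Hypothetical example frame; the labels are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Without axis, DataFrame.add_prefix relabels the columns.
prefixed_cols = df.add_prefix("col_")

# With axis=0, the prefix is applied to the row labels instead.
prefixed_rows = df.add_prefix("row_", axis=0)

# index=False leaves the row labels out of the resulting dict.
split_no_index = df.to_dict(orient="split", index=False)

print(list(prefixed_cols.columns))
print(list(prefixed_rows.index))
print(sorted(split_no_index))
```

Passing axis explicitly is mainly useful for a DataFrame, where the default targets the columns rather than the index.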

Notable bug fixes

These are bug fixes that might have notable behavior changes.

GroupBy.cumsum and GroupBy.cumprod overflow instead of lossy casting to float

In previous versions we cast to float when applying cumsum and cumprod, which led to incorrect results even if the result could be held by the int64 dtype. Additionally, the aggregation now overflows when the limit of int64 is reached, consistent with numpy and the regular DataFrame.cumprod and DataFrame.cumsum methods (37493).

Old Behavior

In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [2]: df.groupby("key")["value"].cumprod()[5]
Out[2]: 5.960464477539062e+16

We return incorrect results with the 6th value.

New Behavior

df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
df.groupby("key")["value"].cumprod()

We overflow with the 7th value, but the 6th value is still correct.
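The arithmetic behind this can be checked without pandas. The sketch below is a plain-Python illustration, not the pandas implementation: wrap_int64 is a hypothetical helper that mimics two's-complement wraparound of a signed 64-bit integer. It shows that the 6th cumulative product fits in int64 but is not exactly representable as a float64, while the 7th exceeds the int64 range.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def wrap_int64(value: int) -> int:
    """Reduce a Python int to the signed 64-bit range (two's complement).

    Hypothetical helper for illustration; pandas does not expose this.
    """
    return (value - INT64_MIN) % 2**64 + INT64_MIN


exact_sixth = 625**6                 # 6th cumulative product, exact
lossy_sixth = float(exact_sixth)     # what a cast to float64 retains
exact_seventh = 625**7               # 7th cumulative product, exact
wrapped_seventh = wrap_int64(exact_seventh)

print(exact_sixth <= INT64_MAX)          # does the 6th product fit in int64?
print(int(lossy_sixth) == exact_sixth)   # does float64 hold it exactly?
print(exact_seventh > INT64_MAX)         # does the 7th product overflow int64?
print(wrapped_seventh)                   # value left after wraparound
```

Because the 6th product is an odd integer above 2**53, a float64 cannot store it exactly, which is why the old float-based result was slightly off.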

notable_bug_fix2

Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package	Minimum Version	Required	Changed
mypy (dev)	0.981		X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package	Minimum Version	Changed
		X

See install.dependencies and install.optional_dependencies for more.

Other API changes

Passing nanoseconds greater than 999 or less than 0 in Timestamp now raises a ValueError (48538, 48255)
read_csv: specifying an incorrect number of columns with index_col now raises ParserError instead of IndexError when using the C parser.
Default value of dtype in get_dummies is changed to bool from uint8 (45848)
DataFrame.astype, Series.astype, and DatetimeIndex.astype casting datetime64 data to any of "datetime64[s]", "datetime64[ms]", "datetime64[us]" will return an object with the given resolution instead of coercing back to "datetime64[ns]" (48928)
DataFrame.astype, Series.astype, and DatetimeIndex.astype casting timedelta64 data to any of "timedelta64[s]", "timedelta64[ms]", "timedelta64[us]" will return an object with the given resolution instead of coercing to "float64" dtype (48963)
Passing a np.datetime64 object with non-nanosecond resolution to Timestamp will retain the input resolution if it is "s", "ms", "us", or "ns"; otherwise it will be cast to the closest supported resolution (49008)
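A few of the changes listed above can be observed directly. The following is a minimal sketch, assuming pandas 2.0.0 or later is installed; the sample values are arbitrary and chosen only for illustration.

```python
import pandas as pd

# get_dummies: indicator columns now default to bool rather than uint8.
dummies = pd.get_dummies(pd.Series(["a", "b", "a"]))
print(dummies.dtypes)

# astype to a supported non-nanosecond unit keeps that resolution.
ser = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-02"]))
as_seconds = ser.astype("datetime64[s]")
print(as_seconds.dtype)

# Timestamp: a nanosecond value outside 0..999 is rejected.
nanosecond_error = None
try:
    pd.Timestamp(year=2022, month=1, day=1, nanosecond=1000)
except ValueError as exc:
    nanosecond_error = exc
print(type(nanosecond_error).__name__)
```

Code that relied on the previous uint8 indicator columns can pass dtype explicitly to get_dummies to keep the old output.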

Deprecations

Performance improvements

Performance improvement in DataFrameGroupBy.median, SeriesGroupBy.median and GroupBy.cumprod for nullable dtypes (37493)
Performance improvement in MultiIndex.argsort and MultiIndex.sort_values (48406)
Performance improvement in MultiIndex.size (48723)
Performance improvement in MultiIndex.union without missing values and without duplicates (48505)
Performance improvement in MultiIndex.difference (48606)
Performance improvement in MultiIndex set operations with sort=None (49010)
Performance improvement in DataFrameGroupBy.mean, SeriesGroupBy.mean, DataFrameGroupBy.var, and SeriesGroupBy.var for extension array dtypes (37493)
Performance improvement in MultiIndex.isin when level=None (48622)
Performance improvement in Index.union and MultiIndex.union when index contains duplicates (48900)
Performance improvement for Series.value_counts with nullable dtype (48338)
Performance improvement for Series constructor passing integer numpy array with nullable dtype (48338)
Performance improvement for DatetimeIndex constructor passing a list (48609)
Performance improvement in merge and DataFrame.join when joining on a sorted MultiIndex (48504)
Performance improvement in DataFrame.loc and Series.loc for tuple-based indexing of a MultiIndex (48384)
Performance improvement for MultiIndex.unique (48335)
Performance improvement in DataFrame.join when joining on a subset of a MultiIndex (48611)
Performance improvement for MultiIndex.intersection (48604)
Performance improvement in var for nullable dtypes (48379)
Performance improvements to read_sas (47403, 47405, 47656, 48502)
Memory improvement in RangeIndex.sort_values (48801)
Performance improvement in DataFrameGroupBy and SeriesGroupBy when by is a categorical type and sort=False (48976)

Bug fixes

Categorical

Bug in Categorical.set_categories losing dtype information (48812)

Datetimelike

Bug in pandas.infer_freq raising TypeError when called on a RangeIndex (47084)
Bug in to_datetime raising on invalid offsets with errors='coerce' and infer_datetime_format=True (48633)
Bug in DatetimeIndex constructor failing to raise when tz=None is explicitly specified in conjunction with timezone-aware dtype or data (48659)
Bug in subtracting a datetime scalar from DatetimeIndex failing to retain the original freq attribute (48818)

Timedelta

Bug in to_timedelta raising error when input has nullable dtype Float64 (48796)
Bug in Timedelta constructor incorrectly raising instead of returning NaT when given a np.timedelta64("nat") (48898)
Bug in Timedelta constructor failing to raise when passed both a Timedelta object and keywords (e.g. days, seconds) (48898)

Timezones

Numeric

Bug in DataFrame.add being unable to apply a ufunc when the inputs mix DataFrame and Series types (39853)

Conversion

Bug in constructing Series with int64 dtype from a string list raising instead of casting (44923)
Bug in DataFrame.eval incorrectly raising an AttributeError when there are negative values in function call (46471)
Bug in Series.convert_dtypes not converting dtype to nullable dtype when Series contains NA and has dtype object (48791)
Bug where any ExtensionDtype subclass with kind="M" would be interpreted as a timezone type (34986)

Strings

Interval

Indexing

Bug in DataFrame.reindex filling with wrong values when indexing columns and index for uint dtypes (48184)
Bug in DataFrame.reindex casting dtype to object when DataFrame has single extension array column when re-indexing columns and index (48190)
Bug in DataFrame.describe when formatting percentiles in the resulting index showed more decimals than needed (46362)

Missing

Bug in Index.equals raising TypeError when Index consists of tuples that contain NA (48446)

MultiIndex

Bug in MultiIndex.argsort raising TypeError when index contains NA (48495)
Bug in MultiIndex.difference losing extension array dtype (48606)
Bug in MultiIndex.set_levels raising IndexError when setting empty level (48636)
Bug in MultiIndex.unique losing extension array dtype (48335)
Bug in MultiIndex.intersection losing extension array (48604)
Bug in MultiIndex.union losing extension array (48498, 48505, 48900)
Bug in MultiIndex.union not sorting when sort=None and index contains missing values (49010)
Bug in MultiIndex.append not checking names for equality (48288)
Bug in MultiIndex.symmetric_difference losing extension array (48607)

I/O

Bug in read_sas causing fragmentation of the DataFrame and raising errors.PerformanceWarning (48595)

Period

Bug in Period.strftime and PeriodIndex.strftime, raising UnicodeDecodeError when a locale-specific directive was passed (46319)

Plotting

Groupby/resample/rolling

Bug in ExponentialMovingWindow with online not raising a NotImplementedError for unsupported operations (48834)
Bug in DataFrameGroupBy.sample raising ValueError when the object is empty (48459)
Bug in Series.groupby raising ValueError when an entry of the index is equal to the name of the index (48567)
Bug in DataFrameGroupBy.resample producing inconsistent results when passing an empty DataFrame (47705)

Reshaping

Bug in DataFrame.pivot_table raising TypeError for nullable dtype and margins=True (48681)
Bug in DataFrame.pivot not respecting None as column name (48293)
Bug in join when left_on or right_on is or includes a CategoricalIndex incorrectly raising AttributeError (48464)

Sparse

ExtensionArray

Bug in Series.mean overflowing unnecessarily with nullable integers (48378)
Bug where concatenating an empty DataFrame with an ExtensionDtype to another DataFrame with the same ExtensionDtype resulted in object dtype (48510)
