What's new in 2.0.0 (??)

These are the changes in pandas 2.0.0. See release for a full changelog including other versions of pandas.

Enhancements

enhancement1

enhancement2

Other enhancements

  • read_sas now supports using encoding='infer' to correctly read and use the encoding specified by the SAS file. (48048)
  • DataFrameGroupBy.quantile and SeriesGroupBy.quantile now preserve nullable dtypes instead of casting to numpy dtypes (37493)
  • Series.add_suffix, DataFrame.add_suffix, Series.add_prefix and DataFrame.add_prefix support an axis argument. If axis is set, it overrides the default choice of which axis to label (47819)
  • assert_frame_equal now shows the first element where the DataFrames differ, analogously to pytest's output (47910)
  • Added new argument use_nullable_dtypes to read_csv to enable automatic conversion to nullable dtypes (36712)
  • Added index parameter to DataFrame.to_dict (46398)
  • Added metadata propagation for binary operators on DataFrame (28283)
  • CategoricalConversionWarning, InvalidComparison, InvalidVersion, LossySetitemError, and NoBufferPresent are now exposed in pandas.errors (27656)
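
As an illustration of two of the enhancements above, here is a minimal sketch, assuming pandas 2.0 or later is installed. The frame and its labels are made up for the example. It applies add_prefix and add_suffix along an explicit axis and calls to_dict with index=False, which is accepted for the "split" and "tight" orients.

```python
import pandas as pd

# Hypothetical example data, not taken from the release notes
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# axis=0 relabels the rows; axis=1 relabels the columns
by_rows = df.add_prefix("row_", axis=0)
by_cols = df.add_suffix("_col", axis=1)

# index=False leaves the index out of the resulting dictionary
split = df.to_dict(orient="split", index=False)

print(list(by_rows.index))
print(list(by_cols.columns))
print(split)
```

Without the axis argument, both methods keep their previous defaults: index labels for a Series and column labels for a DataFrame.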

Notable bug fixes

These are bug fixes that might have notable behavior changes.

GroupBy.cumsum and GroupBy.cumprod overflow instead of lossy casting to float

In previous versions we cast to float when applying cumsum and cumprod, which led to incorrect results even if the result could be held by the int64 dtype. Additionally, the aggregation now overflows, consistent with NumPy and the regular DataFrame.cumprod and DataFrame.cumsum methods, when the limit of int64 is reached (37493).

Old Behavior

In [1]: df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
In [2]: df.groupby("key")["value"].cumprod()[5]
Out[2]: 5.960464477539062e+16

We return incorrect results with the 6th value.

New Behavior

python

df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
df.groupby("key")["value"].cumprod()

We overflow with the 7th value, but the 6th value is still correct.
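
The new behavior can be checked with a short sketch, assuming pandas 2.0 or later. It compares the grouped cumulative product against exact Python integers, which do not overflow. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b"] * 7, "value": 625})
result = df.groupby("key")["value"].cumprod()

# The result stays in int64 instead of being cast to float
print(result.dtype)

# 625**6 still fits in int64, so the 6th value is exact
sixth_is_exact = int(result.iloc[5]) == 625**6

# 625**7 exceeds the int64 limit, so the 7th value wraps around
seventh_overflowed = int(result.iloc[6]) != 625**7

print(sixth_is_exact, seventh_overflowed)
```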

notable_bug_fix2

Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package       Minimum Version   Required   Changed
mypy (dev)    0.981                        X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package Minimum Version Changed

X

See install.dependencies and install.optional_dependencies for more.

Other API changes

  • Passing nanoseconds greater than 999 or less than 0 in Timestamp now raises a ValueError (48538, 48255)
  • read_csv: specifying an incorrect number of columns with index_col now raises ParserError instead of IndexError when using the c parser.
  • Default value of dtype in get_dummies is changed to bool from uint8 (45848)
  • DataFrame.astype, Series.astype, and DatetimeIndex.astype casting datetime64 data to any of "datetime64[s]", "datetime64[ms]", "datetime64[us]" will return an object with the given resolution instead of coercing back to "datetime64[ns]" (48928)
  • DataFrame.astype, Series.astype, and DatetimeIndex.astype casting timedelta64 data to any of "timedelta64[s]", "timedelta64[ms]", "timedelta64[us]" will return an object with the given resolution instead of coercing to "float64" dtype (48963)
  • Passing a np.datetime64 object with non-nanosecond resolution to Timestamp will retain the input resolution if it is "s", "ms", "us", or "ns"; otherwise it will be cast to the closest supported resolution (49008)
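
Several of the changes above are easy to observe directly. The following is a minimal sketch, assuming pandas 2.0 or later and NumPy are installed. The sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# get_dummies now produces bool columns by default
dummies = pd.get_dummies(pd.Series(["a", "b", "a"]))
print(dummies.dtypes)

# astype keeps the requested datetime resolution
dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-02"]))
as_seconds = dates.astype("datetime64[s]")
print(as_seconds.dtype)

# astype keeps the requested timedelta resolution instead of returning float64
deltas = pd.Series(pd.to_timedelta([1, 2], unit="s"))
as_millis = deltas.astype("timedelta64[ms]")
print(as_millis.dtype)

# Timestamp retains a supported non-nanosecond input resolution
ts = pd.Timestamp(np.datetime64("2022-01-01T00:00:00", "s"))
print(ts.unit)

# nanosecond values outside 0..999 are rejected
try:
    pd.Timestamp(year=2022, month=1, day=1, nanosecond=1000)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Code that relied on get_dummies returning uint8 can pass dtype explicitly to keep the old result.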

Deprecations

Performance improvements

  • Performance improvement in DataFrameGroupBy.median, SeriesGroupBy.median, and GroupBy.cumprod for nullable dtypes (37493)
  • Performance improvement in MultiIndex.argsort and MultiIndex.sort_values (48406)
  • Performance improvement in MultiIndex.size (48723)
  • Performance improvement in MultiIndex.union without missing values and without duplicates (48505)
  • Performance improvement in MultiIndex.difference (48606)
  • Performance improvement in MultiIndex set operations with sort=None (49010)
  • Performance improvement in DataFrameGroupBy.mean, SeriesGroupBy.mean, DataFrameGroupBy.var, and SeriesGroupBy.var for extension array dtypes (37493)
  • Performance improvement in MultiIndex.isin when level=None (48622)
  • Performance improvement in Index.union and MultiIndex.union when index contains duplicates (48900)
  • Performance improvement for Series.value_counts with nullable dtype (48338)
  • Performance improvement for Series constructor passing integer numpy array with nullable dtype (48338)
  • Performance improvement for DatetimeIndex constructor passing a list (48609)
  • Performance improvement in merge and DataFrame.join when joining on a sorted MultiIndex (48504)
  • Performance improvement in DataFrame.loc and Series.loc for tuple-based indexing of a MultiIndex (48384)
  • Performance improvement for MultiIndex.unique (48335)
  • Performance improvement in DataFrame.join when joining on a subset of a MultiIndex (48611)
  • Performance improvement for MultiIndex.intersection (48604)
  • Performance improvement in var for nullable dtypes (48379)
  • Performance improvements to read_sas (47403, 47405, 47656, 48502)
  • Memory improvement in RangeIndex.sort_values (48801)
  • Performance improvement in DataFrameGroupBy and SeriesGroupBy when by is a categorical type and sort=False (48976)

Bug fixes

Categorical

  • Bug in Categorical.set_categories losing dtype information (48812)

Datetimelike

  • Bug in pandas.infer_freq raising TypeError when called on a RangeIndex (47084)
  • Bug in to_datetime raising on invalid offsets with errors='coerce' and infer_datetime_format=True (48633)
  • Bug in DatetimeIndex constructor failing to raise when tz=None is explicitly specified in conjunction with timezone-aware dtype or data (48659)
  • Bug in subtracting a datetime scalar from DatetimeIndex failing to retain the original freq attribute (48818)

Timedelta

  • Bug in to_timedelta raising error when input has nullable dtype Float64 (48796)
  • Bug in Timedelta constructor incorrectly raising instead of returning NaT when given a np.timedelta64("nat") (48898)
  • Bug in Timedelta constructor failing to raise when passed both a Timedelta object and keywords (e.g. days, seconds) (48898)
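
Two of the fixed Timedelta behaviors can be sketched as follows, assuming pandas 2.0 or later and NumPy. The input values are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy "not a time" timedelta now converts to pd.NaT instead of raising
nat_result = pd.Timedelta(np.timedelta64("nat"))
print(nat_result is pd.NaT)

# to_timedelta accepts the nullable Float64 dtype
seconds = pd.Series([1.0, 2.5], dtype="Float64")
converted = pd.to_timedelta(seconds, unit="s")
print(converted)
```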

Timezones

Numeric

  • Bug in DataFrame.add failing to apply a ufunc when the inputs mix DataFrame and Series types (39853)

Conversion

  • Bug in constructing Series with int64 dtype from a string list raising instead of casting (44923)
  • Bug in DataFrame.eval incorrectly raising an AttributeError when there are negative values in function call (46471)
  • Bug in Series.convert_dtypes not converting dtype to nullable dtype when Series contains NA and has dtype object (48791)
  • Bug where any ExtensionDtype subclass with kind="M" would be interpreted as a timezone type (34986)
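
The first fix in this list can be illustrated with a short sketch, assuming pandas 2.0 or later. The string values are arbitrary.

```python
import pandas as pd

# A list of numeric strings is cast to the requested integer dtype
ser = pd.Series(["1", "2", "3"], dtype="int64")

print(ser.dtype)
print(ser.tolist())
```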

Strings

Interval

Indexing

  • Bug in DataFrame.reindex filling with wrong values when indexing columns and index for uint dtypes (48184)
  • Bug in DataFrame.reindex casting dtype to object when DataFrame has single extension array column when re-indexing columns and index (48190)
  • Bug in DataFrame.describe where formatting percentiles in the resulting index showed more decimals than needed (46362)

Missing

  • Bug in Index.equals raising TypeError when Index consists of tuples that contain NA (48446)

MultiIndex

  • Bug in MultiIndex.argsort raising TypeError when index contains NA (48495)
  • Bug in MultiIndex.difference losing extension array dtype (48606)
  • Bug in MultiIndex.set_levels raising IndexError when setting empty level (48636)
  • Bug in MultiIndex.unique losing extension array dtype (48335)
  • Bug in MultiIndex.intersection losing extension array dtype (48604)
  • Bug in MultiIndex.union losing extension array dtype (48498, 48505, 48900)
  • Bug in MultiIndex.union not sorting when sort=None and index contains missing values (49010)
  • Bug in MultiIndex.append not checking names for equality (48288)
  • Bug in MultiIndex.symmetric_difference losing extension array dtype (48607)

I/O

  • Bug in read_sas causing fragmentation of the DataFrame and raising errors.PerformanceWarning (48595)

Period

  • Bug in Period.strftime and PeriodIndex.strftime raising UnicodeDecodeError when a locale-specific directive was passed (46319)

Plotting

Groupby/resample/rolling

  • Bug in ExponentialMovingWindow with online not raising a NotImplementedError for unsupported operations (48834)
  • Bug in DataFrameGroupBy.sample raising ValueError when the object is empty (48459)
  • Bug in Series.groupby raising ValueError when an entry of the index is equal to the name of the index (48567)
  • Bug in DataFrameGroupBy.resample producing inconsistent results when passing an empty DataFrame (47705)

Reshaping

  • Bug in DataFrame.pivot_table raising TypeError for nullable dtype and margins=True (48681)
  • Bug in DataFrame.pivot not respecting None as column name (48293)
  • Bug in join when left_on or right_on is or includes a CategoricalIndex incorrectly raising AttributeError (48464)

Sparse

ExtensionArray

  • Bug in Series.mean overflowing unnecessarily with nullable integers (48378)
  • Bug where concatenating an empty DataFrame with an ExtensionDtype to another DataFrame with the same ExtensionDtype resulted in object dtype (48510)
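
Both ExtensionArray fixes can be checked with a small sketch, assuming pandas 2.0 or later. The values are chosen only so that their integer sum would exceed the int64 range.

```python
import pandas as pd

# Each value fits in int64, but their integer sum would not
big = pd.Series([2**62, 2**62, 2**62], dtype="Int64")
mean_value = big.mean()
print(mean_value)

# Concatenating an empty nullable-integer frame keeps the Int64 dtype
empty = pd.DataFrame({"a": pd.array([], dtype="Int64")})
filled = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})
combined = pd.concat([empty, filled])
print(combined["a"].dtype)
```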

Styler

Metadata

  • Fixed metadata propagation in DataFrame.corr and DataFrame.cov (28283)

Other

Contributors