What's new in 1.0.0 (January 29, 2020)

These are the changes in pandas 1.0.0. See :ref:`release` for a full changelog including other versions of pandas.

Note

The pandas 1.0 release removed a lot of functionality that was deprecated in previous releases (see :ref:`below <whatsnew_100.prior_deprecations>` for an overview). It is recommended to first upgrade to pandas 0.25 and to ensure your code is working without warnings, before upgrading to pandas 1.0.
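
One way to confirm that existing code runs without warnings before upgrading is to promote warnings to errors while exercising it. The following is a minimal sketch using the standard library ``warnings`` module; the small DataFrame is a hypothetical stand-in for your own code.

```python
import warnings

import pandas as pd

# Promote deprecation-related warnings to errors so that any use of a
# deprecated API fails loudly instead of being overlooked.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=FutureWarning)
    warnings.simplefilter("error", category=DeprecationWarning)

    # Placeholder for the code under test (hypothetical example data).
    df = pd.DataFrame({"A": [1, 2, 3]})
    total = df["A"].sum()

print(total)
```

If the block raises, the traceback points at the call that needs updating before the move to pandas 1.0.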

New deprecation policy

Starting with pandas 1.0.0, pandas will adopt a variant of SemVer to version releases. Briefly,

  • Deprecations will be introduced in minor releases (e.g. 1.1.0, 1.2.0, 2.1.0, ...)
  • Deprecations will be enforced in major releases (e.g. 1.0.0, 2.0.0, 3.0.0, ...)
  • API-breaking changes will be made only in major releases (except for experimental features)

See :ref:`policies.version` for more.

{{ header }}

Enhancements

Using Numba in rolling.apply and expanding.apply

We've added an engine keyword to :meth:`~core.window.rolling.Rolling.apply` and :meth:`~core.window.expanding.Expanding.apply` that allows the user to execute the routine using Numba instead of Cython. Using the Numba engine can yield significant performance gains if the apply function can operate on numpy arrays and the data set is large (1 million rows or more). For more details, see the :ref:`rolling apply documentation <window.numba_engine>` (:issue:`28987`, :issue:`30936`).
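
As a rough sketch of how the keyword is passed (not a benchmark), the example below applies a function to each window as a raw NumPy array. ``value_range`` is a hypothetical helper, and the fallback to the default ``"cython"`` engine when Numba is not installed is an assumption made so the snippet runs either way.

```python
import numpy as np
import pandas as pd

try:
    import numba  # noqa: F401  (optional dependency)
    engine = "numba"
except ImportError:
    engine = "cython"  # the default engine


def value_range(window):
    """Spread of one window; receives a NumPy array because raw=True."""
    return window.max() - window.min()


s = pd.Series(np.arange(10, dtype="float64"))

# raw=True hands each window to the function as a NumPy array, which the
# Numba engine requires.
result = s.rolling(3).apply(value_range, raw=True, engine=engine)
print(result)
```

On a series this small the compilation cost of Numba outweighs any gain; the speed-up is only expected on large inputs.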

Defining custom windows for rolling operations

We've added a :class:`pandas.api.indexers.BaseIndexer` class that allows users to define how window bounds are created during rolling operations. Users can define their own get_window_bounds method on a :class:`pandas.api.indexers.BaseIndexer` subclass that will generate the start and end indices used for each window during the rolling aggregation. For more details and example usage, see the :ref:`custom window rolling documentation <window.custom_rolling_window>`.
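
A minimal sketch of such a subclass follows. ``ForwardWindowIndexer`` is a hypothetical name, and the method signature is an assumption: it is written for later pandas releases, which also pass a ``step`` argument, whereas in 1.0 itself the method takes only the first four parameters.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class ForwardWindowIndexer(BaseIndexer):
    """Hypothetical indexer: each window covers the current row and the
    next ``window_size - 1`` rows."""

    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        # ``step`` is accepted because later pandas versions pass it;
        # this sketch ignores it.
        start = np.arange(num_values, dtype=np.int64)
        # Clip the end of each window at the end of the data.
        end = np.minimum(start + self.window_size, num_values).astype(np.int64)
        return start, end


s = pd.Series([0, 1, 2, 3, 4])
indexer = ForwardWindowIndexer(window_size=2)

# min_periods=1 keeps the shorter final window instead of returning NaN.
result = s.rolling(indexer, min_periods=1).sum()
print(result)
```

The returned arrays give, for every row, the half-open range ``[start, end)`` of positions that the aggregation sees.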

Converting to markdown

We've added :meth:`~DataFrame.to_markdown` for creating a markdown table (:issue:`11052`)

.. ipython:: python

   df = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=['a', 'a', 'b'])
   print(df.to_markdown())

Experimental new features

Experimental NA scalar to denote missing values

A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a "missing" indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type (:issue:`28095`).

Warning

Experimental: the behaviour of pd.NA can still change without warning.

For example, creating a Series using the nullable integer dtype:

.. ipython:: python

    s = pd.Series([1, 2, None], dtype="Int64")
    s
    s[2]

Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as "missing" or "unknown" in comparison operations:

.. ipython:: python

    np.nan > 1
    pd.NA > 1

For logical operations, pd.NA follows the rules of the three-valued logic (or Kleene logic). For example:

.. ipython:: python

    pd.NA | True

For more, see :ref:`NA section <missing_data.NA>` in the user guide on missing data.
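
The three-valued logic can be summarised with a few scalar cases. This is a small sketch; the variable names are only for illustration.

```python
import pandas as pd

# "or": the result is known whenever the other operand is True.
or_true = pd.NA | True      # True
or_false = pd.NA | False    # <NA>: depends on the unknown value

# "and": the result is known whenever the other operand is False.
and_false = pd.NA & False   # False
and_true = pd.NA & True     # <NA>

# Comparisons with an unknown value are themselves unknown.
compare = pd.NA == 1        # <NA>

print(or_true, or_false, and_false, and_true, compare)
```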

Dedicated string data type

We've added :class:`StringDtype`, an extension type dedicated to string data. Previously, strings were typically stored in object-dtype NumPy arrays. (:issue:`29975`)

Warning

StringDtype is currently considered experimental. The implementation and parts of the API may change without warning.

The 'string' extension type solves several issues with object-dtype NumPy arrays:

  1. You can accidentally store a mixture of strings and non-strings in an object dtype array. A StringArray can only store strings.
  2. object dtype breaks dtype-specific operations like :meth:`DataFrame.select_dtypes`. There isn't a clear way to select just text while excluding columns that are object-dtype but not text.
  3. When reading code, the contents of an object dtype array are less clear than those of a 'string' array.

.. ipython:: python

   pd.Series(['abc', None, 'def'], dtype=pd.StringDtype())

You can use the alias "string" as well.

.. ipython:: python

   s = pd.Series(['abc', None, 'def'], dtype="string")
   s

The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.

.. ipython:: python

   s.str.upper()
   s.str.split('b', expand=True).dtypes

String accessor methods returning integers will return a value with :class:`Int64Dtype`.

.. ipython:: python

   s.str.count("a")

We recommend explicitly using the string data type when working with strings. See :ref:`text.types` for more.
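
To illustrate the point about dtype-specific operations, the sketch below selects only the text column of a small frame by its dtype. The frame and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    # Dedicated string dtype: contains text (and missing values) only.
    "text": pd.Series(["abc", None, "def"], dtype="string"),
    # object dtype: nothing stops a mixture of strings and non-strings.
    "mixed": pd.Series(["abc", 1, None], dtype=object),
    "number": [1, 2, 3],
})

# Selecting on the "string" alias picks up the text column and leaves
# the mixed object-dtype column out.
text_only = df.select_dtypes(include="string")
print(text_only.columns.tolist())
```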

Boolean data type with missing values support

We've added :class:`BooleanDtype` / :class:`~arrays.BooleanArray`, an extension type dedicated to boolean data that can hold missing values. With the default bool data type, which is based on a bool-dtype NumPy array, a column can only hold True or False, and not missing values. The new :class:`~arrays.BooleanArray` can store missing values as well by keeping track of them in a separate mask. (:issue:`29555`, :issue:`30095`, :issue:`31131`)

.. ipython:: python

   pd.Series([True, False, None], dtype=pd.BooleanDtype())

You can use the alias "boolean" as well.

.. ipython:: python

   s = pd.Series([True, False, None], dtype="boolean")
   s
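
A short sketch of how the missing value behaves in element-wise logic and in a reduction follows; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([True, False, None], dtype="boolean")

missing = s.isna()   # where the mask marks a missing value
either = s | True    # known everywhere: anything "or" True is True
both = s & True      # stays missing where s is missing
n_true = s.sum()     # missing values are skipped by default

print(either.tolist(), both.tolist(), n_true)
```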

Method convert_dtypes to ease use of supported extension dtypes

In order to encourage use of the extension dtypes StringDtype, BooleanDtype, Int64Dtype, Int32Dtype, etc., that support pd.NA, the methods :meth:`DataFrame.convert_dtypes` and :meth:`Series.convert_dtypes` have been introduced. (:issue:`29752`, :issue:`30929`)

Example:

.. ipython:: python

   df = pd.DataFrame({'x': ['abc', None, 'def'],
                      'y': [1, 2, np.nan],
                      'z': [True, False, True]})
   df
   df.dtypes

.. ipython:: python

   converted = df.convert_dtypes()
   converted
   converted.dtypes

This is especially useful after reading in data using readers such as :func:`read_csv` and :func:`read_excel`. See :ref:`here <missing_data.NA.conversion>` for a description.
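
As a sketch of that workflow, the example below reads a small in-memory CSV (hypothetical data) and converts the inferred dtypes in one step.

```python
import io

import pandas as pd

# Hypothetical CSV content with one missing name and one missing count.
csv_data = io.StringIO(
    "name,count,flag\n"
    "abc,1,True\n"
    ",2,False\n"
    "def,,True\n"
)

raw = pd.read_csv(csv_data)        # the missing count forces float64
converted = raw.convert_dtypes()   # nullable extension dtypes instead

print(raw.dtypes)
print(converted.dtypes)
```

After conversion the count column holds integers again, with the gap represented by pd.NA rather than by a float NaN.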

Other enhancements

Backwards incompatible API changes

Avoid using names from MultiIndex.levels

As part of a larger refactor to :class:`MultiIndex`, the level names are now stored separately from the levels (:issue:`27242`). We recommend using :attr:`MultiIndex.names` to access the names, and :meth:`Index.set_names` to update the names.

For backwards compatibility, you can still access the names via the levels.

.. ipython:: python

   mi = pd.MultiIndex.from_product([[1, 2], ['a', 'b']], names=['x', 'y'])
   mi.levels[0].name

However, it is no longer possible to update the names of the MultiIndex via the level.

.. ipython:: python
   :okexcept:

   mi.levels[0].name = "new name"
   mi.names

To update, use MultiIndex.set_names, which returns a new MultiIndex.

.. ipython:: python

   mi2 = mi.set_names("new name", level=0)
   mi2.names

:class:`pandas.arrays.IntervalArray` adopts a new __repr__ in accordance with other array classes (:issue:`25022`)

pandas 0.25.x

In [1]: pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])
Out[1]:
IntervalArray([(0, 1], (2, 3]],
              closed='right',
              dtype='interval[int64]')

pandas 1.0.0

.. ipython:: python

   pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])

DataFrame.rename now only accepts one positional argument

:meth:`DataFrame.rename` would previously accept positional arguments that would lead to ambiguous or undefined behavior. From pandas 1.0, only the very first argument, which maps labels to their new names along the default axis, is allowed to be passed by position (:issue:`29136`).

.. ipython:: python
   :suppress:

   df = pd.DataFrame([[1]])

pandas 0.25.x

In [1]: df = pd.DataFrame([[1]])
In [2]: df.rename({0: 1}, {0: 2})
Out[2]:
FutureWarning: ...Use named arguments to resolve ambiguity...
   2
1  1

pandas 1.0.0

In [3]: df.rename({0: 1}, {0: 2})
Traceback (most recent call last):
...
TypeError: rename() takes from 1 to 2 positional arguments but 3 were given

Note that errors will now be raised when conflicting or potentially ambiguous arguments are provided.

pandas 0.25.x

In [4]: df.rename({0: 1}, index={0: 2})
Out[4]:
   0
1  1

In [5]: df.rename(mapper={0: 1}, index={0: 2})
Out[5]:
   0
2  1

pandas 1.0.0

In [6]: df.rename({0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'

In [7]: df.rename(mapper={0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'

You can still change the axis along which the first positional argument is applied by supplying the axis keyword argument.

.. ipython:: python

   df.rename({0: 1})
   df.rename({0: 1}, axis=1)

If you would like to update both the index and column labels, be sure to use the respective keywords.

.. ipython:: python

   df.rename(index={0: 1}, columns={0: 2})

Extended verbose info output for :class:`~pandas.DataFrame`

:meth:`DataFrame.info` now shows line numbers for the columns summary (:issue:`17304`)

pandas 0.25.x

In [1]: df = pd.DataFrame({"int_col": [1, 2, 3],
...                    "text_col": ["a", "b", "c"],
...                    "float_col": [0.0, 0.1, 0.2]})
In [2]: df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
int_col      3 non-null int64
text_col     3 non-null object
float_col    3 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 152.0+ bytes

pandas 1.0.0

.. ipython:: python

   df = pd.DataFrame({"int_col": [1, 2, 3],
                      "text_col": ["a", "b", "c"],
                      "float_col": [0.0, 0.1, 0.2]})
   df.info(verbose=True)

:func:`pandas.array` inference changes

:func:`pandas.array` now infers pandas' new extension types in several cases (:issue:`29791`):

  1. String data (including missing values) now returns a :class:`arrays.StringArray`.
  2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`.
  3. Boolean data (including missing values) now returns the new :class:`arrays.BooleanArray`

pandas 0.25.x

In [1]: pd.array(["a", None])
Out[1]:
<PandasArray>
['a', None]
Length: 2, dtype: object

In [2]: pd.array([1, None])
Out[2]:
<PandasArray>
[1, None]
Length: 2, dtype: object

pandas 1.0.0

.. ipython:: python

   pd.array(["a", None])
   pd.array([1, None])

As a reminder, you can specify the dtype to disable all inference.
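
For instance, passing ``dtype=object`` keeps the values exactly as given. This is a small sketch; the variable names are only for illustration.

```python
import pandas as pd

# Inference picks the nullable integer type and stores None as pd.NA.
inferred = pd.array([1, None])

# An explicit dtype switches inference off: the values stay plain objects.
explicit = pd.array([1, None], dtype=object)

print(inferred.dtype, explicit.dtype)
```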

:class:`arrays.IntegerArray` now uses :attr:`pandas.NA` rather than :attr:`numpy.nan` as its missing value marker (:issue:`29964`).

pandas 0.25.x

In [1]: a = pd.array([1, 2, None], dtype="Int64")
In [2]: a
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [3]: a[2]
Out[3]:
nan

pandas 1.0.0

.. ipython:: python

   a = pd.array([1, 2, None], dtype="Int64")
   a
   a[2]

This has a few API-breaking consequences.

Converting to a NumPy ndarray

When converting to a NumPy array, missing values will be pd.NA, which cannot be converted to a float. So calling np.asarray(integer_array, dtype="float") will now raise.

pandas 0.25.x

In [1]: np.asarray(a, dtype="float")
Out[1]:
array([ 1.,  2., nan])

pandas 1.0.0

.. ipython:: python
   :okexcept:

   np.asarray(a, dtype="float")

Use :meth:`arrays.IntegerArray.to_numpy` with an explicit na_value instead.

.. ipython:: python

   a.to_numpy(dtype="float", na_value=np.nan)

Reductions can return pd.NA

When performing a reduction such as a sum with skipna=False, the result will now be pd.NA instead of np.nan in the presence of missing values (:issue:`30958`).

pandas 0.25.x

In [1]: pd.Series(a).sum(skipna=False)
Out[1]:
nan

pandas 1.0.0

.. ipython:: python

   pd.Series(a).sum(skipna=False)

value_counts returns a nullable integer dtype

:meth:`Series.value_counts` with a nullable integer dtype now returns a nullable integer dtype for the values.

pandas 0.25.x

In [1]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
Out[1]:
dtype('int64')

pandas 1.0.0

.. ipython:: python

   pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype

See :ref:`missing_data.NA` for more on the differences between :attr:`pandas.NA` and :attr:`numpy.nan`.

Comparison operations on a :class:`arrays.IntegerArray` now return a :class:`arrays.BooleanArray` rather than a NumPy array (:issue:`29964`).

pandas 0.25.x

In [1]: a = pd.array([1, 2, None], dtype="Int64")
In [2]: a
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [3]: a > 1
Out[3]:
array([False,  True, False])

pandas 1.0.0

.. ipython:: python

   a = pd.array([1, 2, None], dtype="Int64")
   a > 1

Note that missing values now propagate, rather than always comparing unequal like :attr:`numpy.nan`. See :ref:`missing_data.NA` for more.
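
One practical consequence is that a comparison result may contain missing values when it is used as a filter. The sketch below fills them with False before indexing; doing so explicitly is a defensive choice made for this example, not a requirement stated in these notes.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

mask = s > 1                       # [False, True, <NA>], dtype "boolean"
selected = s[mask.fillna(False)]   # treat "unknown" as "do not select"

print(selected)
```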

By default :meth:`Categorical.min` now returns the minimum instead of np.nan

When :class:`Categorical` contains np.nan, :meth:`Categorical.min` no longer returns np.nan by default (skipna=True) (:issue:`25303`)

pandas 0.25.x

In [1]: pd.Categorical([1, 2, np.nan], ordered=True).min()
Out[1]: nan

pandas 1.0.0

.. ipython:: python

   pd.Categorical([1, 2, np.nan], ordered=True).min()
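
If the previous result is needed, the skipna keyword mentioned above can be set to False. A minimal sketch:

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, np.nan], ordered=True)

smallest = cat.min()                   # missing values skipped by default
with_missing = cat.min(skipna=False)   # any missing value gives NaN

print(smallest, with_missing)
```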


Default dtype of empty :class:`pandas.Series`

Initialising an empty :class:`pandas.Series` without specifying a dtype will raise a DeprecationWarning now (:issue:`17261`). The default dtype will change from float64 to object in future releases so that it is consistent with the behaviour of :class:`DataFrame` and :class:`Index`.

pandas 1.0.0

In [1]: pd.Series()
Out[1]:
DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
Series([], dtype: float64)
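
Passing the dtype explicitly, as the warning message suggests, avoids both the warning and any dependence on the default. A minimal sketch:

```python
import pandas as pd

# State the intended dtype instead of relying on the default.
empty_float = pd.Series(dtype="float64")
empty_object = pd.Series(dtype=object)

print(empty_float.dtype, empty_object.dtype)
```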

Result dtype inference changes for resample operations

The rules for the result dtype in :meth:`DataFrame.resample` aggregations have changed for extension types (:issue:`31359`). Previously, pandas would attempt to convert the result back to the original dtype, falling back to the usual inference rules if that was not possible. Now, pandas will only return a result of the original dtype if the scalar values in the result are instances of the extension dtype's scalar type.

.. ipython:: python

   df = pd.DataFrame({"A": ['a', 'b']}, dtype='category',
                     index=pd.date_range('2000', periods=2))
   df


pandas 0.25.x

In [1]: df.resample("2D").agg(lambda x: 'a').A.dtype
Out[1]:
CategoricalDtype(categories=['a', 'b'], ordered=False)

pandas 1.0.0

.. ipython:: python

   df.resample("2D").agg(lambda x: 'a').A.dtype

This fixes an inconsistency between resample and groupby. This also fixes a potential bug, where the values of the result might change depending on how the results are cast back to the original dtype.

pandas 0.25.x

In [1]: df.resample("2D").agg(lambda x: 'c')
Out[1]:

     A
0  NaN

pandas 1.0.0

.. ipython:: python

   df.resample("2D").agg(lambda x: 'c')


Increased minimum version for Python

pandas 1.0.0 supports Python 3.6.1 and higher (:issue:`29212`).

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated (:issue:`29766`, :issue:`29723`). If installed, we now require:

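+-----------------+-----------------+----------+---------+
| Package         | Minimum Version | Required | Changed |
+=================+=================+==========+=========+
| numpy           | 1.13.3          |    X     |         |
+-----------------+-----------------+----------+---------+
| pytz            | 2015.4          |    X     |         |
+-----------------+-----------------+----------+---------+
| python-dateutil | 2.6.1           |    X     |         |
+-----------------+-----------------+----------+---------+
| bottleneck      | 1.2.1           |          |         |
+-----------------+-----------------+----------+---------+
| numexpr         | 2.6.2           |          |         |
+-----------------+-----------------+----------+---------+
| pytest (dev)    | 4.0.2           |          |         |
+-----------------+-----------------+----------+---------+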

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

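+-----------------+-----------------+---------+
| Package         | Minimum Version | Changed |
+=================+=================+=========+
| beautifulsoup4  | 4.6.0           |         |
+-----------------+-----------------+---------+
| fastparquet     | 0.3.2           |    X    |
+-----------------+-----------------+---------+
| gcsfs           | 0.2.2           |         |
+-----------------+-----------------+---------+
| lxml            | 3.8.0           |         |
+-----------------+-----------------+---------+
| matplotlib      | 2.2.2           |         |
+-----------------+-----------------+---------+
| numba           | 0.46.0          |    X    |
+-----------------+-----------------+---------+
| openpyxl        | 2.5.7           |    X    |
+-----------------+-----------------+---------+
| pyarrow         | 0.13.0          |    X    |
+-----------------+-----------------+---------+
| pymysql         | 0.7.1           |         |
+-----------------+-----------------+---------+
| pytables        | 3.4.2           |         |
+-----------------+-----------------+---------+
| s3fs            | 0.3.0           |    X    |
+-----------------+-----------------+---------+
| scipy           | 0.19.0          |         |
+-----------------+-----------------+---------+
| sqlalchemy      | 1.1.4           |         |
+-----------------+-----------------+---------+
| xarray          | 0.8.2           |         |
+-----------------+-----------------+---------+
| xlrd            | 1.1.0           |         |
+-----------------+-----------------+---------+
| xlsxwriter      | 0.9.8           |         |
+-----------------+-----------------+---------+
| xlwt            | 1.2.0           |         |
+-----------------+-----------------+---------+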

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Build changes

pandas has added a pyproject.toml file and will no longer include cythonized files in the source distribution uploaded to PyPI (:issue:`28341`, :issue:`20775`). If you're installing a built distribution (wheel) or via conda, this shouldn't have any effect on you. If you're building pandas from source, you should no longer need to install Cython into your build environment before calling pip install pandas.

Other API changes

Documentation improvements

Deprecations

Selecting Columns from a Grouped DataFrame

When selecting columns from a :class:`DataFrameGroupBy` object, passing individual keys (or a tuple of keys) inside single brackets is deprecated; a list of items should be used instead (:issue:`23566`). For example:

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": np.random.randn(8),
    "C": np.random.randn(8),
})
g = df.groupby('A')

# single key, returns SeriesGroupBy
g['B']

# tuple of single key, returns SeriesGroupBy
g[('B',)]

# tuple of multiple keys, returns DataFrameGroupBy, raises FutureWarning
g[('B', 'C')]

# multiple keys passed directly, returns DataFrameGroupBy, raises FutureWarning
# (implicitly converts the passed strings into a single tuple)
g['B', 'C']

# proper way, returns DataFrameGroupBy
g[['B', 'C']]

Removal of prior version deprecations/changes

Removed SparseSeries and SparseDataFrame

SparseSeries, SparseDataFrame and the DataFrame.to_sparse method have been removed (:issue:`28425`). We recommend using a Series or DataFrame with sparse values instead. See :ref:`sparse.migration` for help with migrating existing code.
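
As a brief sketch of the recommended replacement, a regular Series can hold a sparse array and expose sparse-specific functionality through the ``.sparse`` accessor. The data here is hypothetical.

```python
import pandas as pd

dense_values = [0, 0, 1, 2]

# A regular Series whose values are stored sparsely; zeros are the
# fill value and are not stored individually.
sparse_series = pd.Series(pd.arrays.SparseArray(dense_values, fill_value=0))

density = sparse_series.sparse.density        # fraction of stored values
round_trip = sparse_series.sparse.to_dense()  # back to a dense Series

print(sparse_series.dtype, density)
```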

Matplotlib unit registration

Previously, pandas would register converters with matplotlib as a side effect of importing pandas (:issue:`18720`). This changed the output of plots made with matplotlib after pandas was imported, even if you were using matplotlib directly rather than :meth:`~DataFrame.plot`. pandas no longer registers the converters on import.

To use pandas formatters with a matplotlib plot, specify

In [1]: import pandas as pd
In [2]: pd.options.plotting.matplotlib.register_converters = True

Note that plots created by :meth:`DataFrame.plot` and :meth:`Series.plot` do register the converters automatically. The only behavior change is when plotting a date-like object via matplotlib.pyplot.plot or matplotlib.Axes.plot. See :ref:`plotting.formatters` for more.

Other removals

Performance improvements

Bug fixes

Categorical

Datetimelike

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

  • Constructor for :class:`MultiIndex` verifies that the given sortorder is compatible with the actual lexsort_depth if verify_integrity parameter is True (the default) (:issue:`28735`)
  • Series and MultiIndex .drop with a MultiIndex now raise an exception if the labels are not found in the given level (:issue:`8594`)

IO

Plotting

GroupBy/resample/rolling

Reshaping

Sparse

ExtensionArray

Other

Contributors

.. contributors:: v0.25.3..v1.0.0