pandas is distributed under a 3-clause ("Simplified" or "New") BSD
license. Parts of NumPy, SciPy, numpydoc, bottleneck, which all have
BSD-compatible licenses, are included. Their licenses follow the pandas
license.

pandas license
==============

Copyright (c) 2011-2012, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.

Copyright (c) 2008-2011 AQR Capital Management, LLC
All rights reserved.

Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are
-met:
-
-    * Redistributions of source code must retain the above copyright
-       notice, this list of conditions and the following disclaimer.
-
-    * Redistributions in binary form must reproduce the above
-       copyright notice, this list of conditions and the following
-       disclaimer in the documentation and/or other materials provided
-       with the distribution.
-
-    * Neither the name of the copyright holder nor the names of any
-       contributors may be used to endorse or promote products derived
-       from this software without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+* Neither the name of the copyright holder nor the names of its
+  contributors may be used to endorse or promote products derived from
+  this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-About the Copyright Holders
-===========================
-
-AQR Capital Management began pandas development in 2008. Development was
-led by Wes McKinney. AQR released the source under this license in 2009.
-Wes is now an employee of Lambda Foundry, and remains the pandas project
-lead.
-
-The PyData Development Team is the collection of developers of the PyData
-project. This includes all of the PyData sub-projects, including pandas. The
-core team that coordinates development on GitHub can be found here:
-http://github.com/pydata.
-
-Full credits for pandas contributors can be found in the documentation.
-
-Our Copyright Policy
-====================
-
-PyData uses a shared copyright model. Each contributor maintains copyright
-over their contributions to PyData. However, it is important to note that
-these contributions are typically only changes to the repositories. Thus,
-the PyData source code, in its entirety, is not the copyright of any single
-person or institution. Instead, it is the collective copyright of the
-entire PyData Development Team.

[README badge table; the HTML markup was lost in extraction. Labels that survive: Latest Release, Package Status, License, Build Status, Conda, Conda-forge, Downloads, PyPI, Gitter. Badge alt texts that survive: latest release, status, license, coverage, conda-forge downloads. Badges removed by this change: circleci build status, appveyor build status, conda default downloads, pypi downloads. Badge added by this change: Azure Pipelines build status.]
-[![https://gitter.im/pydata/pandas](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/pydata/pandas?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

-## What is it
+
+
+## What is it?

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
@@ -86,7 +89,7 @@ easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
-its way toward this goal.
+its way towards this goal.

## Main Features
Here are just a few of the things that pandas does well:
@@ -123,31 +126,31 @@ Here are just a few of the things that pandas does well:
    moving window linear regressions, date shifting and lagging, etc.


-   [missing-data]: http://pandas.pydata.org/pandas-docs/stable/missing_data.html#working-with-missing-data
-   [insertion-deletion]: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion
-   [alignment]: http://pandas.pydata.org/pandas-docs/stable/dsintro.html?highlight=alignment#intro-to-data-structures
-   [groupby]: http://pandas.pydata.org/pandas-docs/stable/groupby.html#group-by-split-apply-combine
-   [conversion]: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
-   [slicing]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges
-   [fancy-indexing]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#advanced-indexing-with-ix
-   [subsetting]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
-   [merging]: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
-   [joining]: http://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index
-   [reshape]: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-and-pivot-tables
-   [pivot-table]: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables-and-cross-tabulations
-   [mi]: http://pandas.pydata.org/pandas-docs/stable/indexing.html#hierarchical-indexing-multiindex
-   [flat-files]: http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
-   [excel]: http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files
-   [db]: http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries
-   [hdfstore]: http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables
-   [timeseries]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-series-date-functionality
+   [missing-data]: https://pandas.pydata.org/pandas-docs/stable/missing_data.html#working-with-missing-data
+   [insertion-deletion]: https://pandas.pydata.org/pandas-docs/stable/dsintro.html#column-selection-addition-deletion
+   [alignment]: https://pandas.pydata.org/pandas-docs/stable/dsintro.html?highlight=alignment#intro-to-data-structures
+   [groupby]: https://pandas.pydata.org/pandas-docs/stable/groupby.html#group-by-split-apply-combine
+   [conversion]: https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
+   [slicing]: https://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges
+   [fancy-indexing]: https://pandas.pydata.org/pandas-docs/stable/indexing.html#advanced-indexing-with-ix
+   [subsetting]: https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
+   [merging]: https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
+   [joining]: https://pandas.pydata.org/pandas-docs/stable/merging.html#joining-on-index
+   [reshape]: https://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-and-pivot-tables
+   [pivot-table]: https://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables-and-cross-tabulations
+   [mi]: https://pandas.pydata.org/pandas-docs/stable/indexing.html#hierarchical-indexing-multiindex
+   [flat-files]: https://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
+   [excel]: https://pandas.pydata.org/pandas-docs/stable/io.html#excel-files
+   [db]: https://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries
+   [hdfstore]: https://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables
+   [timeseries]: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-series-date-functionality

## Where to get it
The source code is currently hosted on GitHub at:
-http://github.com/pandas-dev/pandas
+https://github.com/pandas-dev/pandas

Binary installers for the latest released version are available at the [Python
-package index](http://pypi.python.org/pypi/pandas/) and on conda.
+package index](https://pypi.org/project/pandas) and on conda.

```sh
# conda
@@ -160,12 +163,11 @@ pip install pandas
```

## Dependencies
-- [NumPy](http://www.numpy.org): 1.7.0 or higher
-- [python-dateutil](http://labix.org/python-dateutil): 1.5 or higher
-- [pytz](http://pytz.sourceforge.net)
-    - Needed for time zone support with ``pandas.date_range``
+- [NumPy](https://www.numpy.org): 1.12.0 or higher
+- [python-dateutil](https://labix.org/python-dateutil): 2.5.0 or higher
+- [pytz](https://pythonhosted.org/pytz): 2011k or higher

-See the [full installation instructions](http://pandas.pydata.org/pandas-docs/stable/install.html#dependencies)
+See the [full installation instructions](https://pandas.pydata.org/pandas-docs/stable/install.html#dependencies)
for recommended and optional dependencies.

## Installation from sources
@@ -197,32 +199,36 @@ mode](https://pip.pypa.io/en/latest/reference/pip_install.html#editable-installs
pip install -e .
```

-On Windows, you will need to install MinGW and execute:
-
-```sh
-python setup.py build --compiler=mingw32
-python setup.py install
-```
-
-See http://pandas.pydata.org/ for more information.
+See the full instructions for [installing from source](https://pandas.pydata.org/pandas-docs/stable/install.html#installing-from-source).

## License
-BSD
+[BSD 3](LICENSE)

## Documentation
-The official documentation is hosted on PyData.org: http://pandas.pydata.org/
-
-The Sphinx documentation should provide a good starting point for learning how
-to use the library. Expect the docs to continue to expand as time goes on.
+The official documentation is hosted on PyData.org: https://pandas.pydata.org/pandas-docs/stable

## Background
Work on ``pandas`` started at AQR (a quantitative hedge fund) in 2008 and
has been under active development since then.

+## Getting Help
+
+For usage questions, the best place to go to is [StackOverflow](https://stackoverflow.com/questions/tagged/pandas).
+Further, general questions and discussions can also take place on the [pydata mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata).
+
## Discussion and Development
-Since pandas development is related to a number of other scientific
-Python projects, questions are welcome on the scipy-user mailing
-list. Specialized discussions or design issues should take place on
-the PyData mailing list / Google group:
+Most development discussion is taking place on github in this repo. Further, the [pandas-dev mailing list](https://mail.python.org/mailman/listinfo/pandas-dev) can also be used for specialized discussions or design issues, and a [Gitter channel](https://gitter.im/pydata/pandas) is available for quick development related questions.
+
+## Contributing to pandas [![Open Source Helpers](https://www.codetriage.com/pandas-dev/pandas/badges/users.svg)](https://www.codetriage.com/pandas-dev/pandas)
+
+All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
+
+A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas-docs.github.io/pandas-docs-travis/contributing.html)**. There is also an [overview](.github/CONTRIBUTING.md) on GitHub.
+
+If you are simply looking to start working with the pandas codebase, navigate to the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?labels=Docs&sort=updated&state=open) and [good first issue](https://github.com/pandas-dev/pandas/issues?labels=good+first+issue&sort=updated&state=open) where you could start out.
+
+You can also triage issues which may include reproducing bug reports, or asking for vital information such as version numbers or reproduction instructions. If you would like to start triaging issues, one easy way to get started is to [subscribe to pandas on CodeTriage](https://www.codetriage.com/pandas-dev/pandas).
+
+Or maybe through using pandas you have an idea of your own or are looking for something in the documentation and thinking ‘this can be improved’...you can do something about it!

-https://groups.google.com/forum/#!forum/pydata
+Feel free to ask questions on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas).
diff --git a/appveyor.yml b/appveyor.yml
deleted file mode 100644
index db729b3005be6..0000000000000
--- a/appveyor.yml
+++ /dev/null
@@ -1,89 +0,0 @@
-# With infos from
-# http://tjelvarolsson.com/blog/how-to-continuously-test-your-python-code-on-windows-using-appveyor/
-# https://packaging.python.org/en/latest/appveyor/
-# https://github.com/rmcgibbo/python-appveyor-conda-example
-
-# Backslashes in quotes need to be escaped: \ -> "\\"
-
-matrix:
-  fast_finish: true     # immediately finish build once one of the jobs fails.
- -environment: - global: - # SDK v7.0 MSVC Express 2008's SetEnv.cmd script will fail if the - # /E:ON and /V:ON options are not enabled in the batch script intepreter - # See: http://stackoverflow.com/a/13751649/163740 - CMD_IN_ENV: "cmd /E:ON /V:ON /C .\\ci\\run_with_env.cmd" - clone_folder: C:\projects\pandas - - matrix: - - - CONDA_ROOT: "C:\\Miniconda3_64" - PYTHON_VERSION: "3.6" - PYTHON_ARCH: "64" - CONDA_PY: "36" - CONDA_NPY: "112" - - - CONDA_ROOT: "C:\\Miniconda3_64" - PYTHON_VERSION: "2.7" - PYTHON_ARCH: "64" - CONDA_PY: "27" - CONDA_NPY: "110" - -# We always use a 64-bit machine, but can build x86 distributions -# with the PYTHON_ARCH variable (which is used by CMD_IN_ENV). -platform: - - x64 - -# all our python builds have to happen in tests_script... -build: false - -install: - # cancel older builds for the same PR - - ps: if ($env:APPVEYOR_PULL_REQUEST_NUMBER -and $env:APPVEYOR_BUILD_NUMBER -ne ((Invoke-RestMethod ` - https://ci.appveyor.com/api/projects/$env:APPVEYOR_ACCOUNT_NAME/$env:APPVEYOR_PROJECT_SLUG/history?recordsNumber=50).builds | ` - Where-Object pullRequestId -eq $env:APPVEYOR_PULL_REQUEST_NUMBER)[0].buildNumber) { ` - throw "There are newer queued builds for this pull request, failing early." 
} - - # this installs the appropriate Miniconda (Py2/Py3, 32/64 bit) - # updates conda & installs: conda-build jinja2 anaconda-client - - powershell .\ci\install.ps1 - - SET PATH=%CONDA_ROOT%;%CONDA_ROOT%\Scripts;%PATH% - - echo "install" - - cd - - ls -ltr - - git tag --sort v:refname - - # this can conflict with git - - cmd: rmdir C:\cygwin /s /q - - # install our build environment - - cmd: conda config --set show_channel_urls true --set always_yes true --set changeps1 false - - cmd: conda update -q conda - - cmd: conda config --set ssl_verify false - - # add the pandas channel *before* defaults to have defaults take priority - - cmd: conda config --add channels conda-forge - - cmd: conda config --add channels pandas - - cmd: conda config --remove channels defaults - - cmd: conda config --add channels defaults - - # this is now the downloaded conda... - - cmd: conda info -a - - # create our env - - cmd: conda create -n pandas python=%PYTHON_VERSION% cython pytest - - cmd: activate pandas - - SET REQ=ci\requirements-%PYTHON_VERSION%_WIN.run - - cmd: echo "installing requirements from %REQ%" - - cmd: conda install -n pandas --file=%REQ% - - cmd: conda list -n pandas - - cmd: echo "installing requirements from %REQ% - done" - - # build em using the local source checkout in the correct windows env - - cmd: '%CMD_IN_ENV% python setup.py build_ext --inplace' - -test_script: - # tests - - cmd: activate pandas - - cmd: test.bat diff --git a/asv_bench/asv.conf.json b/asv_bench/asv.conf.json index 4fc6f9f634426..fa098e2455683 100644 --- a/asv_bench/asv.conf.json +++ b/asv_bench/asv.conf.json @@ -26,7 +26,7 @@ // The Pythons you'd like to test against. If not provided, defaults // to the current version of Python used to run `asv`. // "pythons": ["2.7", "3.4"], - "pythons": ["2.7"], + "pythons": ["3.6"], // The matrix of dependencies to test. Each key is the name of a // package (in PyPI) and the values are version numbers. 
An empty @@ -46,12 +46,14 @@ "numexpr": [], "pytables": [null, ""], // platform dependent, see excludes below "tables": [null, ""], - "libpython": [null, ""], "openpyxl": [], "xlsxwriter": [], "xlrd": [], "xlwt": [], "pytest": [], + // If using Windows with python 2.7 and want to build using the + // mingw toolchain (rather than MSVC), uncomment the following line. + // "libpython": [], }, // Combinations of libraries/python versions can be excluded/included @@ -80,10 +82,6 @@ {"environment_type": "conda", "pytables": null}, {"environment_type": "(?!conda).*", "tables": null}, {"environment_type": "(?!conda).*", "pytables": ""}, - // On conda&win32, install libpython - {"sys_platform": "(?!win32).*", "libpython": ""}, - {"environment_type": "conda", "sys_platform": "win32", "libpython": null}, - {"environment_type": "(?!conda).*", "libpython": ""} ], "include": [], @@ -119,8 +117,9 @@ // with results. If the commit is `null`, regression detection is // skipped for the matching benchmark. 
// - // "regressions_first_commits": { - // "some_benchmark": "352cdf", // Consider regressions only after this commit - // "another_benchmark": null, // Skip regression detection altogether - // } + "regressions_first_commits": { + ".*": "0409521665" + }, + "regression_thresholds": { + } } diff --git a/asv_bench/benchmarks/__init__.py b/asv_bench/benchmarks/__init__.py index e69de29bb2d1d..eada147852fe1 100644 --- a/asv_bench/benchmarks/__init__.py +++ b/asv_bench/benchmarks/__init__.py @@ -0,0 +1 @@ +"""Pandas benchmarks.""" diff --git a/asv_bench/benchmarks/algorithms.py b/asv_bench/benchmarks/algorithms.py index fe657936c403e..74849d330f2bc 100644 --- a/asv_bench/benchmarks/algorithms.py +++ b/asv_bench/benchmarks/algorithms.py @@ -1,115 +1,144 @@ +from importlib import import_module + import numpy as np + import pandas as pd from pandas.util import testing as tm +for imp in ['pandas.util', 'pandas.tools.hashing']: + try: + hashing = import_module(imp) + break + except (ImportError, TypeError, ValueError): + pass + + +class Factorize(object): + + params = [[True, False], ['int', 'uint', 'float', 'string']] + param_names = ['sort', 'dtype'] + + def setup(self, sort, dtype): + N = 10**5 + data = {'int': pd.Int64Index(np.arange(N).repeat(5)), + 'uint': pd.UInt64Index(np.arange(N).repeat(5)), + 'float': pd.Float64Index(np.random.randn(N).repeat(5)), + 'string': tm.makeStringIndex(N).repeat(5)} + self.idx = data[dtype] + + def time_factorize(self, sort, dtype): + self.idx.factorize(sort=sort) + -class Algorithms(object): - goal_time = 0.2 +class FactorizeUnique(object): - def setup(self): - N = 100000 - np.random.seed(1234) + params = [[True, False], ['int', 'uint', 'float', 'string']] + param_names = ['sort', 'dtype'] - self.int_unique = pd.Int64Index(np.arange(N * 5)) + def setup(self, sort, dtype): + N = 10**5 + data = {'int': pd.Int64Index(np.arange(N)), + 'uint': pd.UInt64Index(np.arange(N)), + 'float': pd.Float64Index(np.arange(N)), + 'string': 
tm.makeStringIndex(N)} + self.idx = data[dtype] + assert self.idx.is_unique + + def time_factorize(self, sort, dtype): + self.idx.factorize(sort=sort) + + +class Duplicated(object): + + params = [['first', 'last', False], ['int', 'uint', 'float', 'string']] + param_names = ['keep', 'dtype'] + + def setup(self, keep, dtype): + N = 10**5 + data = {'int': pd.Int64Index(np.arange(N).repeat(5)), + 'uint': pd.UInt64Index(np.arange(N).repeat(5)), + 'float': pd.Float64Index(np.random.randn(N).repeat(5)), + 'string': tm.makeStringIndex(N).repeat(5)} + self.idx = data[dtype] # cache is_unique - self.int_unique.is_unique + self.idx.is_unique - self.int = pd.Int64Index(np.arange(N).repeat(5)) - self.float = pd.Float64Index(np.random.randn(N).repeat(5)) + def time_duplicated(self, keep, dtype): + self.idx.duplicated(keep=keep) - # Convenience naming. - self.checked_add = pd.core.algorithms.checked_add_with_arr - self.arr = np.arange(1000000) - self.arrpos = np.arange(1000000) - self.arrneg = np.arange(-1000000, 0) - self.arrmixed = np.array([1, -1]).repeat(500000) - self.strings = tm.makeStringIndex(100000) +class DuplicatedUniqueIndex(object): - self.arr_nan = np.random.choice([True, False], size=1000000) - self.arrmixed_nan = np.random.choice([True, False], size=1000000) + params = ['int', 'uint', 'float', 'string'] + param_names = ['dtype'] - # match - self.uniques = tm.makeStringIndex(1000).values - self.all = self.uniques.repeat(10) + def setup(self, dtype): + N = 10**5 + data = {'int': pd.Int64Index(np.arange(N)), + 'uint': pd.UInt64Index(np.arange(N)), + 'float': pd.Float64Index(np.random.randn(N)), + 'string': tm.makeStringIndex(N)} + self.idx = data[dtype] + # cache is_unique + self.idx.is_unique - def time_factorize_string(self): - self.strings.factorize() + def time_duplicated_unique(self, dtype): + self.idx.duplicated() - def time_factorize_int(self): - self.int.factorize() - def time_factorize_float(self): - self.int.factorize() +class Hashing(object): - def 
time_duplicated_int_unique(self): - self.int_unique.duplicated() + def setup_cache(self): + N = 10**5 - def time_duplicated_int(self): - self.int.duplicated() + df = pd.DataFrame( + {'strings': pd.Series(tm.makeStringIndex(10000).take( + np.random.randint(0, 10000, size=N))), + 'floats': np.random.randn(N), + 'ints': np.arange(N), + 'dates': pd.date_range('20110101', freq='s', periods=N), + 'timedeltas': pd.timedelta_range('1 day', freq='s', periods=N)}) + df['categories'] = df['strings'].astype('category') + df.iloc[10:20] = np.nan + return df - def time_duplicated_float(self): - self.float.duplicated() + def time_frame(self, df): + hashing.hash_pandas_object(df) - def time_match_strings(self): - pd.match(self.all, self.uniques) + def time_series_int(self, df): + hashing.hash_pandas_object(df['ints']) - def time_add_overflow_pos_scalar(self): - self.checked_add(self.arr, 1) + def time_series_string(self, df): + hashing.hash_pandas_object(df['strings']) - def time_add_overflow_neg_scalar(self): - self.checked_add(self.arr, -1) + def time_series_float(self, df): + hashing.hash_pandas_object(df['floats']) - def time_add_overflow_zero_scalar(self): - self.checked_add(self.arr, 0) + def time_series_categorical(self, df): + hashing.hash_pandas_object(df['categories']) - def time_add_overflow_pos_arr(self): - self.checked_add(self.arr, self.arrpos) + def time_series_timedeltas(self, df): + hashing.hash_pandas_object(df['timedeltas']) - def time_add_overflow_neg_arr(self): - self.checked_add(self.arr, self.arrneg) + def time_series_dates(self, df): + hashing.hash_pandas_object(df['dates']) - def time_add_overflow_mixed_arr(self): - self.checked_add(self.arr, self.arrmixed) - def time_add_overflow_first_arg_nan(self): - self.checked_add(self.arr, self.arrmixed, arr_mask=self.arr_nan) +class Quantile(object): + params = [[0, 0.5, 1], + ['linear', 'nearest', 'lower', 'higher', 'midpoint'], + ['float', 'int', 'uint']] + param_names = ['quantile', 'interpolation', 'dtype'] - 
def time_add_overflow_second_arg_nan(self): - self.checked_add(self.arr, self.arrmixed, b_mask=self.arrmixed_nan) + def setup(self, quantile, interpolation, dtype): + N = 10**5 + data = {'int': np.arange(N), + 'uint': np.arange(N).astype(np.uint64), + 'float': np.random.randn(N)} + self.idx = pd.Series(data[dtype].repeat(5)) - def time_add_overflow_both_arg_nan(self): - self.checked_add(self.arr, self.arrmixed, arr_mask=self.arr_nan, - b_mask=self.arrmixed_nan) + def time_quantile(self, quantile, interpolation, dtype): + self.idx.quantile(quantile, interpolation=interpolation) -class Hashing(object): - goal_time = 0.2 - - def setup(self): - N = 100000 - - self.df = pd.DataFrame( - {'A': pd.Series(tm.makeStringIndex(100).take( - np.random.randint(0, 100, size=N))), - 'B': pd.Series(tm.makeStringIndex(10000).take( - np.random.randint(0, 10000, size=N))), - 'D': np.random.randn(N), - 'E': np.arange(N), - 'F': pd.date_range('20110101', freq='s', periods=N), - 'G': pd.timedelta_range('1 day', freq='s', periods=N), - }) - self.df['C'] = self.df['B'].astype('category') - self.df.iloc[10:20] = np.nan - - def time_frame(self): - self.df.hash() - - def time_series_int(self): - self.df.E.hash() - - def time_series_string(self): - self.df.B.hash() - - def time_series_categorical(self): - self.df.C.hash() +from .pandas_vb_common import setup # noqa: F401 isort:skip diff --git a/asv_bench/benchmarks/attrs_caching.py b/asv_bench/benchmarks/attrs_caching.py index 9210f1f2878d4..d061755208c9e 100644 --- a/asv_bench/benchmarks/attrs_caching.py +++ b/asv_bench/benchmarks/attrs_caching.py @@ -1,9 +1,12 @@ -from .pandas_vb_common import * -from pandas.util.decorators import cache_readonly +import numpy as np +from pandas import DataFrame +try: + from pandas.util import cache_readonly +except ImportError: + from pandas.util.decorators import cache_readonly class DataFrameAttributes(object): - goal_time = 0.2 def setup(self): self.df = DataFrame(np.random.randn(10, 6)) @@ -17,7 +20,6 @@ 
def time_set_index(self): class CacheReadonly(object): - goal_time = 0.2 def setup(self): @@ -30,3 +32,6 @@ def prop(self): def time_cache_readonly(self): self.obj.prop + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/binary_ops.py b/asv_bench/benchmarks/binary_ops.py index 72700c3de282e..22b8ed80f3d07 100644 --- a/asv_bench/benchmarks/binary_ops.py +++ b/asv_bench/benchmarks/binary_ops.py @@ -1,9 +1,13 @@ -from .pandas_vb_common import * -import pandas.computation.expressions as expr +import numpy as np +from pandas import DataFrame, Series, date_range +from pandas.core.algorithms import checked_add_with_arr +try: + import pandas.core.computation.expressions as expr +except ImportError: + import pandas.computation.expressions as expr class Ops(object): - goal_time = 0.2 params = [[True, False], ['default', 1]] param_names = ['use_numexpr', 'threads'] @@ -17,18 +21,17 @@ def setup(self, use_numexpr, threads): if not use_numexpr: expr.set_use_numexpr(False) - def time_frame_add(self, use_numexpr, threads): - (self.df + self.df2) + self.df + self.df2 def time_frame_mult(self, use_numexpr, threads): - (self.df * self.df2) + self.df * self.df2 def time_frame_multi_and(self, use_numexpr, threads): - self.df[((self.df > 0) & (self.df2 > 0))] + self.df[(self.df > 0) & (self.df2 > 0)] def time_frame_comparison(self, use_numexpr, threads): - (self.df > self.df2) + self.df > self.df2 def teardown(self, use_numexpr, threads): expr.set_use_numexpr(True) @@ -36,75 +39,117 @@ def teardown(self, use_numexpr, threads): class Ops2(object): - goal_time = 0.2 def setup(self): - self.df = DataFrame(np.random.randn(1000, 1000)) - self.df2 = DataFrame(np.random.randn(1000, 1000)) + N = 10**3 + self.df = DataFrame(np.random.randn(N, N)) + self.df2 = DataFrame(np.random.randn(N, N)) + + self.df_int = DataFrame(np.random.randint(np.iinfo(np.int16).min, + np.iinfo(np.int16).max, + size=(N, N))) + self.df2_int = 
DataFrame(np.random.randint(np.iinfo(np.int16).min, + np.iinfo(np.int16).max, + size=(N, N))) - self.df_int = DataFrame( - np.random.random_integers(np.iinfo(np.int16).min, - np.iinfo(np.int16).max, - size=(1000, 1000))) - self.df2_int = DataFrame( - np.random.random_integers(np.iinfo(np.int16).min, - np.iinfo(np.int16).max, - size=(1000, 1000))) + self.s = Series(np.random.randn(N)) - ## Division + # Division def time_frame_float_div(self): - (self.df // self.df2) + self.df // self.df2 def time_frame_float_div_by_zero(self): - (self.df / 0) + self.df / 0 def time_frame_float_floor_by_zero(self): - (self.df // 0) + self.df // 0 def time_frame_int_div_by_zero(self): - (self.df_int / 0) + self.df_int / 0 - ## Modulo + # Modulo def time_frame_int_mod(self): - (self.df / self.df2) + self.df_int % self.df2_int def time_frame_float_mod(self): - (self.df / self.df2) + self.df % self.df2 + + # Dot product + + def time_frame_dot(self): + self.df.dot(self.df2) + + def time_series_dot(self): + self.s.dot(self.s) + + def time_frame_series_dot(self): + self.df.dot(self.s) class Timeseries(object): - goal_time = 0.2 - def setup(self): - self.N = 1000000 - self.halfway = ((self.N // 2) - 1) - self.s = Series(date_range('20010101', periods=self.N, freq='T')) - self.ts = self.s[self.halfway] + params = [None, 'US/Eastern'] + param_names = ['tz'] - self.s2 = Series(date_range('20010101', periods=self.N, freq='s')) + def setup(self, tz): + N = 10**6 + halfway = (N // 2) - 1 + self.s = Series(date_range('20010101', periods=N, freq='T', tz=tz)) + self.ts = self.s[halfway] - def time_series_timestamp_compare(self): - (self.s <= self.ts) + self.s2 = Series(date_range('20010101', periods=N, freq='s', tz=tz)) - def time_timestamp_series_compare(self): - (self.ts >= self.s) + def time_series_timestamp_compare(self, tz): + self.s <= self.ts - def time_timestamp_ops_diff1(self): + def time_timestamp_series_compare(self, tz): + self.ts >= self.s + + def time_timestamp_ops_diff(self, tz): 
self.s2.diff() - def time_timestamp_ops_diff2(self): - (self.s - self.s.shift()) + def time_timestamp_ops_diff_with_shift(self, tz): + self.s - self.s.shift() + +class AddOverflowScalar(object): + params = [1, -1, 0] + param_names = ['scalar'] -class TimeseriesTZ(Timeseries): + def setup(self, scalar): + N = 10**6 + self.arr = np.arange(N) + + def time_add_overflow_scalar(self, scalar): + checked_add_with_arr(self.arr, scalar) + + +class AddOverflowArray(object): def setup(self): - self.N = 1000000 - self.halfway = ((self.N // 2) - 1) - self.s = Series(date_range('20010101', periods=self.N, freq='T', tz='US/Eastern')) - self.ts = self.s[self.halfway] + N = 10**6 + self.arr = np.arange(N) + self.arr_rev = np.arange(-N, 0) + self.arr_mixed = np.array([1, -1]).repeat(N / 2) + self.arr_nan_1 = np.random.choice([True, False], size=N) + self.arr_nan_2 = np.random.choice([True, False], size=N) + + def time_add_overflow_arr_rev(self): + checked_add_with_arr(self.arr, self.arr_rev) + + def time_add_overflow_arr_mask_nan(self): + checked_add_with_arr(self.arr, self.arr_mixed, arr_mask=self.arr_nan_1) + + def time_add_overflow_b_mask_nan(self): + checked_add_with_arr(self.arr, self.arr_mixed, + b_mask=self.arr_nan_1) + + def time_add_overflow_both_arg_nan(self): + checked_add_with_arr(self.arr, self.arr_mixed, arr_mask=self.arr_nan_1, + b_mask=self.arr_nan_2) + - self.s2 = Series(date_range('20010101', periods=self.N, freq='s', tz='US/Eastern')) +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/categoricals.py b/asv_bench/benchmarks/categoricals.py index 153107911ca2c..4b5b2848f7e0f 100644 --- a/asv_bench/benchmarks/categoricals.py +++ b/asv_bench/benchmarks/categoricals.py @@ -1,99 +1,296 @@ -from .pandas_vb_common import * +import warnings + +import numpy as np +import pandas as pd +import pandas.util.testing as tm try: - from pandas.types.concat import union_categoricals + from pandas.api.types import union_categoricals except ImportError: 
-    pass
+    try:
+        from pandas.types.concat import union_categoricals
+    except ImportError:
+        pass
-class Categoricals(object):
-    goal_time = 0.2
+class Concat(object):
     def setup(self):
-        N = 100000
-        self.s = pd.Series((list('aabbcd') * N)).astype('category')
+        N = 10**5
+        self.s = pd.Series(list('aabbcd') * N).astype('category')
+
+        self.a = pd.Categorical(list('aabbcd') * N)
+        self.b = pd.Categorical(list('bbcdjk') * N)
+
+    def time_concat(self):
+        pd.concat([self.s, self.s])
+
+    def time_union(self):
+        union_categoricals([self.a, self.b])
+
+
-        self.a = pd.Categorical((list('aabbcd') * N))
-        self.b = pd.Categorical((list('bbcdjk') * N))
+class Constructor(object):
+    def setup(self):
+        N = 10**5
         self.categories = list('abcde')
-        self.cat_idx = Index(self.categories)
+        self.cat_idx = pd.Index(self.categories)
         self.values = np.tile(self.categories, N)
         self.codes = np.tile(range(len(self.categories)), N)
-        self.datetimes = pd.Series(pd.date_range(
-            '1995-01-01 00:00:00', periods=10000, freq='s'))
+        self.datetimes = pd.Series(pd.date_range('1995-01-01 00:00:00',
+                                                 periods=N / 10,
+                                                 freq='s'))
+        self.datetimes_with_nat = self.datetimes.copy()
+        self.datetimes_with_nat.iloc[-1] = pd.NaT
-    def time_concat(self):
-        concat([self.s, self.s])
+        self.values_some_nan = list(np.tile(self.categories + [np.nan], N))
+        self.values_all_nan = [np.nan] * len(self.values)
+        self.values_all_int8 = np.ones(N, 'int8')
+        self.categorical = pd.Categorical(self.values, self.categories)
+        self.series = pd.Series(self.categorical)
-    def time_union(self):
-        union_categoricals([self.a, self.b])
+    def time_regular(self):
+        pd.Categorical(self.values, self.categories)
-    def time_constructor_regular(self):
-        Categorical(self.values, self.categories)
+    def time_fastpath(self):
+        pd.Categorical(self.codes, self.cat_idx, fastpath=True)
-    def time_constructor_fastpath(self):
-        Categorical(self.codes, self.cat_idx, fastpath=True)
+    def time_datetimes(self):
+        pd.Categorical(self.datetimes)
-    def time_constructor_datetimes(self):
-        Categorical(self.datetimes)
+    def time_datetimes_with_nat(self):
+        pd.Categorical(self.datetimes_with_nat)
-    def time_constructor_datetimes_with_nat(self):
-        t = self.datetimes
-        t.iloc[-1] = pd.NaT
-        Categorical(t)
+    def time_with_nan(self):
+        pd.Categorical(self.values_some_nan)
+    def time_all_nan(self):
+        pd.Categorical(self.values_all_nan)
-class Categoricals2(object):
-    goal_time = 0.2
+    def time_from_codes_all_int8(self):
+        pd.Categorical.from_codes(self.values_all_int8, self.categories)
+
+    def time_existing_categorical(self):
+        pd.Categorical(self.categorical)
+
+    def time_existing_series(self):
+        pd.Categorical(self.series)
-    def setup(self):
-        n = 500000
-        np.random.seed(2718281)
-        arr = ['s%04d' % i for i in np.random.randint(0, n // 10, size=n)]
-        self.ts = Series(arr).astype('category')
-        self.sel = self.ts.loc[[0]]
+class ValueCounts(object):
-    def time_value_counts(self):
-        self.ts.value_counts(dropna=False)
+    params = [True, False]
+    param_names = ['dropna']
-    def time_value_counts_dropna(self):
-        self.ts.value_counts(dropna=True)
+    def setup(self, dropna):
+        n = 5 * 10**5
+        arr = ['s{:04d}'.format(i) for i in np.random.randint(0, n // 10,
+                                                              size=n)]
+        self.ts = pd.Series(arr).astype('category')
+
+    def time_value_counts(self, dropna):
+        self.ts.value_counts(dropna=dropna)
+
+
+class Repr(object):
+
+    def setup(self):
+        self.sel = pd.Series(['s1234']).astype('category')
     def time_rendering(self):
         str(self.sel)
-class Categoricals3(object):
-    goal_time = 0.2
+class SetCategories(object):
+
+    def setup(self):
+        n = 5 * 10**5
+        arr = ['s{:04d}'.format(i) for i in np.random.randint(0, n // 10,
+                                                              size=n)]
+        self.ts = pd.Series(arr).astype('category')
+
+    def time_set_categories(self):
+        self.ts.cat.set_categories(self.ts.cat.categories[::2])
+
+
+class RemoveCategories(object):
+
+    def setup(self):
+        n = 5 * 10**5
+        arr = ['s{:04d}'.format(i) for i in np.random.randint(0, n // 10,
+                                                              size=n)]
+        self.ts = pd.Series(arr).astype('category')
+
+    def time_remove_categories(self):
+        self.ts.cat.remove_categories(self.ts.cat.categories[::2])
+
+
+class Rank(object):
     def setup(self):
-        N = 100000
+        N = 10**5
         ncats = 100
-        self.s1 = Series(np.array(tm.makeCategoricalIndex(N, ncats)))
-        self.s1_cat = self.s1.astype('category')
-        self.s1_cat_ordered = self.s1.astype('category', ordered=True)
+        self.s_str = pd.Series(tm.makeCategoricalIndex(N, ncats)).astype(str)
+        self.s_str_cat = self.s_str.astype('category')
+        with warnings.catch_warnings(record=True):
+            self.s_str_cat_ordered = self.s_str.astype('category',
+                                                       ordered=True)
-        self.s2 = Series(np.random.randint(0, ncats, size=N))
-        self.s2_cat = self.s2.astype('category')
-        self.s2_cat_ordered = self.s2.astype('category', ordered=True)
+        self.s_int = pd.Series(np.random.randint(0, ncats, size=N))
+        self.s_int_cat = self.s_int.astype('category')
+        with warnings.catch_warnings(record=True):
+            self.s_int_cat_ordered = self.s_int.astype('category',
+                                                       ordered=True)
     def time_rank_string(self):
-        self.s1.rank()
+        self.s_str.rank()
     def time_rank_string_cat(self):
-        self.s1_cat.rank()
+        self.s_str_cat.rank()
     def time_rank_string_cat_ordered(self):
-        self.s1_cat_ordered.rank()
+        self.s_str_cat_ordered.rank()
     def time_rank_int(self):
-        self.s2.rank()
+        self.s_int.rank()
     def time_rank_int_cat(self):
-        self.s2_cat.rank()
+        self.s_int_cat.rank()
     def time_rank_int_cat_ordered(self):
-        self.s2_cat_ordered.rank()
+        self.s_int_cat_ordered.rank()
+
+
+class Isin(object):
+
+    params = ['object', 'int64']
+    param_names = ['dtype']
+
+    def setup(self, dtype):
+        np.random.seed(1234)
+        n = 5 * 10**5
+        sample_size = 100
+        arr = [i for i in np.random.randint(0, n // 10, size=n)]
+        if dtype == 'object':
+            arr = ['s{:04d}'.format(i) for i in arr]
+        self.sample = np.random.choice(arr, sample_size)
+        self.series = pd.Series(arr).astype('category')
+
+    def time_isin_categorical(self, dtype):
+        self.series.isin(self.sample)
+
+
+class IsMonotonic(object):
+
+    def setup(self):
+        N = 1000
+        self.c = pd.CategoricalIndex(list('a' * N + 'b' * N + 'c' * N))
+        self.s = pd.Series(self.c)
+
+    def time_categorical_index_is_monotonic_increasing(self):
+        self.c.is_monotonic_increasing
+
+    def time_categorical_index_is_monotonic_decreasing(self):
+        self.c.is_monotonic_decreasing
+
+    def time_categorical_series_is_monotonic_increasing(self):
+        self.s.is_monotonic_increasing
+
+    def time_categorical_series_is_monotonic_decreasing(self):
+        self.s.is_monotonic_decreasing
+
+
+class Contains(object):
+
+    def setup(self):
+        N = 10**5
+        self.ci = tm.makeCategoricalIndex(N)
+        self.c = self.ci.values
+        self.key = self.ci.categories[0]
+
+    def time_categorical_index_contains(self):
+        self.key in self.ci
+
+    def time_categorical_contains(self):
+        self.key in self.c
+
+
+class CategoricalSlicing(object):
+
+    params = ['monotonic_incr', 'monotonic_decr', 'non_monotonic']
+    param_names = ['index']
+
+    def setup(self, index):
+        N = 10**6
+        categories = ['a', 'b', 'c']
+        values = [0] * N + [1] * N + [2] * N
+        if index == 'monotonic_incr':
+            self.data = pd.Categorical.from_codes(values,
+                                                  categories=categories)
+        elif index == 'monotonic_decr':
+            self.data = pd.Categorical.from_codes(list(reversed(values)),
+                                                  categories=categories)
+        elif index == 'non_monotonic':
+            self.data = pd.Categorical.from_codes([0, 1, 2] * N,
+                                                  categories=categories)
+        else:
+            raise ValueError('Invalid index param: {}'.format(index))
+
+        self.scalar = 10000
+        self.list = list(range(10000))
+        self.cat_scalar = 'b'
+
+    def time_getitem_scalar(self, index):
+        self.data[self.scalar]
+
+    def time_getitem_slice(self, index):
+        self.data[:self.scalar]
+
+    def time_getitem_list_like(self, index):
+        self.data[[self.scalar]]
+
+    def time_getitem_list(self, index):
+        self.data[self.list]
+
+    def time_getitem_bool_array(self, index):
+        self.data[self.data == self.cat_scalar]
+
+
+class Indexing(object):
+
+    def setup(self):
+        N = 10**5
+        self.index = pd.CategoricalIndex(range(N), range(N))
+        self.series = pd.Series(range(N), index=self.index).sort_index()
+        self.category = self.index[500]
+
+    def time_get_loc(self):
+        self.index.get_loc(self.category)
+
+    def time_shape(self):
+        self.index.shape
+
+    def time_shallow_copy(self):
+        self.index._shallow_copy()
+
+    def time_align(self):
+        pd.DataFrame({'a': self.series, 'b': self.series[:500]})
+
+    def time_intersection(self):
+        self.index[:750].intersection(self.index[250:])
+
+    def time_unique(self):
+        self.index.unique()
+
+    def time_reindex(self):
+        self.index.reindex(self.index[:500])
+
+    def time_reindex_missing(self):
+        self.index.reindex(['a', 'b', 'c', 'd'])
+
+    def time_sort_values(self):
+        self.index.sort_values(ascending=False)
+
+
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/ctors.py b/asv_bench/benchmarks/ctors.py
index b5694a3a21502..5715c4fb2d0d4 100644
--- a/asv_bench/benchmarks/ctors.py
+++ b/asv_bench/benchmarks/ctors.py
@@ -1,30 +1,103 @@
-from .pandas_vb_common import *
+import numpy as np
+import pandas.util.testing as tm
+from pandas import Series, Index, DatetimeIndex, Timestamp, MultiIndex
-class Constructors(object):
-    goal_time = 0.2
+def no_change(arr):
+    return arr
+
+
+def list_of_str(arr):
+    return list(arr.astype(str))
+
+
+def gen_of_str(arr):
+    return (x for x in arr.astype(str))
+
+
+def arr_dict(arr):
+    return dict(zip(range(len(arr)), arr))
+
+
+def list_of_tuples(arr):
+    return [(i, -i) for i in arr]
+
+
+def gen_of_tuples(arr):
+    return ((i, -i) for i in arr)
+
+
+def list_of_lists(arr):
+    return [[i, -i] for i in arr]
+
+
+def list_of_tuples_with_none(arr):
+    return [(i, -i) for i in arr][:-1] + [None]
-    def setup(self):
-        self.arr = np.random.randn(100, 100)
-        self.arr_str = np.array(['foo', 'bar', 'baz'], dtype=object)
-        self.data = np.random.randn(100)
-        self.index = Index(np.arange(100))
+def list_of_lists_with_none(arr):
+    return [[i, -i] for i in arr][:-1] + [None]
-        self.s = Series(([Timestamp('20110101'), Timestamp('20120101'),
-                          Timestamp('20130101')] * 1000))
-    def time_frame_from_ndarray(self):
-        DataFrame(self.arr)
+class SeriesConstructors(object):
-    def time_series_from_ndarray(self):
-        pd.Series(self.data, index=self.index)
+    param_names = ["data_fmt", "with_index", "dtype"]
+    params = [[no_change,
+               list,
+               list_of_str,
+               gen_of_str,
+               arr_dict,
+               list_of_tuples,
+               gen_of_tuples,
+               list_of_lists,
+               list_of_tuples_with_none,
+               list_of_lists_with_none],
+              [False, True],
+              ['float', 'int']]
+
+    def setup(self, data_fmt, with_index, dtype):
+        N = 10**4
+        if dtype == 'float':
+            arr = np.random.randn(N)
+        else:
+            arr = np.arange(N)
+        self.data = data_fmt(arr)
+        self.index = np.arange(N) if with_index else None
+
+    def time_series_constructor(self, data_fmt, with_index, dtype):
+        Series(self.data, index=self.index)
+
+
+class SeriesDtypesConstructors(object):
+
+    def setup(self):
+        N = 10**4
+        self.arr = np.random.randn(N)
+        self.arr_str = np.array(['foo', 'bar', 'baz'], dtype=object)
+        self.s = Series([Timestamp('20110101'), Timestamp('20120101'),
+                         Timestamp('20130101')] * N * 10)
     def time_index_from_array_string(self):
         Index(self.arr_str)
+    def time_index_from_array_floats(self):
+        Index(self.arr)
+
     def time_dtindex_from_series(self):
         DatetimeIndex(self.s)
-    def time_dtindex_from_series2(self):
+    def time_dtindex_from_index_with_series(self):
         Index(self.s)
+
+
+class MultiIndexConstructor(object):
+
+    def setup(self):
+        N = 10**4
+        self.iterables = [tm.makeStringIndex(N), range(20)]
+
+    def time_multiindex_from_iterables(self):
+        MultiIndex.from_product(self.iterables)
+
+
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/dtypes.py b/asv_bench/benchmarks/dtypes.py
new file mode 100644
index 0000000000000..e59154cd99965
--- /dev/null
+++ b/asv_bench/benchmarks/dtypes.py
@@ -0,0 +1,39 @@
+from pandas.api.types import pandas_dtype
+
+import numpy as np
+from .pandas_vb_common import (
+    numeric_dtypes, datetime_dtypes, string_dtypes, extension_dtypes)
+
+
+_numpy_dtypes = [np.dtype(dtype)
+                 for dtype in (numeric_dtypes +
+                               datetime_dtypes +
+                               string_dtypes)]
+_dtypes = _numpy_dtypes + extension_dtypes
+
+
+class Dtypes(object):
+    params = (_dtypes +
+              list(map(lambda dt: dt.name, _dtypes)))
+    param_names = ['dtype']
+
+    def time_pandas_dtype(self, dtype):
+        pandas_dtype(dtype)
+
+
+class DtypesInvalid(object):
+    param_names = ['dtype']
+    params = ['scalar-string', 'scalar-int', 'list-string', 'array-string']
+    data_dict = {'scalar-string': 'foo',
+                 'scalar-int': 1,
+                 'list-string': ['foo'] * 1000,
+                 'array-string': np.array(['foo'] * 1000)}
+
+    def time_pandas_dtype_invalid(self, dtype):
+        try:
+            pandas_dtype(self.data_dict[dtype])
+        except TypeError:
+            pass
+
+
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/eval.py b/asv_bench/benchmarks/eval.py
index a0819e33dc254..68df38cd50742 100644
--- a/asv_bench/benchmarks/eval.py
+++ b/asv_bench/benchmarks/eval.py
@@ -1,67 +1,64 @@
-from .pandas_vb_common import *
+import numpy as np
 import pandas as pd
-import pandas.computation.expressions as expr
+try:
+    import pandas.core.computation.expressions as expr
+except ImportError:
+    import pandas.computation.expressions as expr
 class Eval(object):
-    goal_time = 0.2
     params = [['numexpr', 'python'], [1, 'all']]
     param_names = ['engine', 'threads']
     def setup(self, engine, threads):
-        self.df = DataFrame(np.random.randn(20000, 100))
-        self.df2 = DataFrame(np.random.randn(20000, 100))
-        self.df3 = DataFrame(np.random.randn(20000, 100))
-        self.df4 = DataFrame(np.random.randn(20000, 100))
+        self.df = pd.DataFrame(np.random.randn(20000, 100))
+        self.df2 = pd.DataFrame(np.random.randn(20000, 100))
+        self.df3 = pd.DataFrame(np.random.randn(20000, 100))
+        self.df4 = pd.DataFrame(np.random.randn(20000, 100))
         if threads == 1:
             expr.set_numexpr_threads(1)
     def time_add(self, engine, threads):
-        df, df2, df3, df4 = self.df, self.df2, self.df3, self.df4
-        pd.eval('df + df2 + df3 + df4', engine=engine)
+        pd.eval('self.df + self.df2 + self.df3 + self.df4', engine=engine)
     def time_and(self, engine, threads):
-        df, df2, df3, df4 = self.df, self.df2, self.df3, self.df4
-        pd.eval('(df > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)', engine=engine)
+        pd.eval('(self.df > 0) & (self.df2 > 0) & '
+                '(self.df3 > 0) & (self.df4 > 0)', engine=engine)
     def time_chained_cmp(self, engine, threads):
-        df, df2, df3, df4 = self.df, self.df2, self.df3, self.df4
-        pd.eval('df < df2 < df3 < df4', engine=engine)
+        pd.eval('self.df < self.df2 < self.df3 < self.df4', engine=engine)
     def time_mult(self, engine, threads):
-        df, df2, df3, df4 = self.df, self.df2, self.df3, self.df4
-        pd.eval('df * df2 * df3 * df4', engine=engine)
+        pd.eval('self.df * self.df2 * self.df3 * self.df4', engine=engine)
     def teardown(self, engine, threads):
         expr.set_numexpr_threads()
 class Query(object):
-    goal_time = 0.2
     def setup(self):
-        self.N = 1000000
-        self.halfway = ((self.N // 2) - 1)
-        self.index = date_range('20010101', periods=self.N, freq='T')
-        self.s = Series(self.index)
-        self.ts = self.s.iloc[self.halfway]
-        self.df = DataFrame({'a': np.random.randn(self.N), }, index=self.index)
-        self.df2 = DataFrame({'dates': self.s.values,})
-
-        self.df3 = DataFrame({'a': np.random.randn(self.N),})
-        self.min_val = self.df3['a'].min()
-        self.max_val = self.df3['a'].max()
+        N = 10**6
+        halfway = (N // 2) - 1
+        index = pd.date_range('20010101', periods=N, freq='T')
+        s = pd.Series(index)
+        self.ts = s.iloc[halfway]
+        self.df = pd.DataFrame({'a': np.random.randn(N), 'dates': index},
+                               index=index)
+        data = np.random.randn(N)
+        self.min_val = data.min()
+        self.max_val = data.max()
     def time_query_datetime_index(self):
-        ts = self.ts
-        self.df.query('index < @ts')
+        self.df.query('index < @self.ts')
-    def time_query_datetime_series(self):
-        ts = self.ts
-        self.df2.query('dates < @ts')
+    def time_query_datetime_column(self):
+        self.df.query('dates < @self.ts')
     def time_query_with_boolean_selection(self):
-        min_val, max_val = self.min_val, self.max_val
-        self.df.query('(a >= @min_val) & (a <= @max_val)')
+        self.df.query('(a >= @self.min_val) & (a <= @self.max_val)')
+
+
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/frame_ctor.py b/asv_bench/benchmarks/frame_ctor.py
index 05c1a27fdf8ca..dfb6ab5b189b2 100644
--- a/asv_bench/benchmarks/frame_ctor.py
+++ b/asv_bench/benchmarks/frame_ctor.py
@@ -1,138 +1,107 @@
-from .pandas_vb_common import *
+import numpy as np
+import pandas.util.testing as tm
+from pandas import DataFrame, Series, MultiIndex, Timestamp, date_range
 try:
-    from pandas.tseries.offsets import *
-except:
-    from pandas.core.datetools import *
+    from pandas.tseries.offsets import Nano, Hour
+except ImportError:
+    # For compatibility with older versions
+    from pandas.core.datetools import *  # noqa
-#----------------------------------------------------------------------
-# Creation from nested dict
-
 class FromDicts(object):
-    goal_time = 0.2
     def setup(self):
-        (N, K) = (5000, 50)
+        N, K = 5000, 50
         self.index = tm.makeStringIndex(N)
         self.columns = tm.makeStringIndex(K)
-        self.frame = DataFrame(np.random.randn(N, K), index=self.index, columns=self.columns)
-        try:
-            self.data = self.frame.to_dict()
-        except:
-            self.data = self.frame.toDict()
-        self.some_dict = self.data.values()[0]
-        self.dict_list = [dict(zip(self.columns, row)) for row in self.frame.values]
-
-        self.data2 = dict(
-            ((i, dict(((j, float(j)) for j in range(100)))) for i in
-             xrange(2000)))
-
-    def time_frame_ctor_list_of_dict(self):
+        frame = DataFrame(np.random.randn(N, K), index=self.index,
+                          columns=self.columns)
+        self.data = frame.to_dict()
+        self.dict_list = frame.to_dict(orient='records')
+        self.data2 = {i: {j: float(j) for j in range(100)}
+                      for i in range(2000)}
+
+    def time_list_of_dict(self):
         DataFrame(self.dict_list)
-    def time_frame_ctor_nested_dict(self):
+    def time_nested_dict(self):
         DataFrame(self.data)
-    def time_series_ctor_from_dict(self):
-        Series(self.some_dict)
+    def time_nested_dict_index(self):
+        DataFrame(self.data, index=self.index)
-    def time_frame_ctor_nested_dict_int64(self):
-        # nested dict, integer indexes, regression described in #621
-        DataFrame(self.data)
+    def time_nested_dict_columns(self):
+        DataFrame(self.data, columns=self.columns)
+    def time_nested_dict_index_columns(self):
+        DataFrame(self.data, index=self.index, columns=self.columns)
-# from a mi-series
+    def time_nested_dict_int64(self):
+        # nested dict, integer indexes, regression described in #621
+        DataFrame(self.data2)
-class frame_from_series(object):
-    goal_time = 0.2
+
+class FromSeries(object):
     def setup(self):
-        self.mi = MultiIndex.from_tuples([(x, y) for x in range(100) for y in range(100)])
-        self.s = Series(randn(10000), index=self.mi)
+        mi = MultiIndex.from_product([range(100), range(100)])
+        self.s = Series(np.random.randn(10000), index=mi)
-    def time_frame_from_mi_series(self):
+    def time_mi_series(self):
         DataFrame(self.s)
-#----------------------------------------------------------------------
-# get_numeric_data
-
-class frame_get_numeric_data(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(randn(10000, 25))
-        self.df['foo'] = 'bar'
-        self.df['bar'] = 'baz'
-        self.df = self.df.consolidate()
-
-    def time_frame_get_numeric_data(self):
-        self.df._get_numeric_data()
+class FromDictwithTimestamp(object):
+    params = [Nano(1), Hour(1)]
+    param_names = ['offset']
-# ----------------------------------------------------------------------
-# From dict with DatetimeIndex with all offsets
-
-# dynamically generate benchmarks for every offset
-#
-# get_period_count & get_index_for_offset are there because blindly taking each
-# offset times 1000 can easily go out of Timestamp bounds and raise errors.
+    def setup(self, offset):
+        N = 10**3
+        np.random.seed(1234)
+        idx = date_range(Timestamp('1/1/1900'), freq=offset, periods=N)
+        df = DataFrame(np.random.randn(N, 10), index=idx)
+        self.d = df.to_dict()
+    def time_dict_with_timestamp_offsets(self, offset):
+        DataFrame(self.d)
-def get_period_count(start_date, off):
-    ten_offsets_in_days = ((start_date + (off * 10)) - start_date).days
-    if (ten_offsets_in_days == 0):
-        return 1000
-    else:
-        return min((9 * ((Timestamp.max - start_date).days // ten_offsets_in_days)), 1000)
+class FromRecords(object):
-def get_index_for_offset(off):
-    start_date = Timestamp('1/1/1900')
-    return date_range(start_date, periods=min(1000, get_period_count(
-        start_date, off)), freq=off)
+    params = [None, 1000]
+    param_names = ['nrows']
+    def setup(self, nrows):
+        N = 100000
+        self.gen = ((x, (x * 20), (x * 100)) for x in range(N))
-all_offsets = offsets.__all__
-# extra cases
-for off in ['FY5253', 'FY5253Quarter']:
-    all_offsets.pop(all_offsets.index(off))
-    all_offsets.extend([off + '_1', off + '_2'])
+    def time_frame_from_records_generator(self, nrows):
+        # issue-6700
+        self.df = DataFrame.from_records(self.gen, nrows=nrows)
-class FrameConstructorDTIndexFromOffsets(object):
+class FromNDArray(object):
-    params = [all_offsets, [1, 2]]
-    param_names = ['offset', 'n_steps']
+    def setup(self):
+        N = 100000
+        self.data = np.random.randn(N)
-    offset_kwargs = {'WeekOfMonth': {'weekday': 1, 'week': 1},
-                     'LastWeekOfMonth': {'weekday': 1, 'week': 1},
-                     'FY5253': {'startingMonth': 1, 'weekday': 1},
-                     'FY5253Quarter': {'qtr_with_extra_week': 1, 'startingMonth': 1, 'weekday': 1}}
+    def time_frame_from_ndarray(self):
+        self.df = DataFrame(self.data)
-    offset_extra_cases = {'FY5253': {'variation': ['nearest', 'last']},
-                          'FY5253Quarter': {'variation': ['nearest', 'last']}}
-    def setup(self, offset, n_steps):
+class FromLists(object):
-        extra = False
-        if offset.endswith("_", None, -1):
-            extra = int(offset[-1])
-            offset = offset[:-2]
+    goal_time = 0.2
-        kwargs = {}
-        if offset in self.offset_kwargs:
-            kwargs = self.offset_kwargs[offset]
+    def setup(self):
+        N = 1000
+        M = 100
+        self.data = [[j for j in range(M)] for i in range(N)]
-        if extra:
-            extras = self.offset_extra_cases[offset]
-            for extra_arg in extras:
-                kwargs[extra_arg] = extras[extra_arg][extra -1]
+    def time_frame_from_lists(self):
+        self.df = DataFrame(self.data)
-        offset = getattr(offsets, offset)
-        self.idx = get_index_for_offset(offset(n_steps, **kwargs))
-        self.df = DataFrame(np.random.randn(len(self.idx), 10), index=self.idx)
-        self.d = dict([(col, self.df[col]) for col in self.df.columns])
-    def time_frame_ctor(self, offset, n_steps):
-        DataFrame(self.d)
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/frame_methods.py b/asv_bench/benchmarks/frame_methods.py
index 9f491302a4d6f..ba2e63c20d3f8 100644
--- a/asv_bench/benchmarks/frame_methods.py
+++ b/asv_bench/benchmarks/frame_methods.py
@@ -1,20 +1,36 @@
-from .pandas_vb_common import *
 import string
+import numpy as np
-#----------------------------------------------------------------------
-# lookup
+from pandas import (
+    DataFrame, MultiIndex, NaT, Series, date_range, isnull, period_range)
+import pandas.util.testing as tm
-class frame_fancy_lookup(object):
-    goal_time = 0.2
+
+class GetNumericData(object):
+
+    def setup(self):
+        self.df = DataFrame(np.random.randn(10000, 25))
+        self.df['foo'] = 'bar'
+        self.df['bar'] = 'baz'
+        self.df = self.df._consolidate()
+
+    def time_frame_get_numeric_data(self):
+        self.df._get_numeric_data()
+
+
+class Lookup(object):
     def setup(self):
-        self.df = DataFrame(np.random.randn(10000, 8), columns=list('abcdefgh'))
+        self.df = DataFrame(np.random.randn(10000, 8),
+                            columns=list('abcdefgh'))
         self.df['foo'] = 'bar'
         self.row_labels = list(self.df.index[::10])[:900]
-        self.col_labels = (list(self.df.columns) * 100)
-        self.row_labels_all = np.array((list(self.df.index) * len(self.df.columns)), dtype='object')
-        self.col_labels_all = np.array((list(self.df.columns) * len(self.df.index)), dtype='object')
+        self.col_labels = list(self.df.columns) * 100
+        self.row_labels_all = np.array(
+            list(self.df.index) * len(self.df.columns), dtype='object')
+        self.col_labels_all = np.array(
+            list(self.df.columns) * len(self.df.index), dtype='object')
     def time_frame_fancy_lookup(self):
         self.df.lookup(self.row_labels, self.col_labels)
@@ -23,25 +39,18 @@ def time_frame_fancy_lookup_all(self):
         self.df.lookup(self.row_labels_all, self.col_labels_all)
-#----------------------------------------------------------------------
-# reindex
-
 class Reindex(object):
-    goal_time = 0.2
     def setup(self):
-        self.df = DataFrame(randn(10000, 1000))
-        self.idx = np.arange(4000, 7000)
-
+        N = 10**3
+        self.df = DataFrame(np.random.randn(N * 10, N))
+        self.idx = np.arange(4 * N, 7 * N)
         self.df2 = DataFrame(
-            dict([(c, {0: randint(0, 2, 1000).astype(np.bool_),
-                       1: randint(0, 1000, 1000).astype(
-                           np.int16),
-                       2: randint(0, 1000, 1000).astype(
-                           np.int32),
-                       3: randint(0, 1000, 1000).astype(
-                           np.int64),}[randint(0, 4)]) for c in
-                  range(1000)]))
+            {c: {0: np.random.randint(0, 2, N).astype(np.bool_),
+                 1: np.random.randint(0, N, N).astype(np.int16),
+                 2: np.random.randint(0, N, N).astype(np.int32),
+                 3: np.random.randint(0, N, N).astype(np.int64)}
+             [np.random.randint(0, 4)] for c in range(N)})
     def time_reindex_axis0(self):
         self.df.reindex(self.idx)
@@ -52,82 +61,167 @@ def time_reindex_axis1(self):
     def time_reindex_both_axes(self):
         self.df.reindex(index=self.idx, columns=self.idx)
-    def time_reindex_both_axes_ix(self):
-        self.df.ix[(self.idx, self.idx)]
-
     def time_reindex_upcast(self):
-        self.df2.reindex(permutation(range(1200)))
+        self.df2.reindex(np.random.permutation(range(1200)))
-#----------------------------------------------------------------------
-# iteritems (monitor no-copying behaviour)
+class Rename(object):
+
+    def setup(self):
+        N = 10**3
+        self.df = DataFrame(np.random.randn(N * 10, N))
+        self.idx = np.arange(4 * N, 7 * N)
+        self.dict_idx = {k: k for k in self.idx}
+        self.df2 = DataFrame(
+            {c: {0: np.random.randint(0, 2, N).astype(np.bool_),
+                 1: np.random.randint(0, N, N).astype(np.int16),
+                 2: np.random.randint(0, N, N).astype(np.int32),
+                 3: np.random.randint(0, N, N).astype(np.int64)}
+             [np.random.randint(0, 4)] for c in range(N)})
+
+    def time_rename_single(self):
+        self.df.rename({0: 0})
+
+    def time_rename_axis0(self):
+        self.df.rename(self.dict_idx)
+
+    def time_rename_axis1(self):
+        self.df.rename(columns=self.dict_idx)
+
+    def time_rename_both_axes(self):
+        self.df.rename(index=self.dict_idx, columns=self.dict_idx)
+
+    def time_dict_rename_both_axes(self):
+        self.df.rename(index=self.dict_idx, columns=self.dict_idx)
+
 class Iteration(object):
-    goal_time = 0.2
     def setup(self):
-        self.df = DataFrame(randn(10000, 1000))
-        self.df2 = DataFrame(np.random.randn(50000, 10))
-        self.df3 = pd.DataFrame(np.random.randn(1000,5000),
-                                columns=['C'+str(c) for c in range(5000)])
+        N = 1000
+        self.df = DataFrame(np.random.randn(N * 10, N))
+        self.df2 = DataFrame(np.random.randn(N * 50, 10))
+        self.df3 = DataFrame(np.random.randn(N, 5 * N),
+                             columns=['C' + str(c) for c in range(N * 5)])
+        self.df4 = DataFrame(np.random.randn(N * 1000, 10))
-    def f(self):
+    def time_iteritems(self):
+        # (monitor no-copying behaviour)
         if hasattr(self.df, '_item_cache'):
             self.df._item_cache.clear()
-        for (name, col) in self.df.iteritems():
+        for name, col in self.df.iteritems():
             pass
-    def g(self):
-        for (name, col) in self.df.iteritems():
+    def time_iteritems_cached(self):
+        for name, col in self.df.iteritems():
             pass
-    def time_iteritems(self):
-        self.f()
+    def time_iteritems_indexing(self):
+        for col in self.df3:
+            self.df3[col]
-    def time_iteritems_cached(self):
-        self.g()
+    def time_itertuples_start(self):
+        self.df4.itertuples()
-    def time_iteritems_indexing(self):
-        df = self.df3
-        for col in df:
-            df[col]
+    def time_itertuples_read_first(self):
+        next(self.df4.itertuples())
     def time_itertuples(self):
-        for row in self.df2.itertuples():
+        for row in self.df4.itertuples():
             pass
+    def time_itertuples_to_list(self):
+        list(self.df4.itertuples())
-#----------------------------------------------------------------------
-# to_string, to_html, repr
+    def mem_itertuples_start(self):
+        return self.df4.itertuples()
-class Formatting(object):
-    goal_time = 0.2
+    def peakmem_itertuples_start(self):
+        self.df4.itertuples()
-    def setup(self):
-        self.df = DataFrame(randn(100, 10))
+    def mem_itertuples_read_first(self):
+        return next(self.df4.itertuples())
+
+    def peakmem_itertuples(self):
+        for row in self.df4.itertuples():
+            pass
+
+    def mem_itertuples_to_list(self):
+        return list(self.df4.itertuples())
+
+    def peakmem_itertuples_to_list(self):
+        list(self.df4.itertuples())
+
+    def time_itertuples_raw_start(self):
+        self.df4.itertuples(index=False, name=None)
+
+    def time_itertuples_raw_read_first(self):
+        next(self.df4.itertuples(index=False, name=None))
+
+    def time_itertuples_raw_tuples(self):
+        for row in self.df4.itertuples(index=False, name=None):
+            pass
+
+    def time_itertuples_raw_tuples_to_list(self):
+        list(self.df4.itertuples(index=False, name=None))
+
+    def mem_itertuples_raw_start(self):
+        return self.df4.itertuples(index=False, name=None)
+
+    def peakmem_itertuples_raw_start(self):
+        self.df4.itertuples(index=False, name=None)
+
+    def peakmem_itertuples_raw_read_first(self):
+        next(self.df4.itertuples(index=False, name=None))
+
+    def peakmem_itertuples_raw(self):
+        for row in self.df4.itertuples(index=False, name=None):
+            pass
+
+    def mem_itertuples_raw_to_list(self):
+        return list(self.df4.itertuples(index=False, name=None))
-        self.nrows = 500
-        self.df2 = DataFrame(randn(self.nrows, 10))
-        self.df2[0] = period_range('2000', '2010', self.nrows)
-        self.df2[1] = range(self.nrows)
+    def peakmem_itertuples_raw_to_list(self):
+        list(self.df4.itertuples(index=False, name=None))
-        self.nrows = 10000
-        self.data = randn(self.nrows, 10)
-        self.idx = MultiIndex.from_arrays(np.tile(randn(3, int(self.nrows / 100)), 100))
-        self.df3 = DataFrame(self.data, index=self.idx)
-        self.idx = randn(self.nrows)
-        self.df4 = DataFrame(self.data, index=self.idx)
+    def time_iterrows(self):
+        for row in self.df.iterrows():
+            pass
-        self.df_tall = pandas.DataFrame(np.random.randn(10000, 10))
-        self.df_wide = pandas.DataFrame(np.random.randn(10, 10000))
+class ToString(object):
+
+    def setup(self):
+        self.df = DataFrame(np.random.randn(100, 10))
     def time_to_string_floats(self):
         self.df.to_string()
+
+class ToHTML(object):
+
+    def setup(self):
+        nrows = 500
+        self.df2 = DataFrame(np.random.randn(nrows, 10))
+        self.df2[0] = period_range('2000', periods=nrows)
+        self.df2[1] = range(nrows)
+
     def time_to_html_mixed(self):
         self.df2.to_html()
+
+class Repr(object):
+
+    def setup(self):
+        nrows = 10000
+        data = np.random.randn(nrows, 10)
+        arrays = np.tile(np.random.randn(3, int(nrows / 100)), 100)
+        idx = MultiIndex.from_arrays(arrays)
+        self.df3 = DataFrame(data, index=idx)
+        self.df4 = DataFrame(data, index=np.random.randn(nrows))
+        self.df_tall = DataFrame(np.random.randn(nrows, 10))
+        self.df_wide = DataFrame(np.random.randn(10, nrows))
+
     def time_html_repr_trunc_mi(self):
         self.df3._repr_html_()
@@ -141,21 +235,14 @@ def time_frame_repr_wide(self):
         repr(self.df_wide)
-#----------------------------------------------------------------------
-# nulls/masking
-
-
-## masking
-
-class frame_mask_bools(object):
-    goal_time = 0.2
+class MaskBool(object):
     def setup(self):
-        self.data = np.random.randn(1000, 500)
-        self.df = DataFrame(self.data)
-        self.df = self.df.where((self.df > 0))
-        self.bools = (self.df > 0)
-        self.mask = isnull(self.df)
+        data = np.random.randn(1000, 500)
+        df = DataFrame(data)
+        df = df.where(df > 0)
+        self.bools = df > 0
+        self.mask = isnull(df)
     def time_frame_mask_bools(self):
         self.bools.mask(self.mask)
@@ -164,31 +251,24 @@ def time_frame_mask_floats(self):
         self.bools.astype(float).mask(self.mask)
-## isnull
-
-class FrameIsnull(object):
-    goal_time = 0.2
+class Isnull(object):
     def setup(self):
-        self.df_no_null = DataFrame(np.random.randn(1000, 1000))
-
-        np.random.seed(1234)
-        self.sample = np.array([np.nan, 1.0])
-        self.data = np.random.choice(self.sample, (1000, 1000))
-        self.df = DataFrame(self.data)
-
-        np.random.seed(1234)
-        self.sample = np.array(list(string.ascii_lowercase) +
-                               list(string.ascii_uppercase) +
-                               list(string.whitespace))
-        self.data = np.random.choice(self.sample, (1000, 1000))
-        self.df_strings= DataFrame(self.data)
-
-        np.random.seed(1234)
-        self.sample = np.array([NaT, np.nan, None, np.datetime64('NaT'),
-                                np.timedelta64('NaT'), 0, 1, 2.0, '', 'abcd'])
-        self.data = np.random.choice(self.sample, (1000, 1000))
-        self.df_obj = DataFrame(self.data)
+        N = 10**3
+        self.df_no_null = DataFrame(np.random.randn(N, N))
+
+        sample = np.array([np.nan, 1.0])
+        data = np.random.choice(sample, (N, N))
+        self.df = DataFrame(data)
+
+        sample = np.array(list(string.ascii_letters + string.whitespace))
+        data = np.random.choice(sample, (N, N))
+        self.df_strings = DataFrame(data)
+
+        sample = np.array([NaT, np.nan, None, np.datetime64('NaT'),
+                           np.timedelta64('NaT'), 0, 1, 2.0, '', 'abcd'])
+        data = np.random.choice(sample, (N, N))
+        self.df_obj = DataFrame(data)
     def time_isnull_floats_no_null(self):
         isnull(self.df_no_null)
@@ -203,126 +283,97 @@ def time_isnull_obj(self):
         isnull(self.df_obj)
-# ----------------------------------------------------------------------
-# fillna in place
-
-class frame_fillna_inplace(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.df = DataFrame(randn(10000, 100))
-        self.df.values[::2] = np.nan
-
-    def time_frame_fillna_inplace(self):
-        self.df.fillna(0, inplace=True)
+class Fillna(object):
+    params = ([True, False], ['pad', 'bfill'])
+    param_names = ['inplace', 'method']
+    def setup(self, inplace, method):
+        values = np.random.randn(10000, 100)
+        values[::2] = np.nan
+        self.df = DataFrame(values)
-class frame_fillna_many_columns_pad(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.values = np.random.randn(1000, 1000)
-        self.values[::2] = np.nan
-        self.df = DataFrame(self.values)
-
-    def time_frame_fillna_many_columns_pad(self):
-        self.df.fillna(method='pad')
-
+    def time_frame_fillna(self, inplace, method):
+        self.df.fillna(inplace=inplace, method=method)
 class Dropna(object):
-    goal_time = 0.2
-    def setup(self):
-        self.data = np.random.randn(10000, 1000)
-        self.df = DataFrame(self.data)
+    params = (['all', 'any'], [0, 1])
+    param_names = ['how', 'axis']
+
+    def setup(self, how, axis):
+        self.df = DataFrame(np.random.randn(10000, 1000))
         self.df.ix[50:1000, 20:50] = np.nan
         self.df.ix[2000:3000] = np.nan
         self.df.ix[:, 60:70] = np.nan
         self.df_mixed = self.df.copy()
         self.df_mixed['foo'] = 'bar'
-        self.df_mi = self.df.copy()
-        self.df_mi.index = MultiIndex.from_tuples(self.df_mi.index.map((lambda x: (x, x))))
-        self.df_mi.columns = MultiIndex.from_tuples(self.df_mi.columns.map((lambda x: (x, x))))
-
-        self.df_mixed_mi = self.df_mixed.copy()
-        self.df_mixed_mi.index = MultiIndex.from_tuples(self.df_mixed_mi.index.map((lambda x: (x, x))))
-        self.df_mixed_mi.columns = MultiIndex.from_tuples(self.df_mixed_mi.columns.map((lambda x: (x, x))))
-
-    def time_dropna_axis0_all(self):
-        self.df.dropna(how='all', axis=0)
-
-    def time_dropna_axis0_any(self):
-        self.df.dropna(how='any', axis=0)
-
-    def time_dropna_axis1_all(self):
-        self.df.dropna(how='all', axis=1)
+    def time_dropna(self, how, axis):
+        self.df.dropna(how=how, axis=axis)
-    def time_dropna_axis1_any(self):
-        self.df.dropna(how='any', axis=1)
+    def time_dropna_axis_mixed_dtypes(self, how, axis):
+        self.df_mixed.dropna(how=how, axis=axis)
-    def time_dropna_axis0_all_mixed_dtypes(self):
-        self.df_mixed.dropna(how='all', axis=0)
-    def time_dropna_axis0_any_mixed_dtypes(self):
-        self.df_mixed.dropna(how='any', axis=0)
+class Count(object):
-    def time_dropna_axis1_all_mixed_dtypes(self):
-        self.df_mixed.dropna(how='all', axis=1)
+    params = [0, 1]
+    param_names = ['axis']
-    def time_dropna_axis1_any_mixed_dtypes(self):
-        self.df_mixed.dropna(how='any', axis=1)
-
-    def time_count_level_axis0_multi(self):
-        self.df_mi.count(axis=0, level=1)
+    def setup(self, axis):
+        self.df = DataFrame(np.random.randn(10000, 1000))
+        self.df.ix[50:1000, 20:50] = np.nan
+        self.df.ix[2000:3000] = np.nan
+        self.df.ix[:, 60:70] = np.nan
+        self.df_mixed = self.df.copy()
+        self.df_mixed['foo'] = 'bar'
-    def time_count_level_axis1_multi(self):
-        self.df_mi.count(axis=1, level=1)
+        self.df.index = MultiIndex.from_arrays([self.df.index, self.df.index])
+        self.df.columns = MultiIndex.from_arrays([self.df.columns,
+                                                  self.df.columns])
+        self.df_mixed.index = MultiIndex.from_arrays([self.df_mixed.index,
+                                                      self.df_mixed.index])
+        self.df_mixed.columns = MultiIndex.from_arrays([self.df_mixed.columns,
+                                                        self.df_mixed.columns])
-    def time_count_level_axis0_mixed_dtypes_multi(self):
-        self.df_mixed_mi.count(axis=0, level=1)
+    def time_count_level_multi(self, axis):
+        self.df.count(axis=axis, level=1)
-    def time_count_level_axis1_mixed_dtypes_multi(self):
-        self.df_mixed_mi.count(axis=1, level=1)
+    def time_count_level_mixed_dtypes_multi(self, axis):
+        self.df_mixed.count(axis=axis, level=1)
 class Apply(object):
-    goal_time = 0.2
     def setup(self):
         self.df = DataFrame(np.random.randn(1000, 100))
         self.s = Series(np.arange(1028.0))
         self.df2 = DataFrame({i: self.s for i in range(1028)})
-        self.df3 = DataFrame(np.random.randn(1000, 3), columns=list('ABC'))
     def time_apply_user_func(self):
-        self.df2.apply((lambda x: np.corrcoef(x, self.s)[(0, 1)]))
+        self.df2.apply(lambda x: np.corrcoef(x, self.s)[(0, 1)])
     def time_apply_axis_1(self):
-        self.df.apply((lambda x: (x + 1)), axis=1)
+        self.df.apply(lambda x: x + 1, axis=1)
     def time_apply_lambda_mean(self):
-        self.df.apply((lambda x: x.mean()))
+        self.df.apply(lambda x: x.mean())
     def
time_apply_np_mean(self): self.df.apply(np.mean) def time_apply_pass_thru(self): - self.df.apply((lambda x: x)) + self.df.apply(lambda x: x) def time_apply_ref_by_name(self): - self.df3.apply((lambda x: (x['A'] + x['B'])), axis=1) + self.df3.apply(lambda x: x['A'] + x['B'], axis=1) -#---------------------------------------------------------------------- -# dtypes - -class frame_dtypes(object): - goal_time = 0.2 +class Dtypes(object): def setup(self): self.df = DataFrame(np.random.randn(1000, 1000)) @@ -330,331 +381,205 @@ def setup(self): def time_frame_dtypes(self): self.df.dtypes -#---------------------------------------------------------------------- -# equals class Equals(object): - goal_time = 0.2 def setup(self): - self.float_df = DataFrame(np.random.randn(1000, 1000)) - self.object_df = DataFrame(([(['foo'] * 1000)] * 1000)) - self.nonunique_cols = self.object_df.copy() - self.nonunique_cols.columns = (['A'] * len(self.nonunique_cols.columns)) - self.pairs = dict([(name, self.make_pair(frame)) for (name, frame) in ( - ('float_df', self.float_df), ('object_df', self.object_df), - ('nonunique_cols', self.nonunique_cols))]) + N = 10**3 + self.float_df = DataFrame(np.random.randn(N, N)) + self.float_df_nan = self.float_df.copy() + self.float_df_nan.iloc[-1, -1] = np.nan - def make_pair(self, frame): - self.df = frame - self.df2 = self.df.copy() - self.df2.ix[((-1), (-1))] = np.nan - return (self.df, self.df2) + self.object_df = DataFrame('foo', index=range(N), columns=range(N)) + self.object_df_nan = self.object_df.copy() + self.object_df_nan.iloc[-1, -1] = np.nan - def test_equal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df) - - def test_unequal(self, name): - (self.df, self.df2) = self.pairs[name] - return self.df.equals(self.df2) + self.nonunique_cols = self.object_df.copy() + self.nonunique_cols.columns = ['A'] * len(self.nonunique_cols.columns) + self.nonunique_cols_nan = self.nonunique_cols.copy() + 
self.nonunique_cols_nan.iloc[-1, -1] = np.nan def time_frame_float_equal(self): - self.test_equal('float_df') + self.float_df.equals(self.float_df) def time_frame_float_unequal(self): - self.test_unequal('float_df') + self.float_df.equals(self.float_df_nan) def time_frame_nonunique_equal(self): - self.test_equal('nonunique_cols') + self.nonunique_cols.equals(self.nonunique_cols) def time_frame_nonunique_unequal(self): - self.test_unequal('nonunique_cols') + self.nonunique_cols.equals(self.nonunique_cols_nan) def time_frame_object_equal(self): - self.test_equal('object_df') + self.object_df.equals(self.object_df) def time_frame_object_unequal(self): - self.test_unequal('object_df') + self.object_df.equals(self.object_df_nan) class Interpolate(object): - goal_time = 0.2 - def setup(self): + params = [None, 'infer'] + param_names = ['downcast'] + + def setup(self, downcast): + N = 10000 # this is the worst case, where every column has NaNs. - self.df = DataFrame(randn(10000, 100)) + self.df = DataFrame(np.random.randn(N, 100)) self.df.values[::2] = np.nan - self.df2 = DataFrame( - {'A': np.arange(0, 10000), 'B': np.random.randint(0, 100, 10000), - 'C': randn(10000), 'D': randn(10000),}) + self.df2 = DataFrame({'A': np.arange(0, N), + 'B': np.random.randint(0, 100, N), + 'C': np.random.randn(N), + 'D': np.random.randn(N)}) self.df2.loc[1::5, 'A'] = np.nan self.df2.loc[1::5, 'C'] = np.nan - def time_interpolate(self): - self.df.interpolate() + def time_interpolate(self, downcast): + self.df.interpolate(downcast=downcast) - def time_interpolate_some_good(self): - self.df2.interpolate() - - def time_interpolate_some_good_infer(self): - self.df2.interpolate(downcast='infer') + def time_interpolate_some_good(self, downcast): + self.df2.interpolate(downcast=downcast) class Shift(object): # frame shift speedup issue-5609 - goal_time = 0.2 + params = [0, 1] + param_names = ['axis'] - def setup(self): + def setup(self, axis): self.df = DataFrame(np.random.rand(10000, 500)) - 
def time_shift_axis0(self): - self.df.shift(1, axis=0) - - def time_shift_axis_1(self): - self.df.shift(1, axis=1) - - -#----------------------------------------------------------------------------- -# from_records issue-6700 - -class frame_from_records_generator(object): - goal_time = 0.2 - - def get_data(self, n=100000): - return ((x, (x * 20), (x * 100)) for x in range(n)) - - def time_frame_from_records_generator(self): - self.df = DataFrame.from_records(self.get_data()) - - def time_frame_from_records_generator_nrows(self): - self.df = DataFrame.from_records(self.get_data(), nrows=1000) - + def time_shift(self, axis): + self.df.shift(1, axis=axis) -#----------------------------------------------------------------------------- -# nunique - -class frame_nunique(object): +class Nunique(object): def setup(self): - self.data = np.random.randn(10000, 1000) - self.df = DataFrame(self.data) + self.df = DataFrame(np.random.randn(10000, 1000)) def time_frame_nunique(self): self.df.nunique() - -#----------------------------------------------------------------------------- -# duplicated - -class frame_duplicated(object): - goal_time = 0.2 +class Duplicated(object): def setup(self): - self.n = (1 << 20) - self.t = date_range('2015-01-01', freq='S', periods=(self.n // 64)) - self.xs = np.random.randn((self.n // 64)).round(2) - self.df = DataFrame({'a': np.random.randint(((-1) << 8), (1 << 8), self.n), 'b': np.random.choice(self.t, self.n), 'c': np.random.choice(self.xs, self.n), }) - - self.df2 = DataFrame(np.random.randn(1000, 100).astype(str)) + n = (1 << 20) + t = date_range('2015-01-01', freq='S', periods=(n // 64)) + xs = np.random.randn(n // 64).round(2) + self.df = DataFrame({'a': np.random.randint(-1 << 8, 1 << 8, n), + 'b': np.random.choice(t, n), + 'c': np.random.choice(xs, n)}) + self.df2 = DataFrame(np.random.randn(1000, 100).astype(str)).T def time_frame_duplicated(self): self.df.duplicated() def time_frame_duplicated_wide(self): - self.df2.T.duplicated() - + 
self.df2.duplicated() +class XS(object): + params = [0, 1] + param_names = ['axis'] + def setup(self, axis): + self.N = 10**4 + self.df = DataFrame(np.random.randn(self.N, self.N)) + def time_frame_xs(self, axis): + self.df.xs(self.N // 2, axis=axis) +class SortValues(object): + params = [True, False] + param_names = ['ascending'] + def setup(self, ascending): + self.df = DataFrame(np.random.randn(1000000, 2), columns=list('AB')) + def time_frame_sort_values(self, ascending): + self.df.sort_values(by='A', ascending=ascending) - - - - -class frame_xs_col(object): - goal_time = 0.2 +class SortIndexByColumns(object): def setup(self): - self.df = DataFrame(randn(1, 100000)) + N = 10000 + K = 10 + self.df = DataFrame({'key1': tm.makeStringIndex(N).values.repeat(K), + 'key2': tm.makeStringIndex(N).values.repeat(K), + 'value': np.random.randn(N * K)}) - def time_frame_xs_col(self): - self.df.xs(50000, axis=1) + def time_frame_sort_values_by_columns(self): + self.df.sort_values(by=['key1', 'key2']) -class frame_xs_row(object): - goal_time = 0.2 +class Quantile(object): - def setup(self): - self.df = DataFrame(randn(100000, 1)) + params = [0, 1] + param_names = ['axis'] - def time_frame_xs_row(self): - self.df.xs(50000) + def setup(self, axis): + self.df = DataFrame(np.random.randn(1000, 3), columns=list('ABC')) + def time_frame_quantile(self, axis): + self.df.quantile([0.1, 0.5], axis=axis) -class frame_sort_index(object): - goal_time = 0.2 +class GetDtypeCounts(object): + # 2807 def setup(self): - self.df = DataFrame(randn(1000000, 2), columns=list('AB')) - - def time_frame_sort_index(self): - self.df.sort_index() + self.df = DataFrame(np.random.randn(10, 10000)) + def time_frame_get_dtype_counts(self): + self.df.get_dtype_counts() -class frame_sort_index_by_columns(object): - goal_time = 0.2 + def time_info(self): + self.df.info() - def setup(self): - self.N = 10000 - self.K = 10 - self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K) - self.key2 =
tm.makeStringIndex(self.N).values.repeat(self.K) - self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), }) - self.col_array_list = list(self.df.values.T) - def time_frame_sort_index_by_columns(self): - self.df.sort_index(by=['key1', 'key2']) +class NSort(object): + params = ['first', 'last', 'all'] + param_names = ['keep'] -class frame_quantile_axis1(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(1000, 3), + def setup(self, keep): + self.df = DataFrame(np.random.randn(100000, 3), columns=list('ABC')) - def time_frame_quantile_axis1(self): - self.df.quantile([0.1, 0.5], axis=1) - - -#---------------------------------------------------------------------- -# boolean indexing + def time_nlargest_one_column(self, keep): + self.df.nlargest(100, 'A', keep=keep) -class frame_boolean_row_select(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(randn(10000, 100)) - self.bool_arr = np.zeros(10000, dtype=bool) - self.bool_arr[:1000] = True - - def time_frame_boolean_row_select(self): - self.df[self.bool_arr] - -class frame_getitem_single_column(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(randn(10000, 1000)) - self.df2 = DataFrame(randn(3000, 1), columns=['A']) - self.df3 = DataFrame(randn(3000, 1)) - - def h(self): - for i in range(10000): - self.df2['A'] - - def j(self): - for i in range(10000): - self.df3[0] - - def time_frame_getitem_single_column(self): - self.h() - - def time_frame_getitem_single_column2(self): - self.j() + def time_nlargest_two_columns(self, keep): + self.df.nlargest(100, ['A', 'B'], keep=keep) + def time_nsmallest_one_column(self, keep): + self.df.nsmallest(100, 'A', keep=keep) -#---------------------------------------------------------------------- -# assignment - -class frame_assign_timeseries_index(object): - goal_time = 0.2 - - def setup(self): - self.idx = date_range('1/1/2000', periods=100000, freq='D') - self.df 
= DataFrame(randn(100000, 1), columns=['A'], index=self.idx) - - def time_frame_assign_timeseries_index(self): - self.f(self.df) - - def f(self, df): - self.x = self.df.copy() - self.x['date'] = self.x.index + def time_nsmallest_two_columns(self, keep): + self.df.nsmallest(100, ['A', 'B'], keep=keep) - -# insert many columns - -class frame_insert_100_columns_begin(object): - goal_time = 0.2 +class Describe(object): def setup(self): - self.N = 1000 - - def f(self, K=100): - self.df = DataFrame(index=range(self.N)) - self.new_col = np.random.randn(self.N) - for i in range(K): - self.df.insert(0, i, self.new_col) - - def g(self, K=500): - self.df = DataFrame(index=range(self.N)) - self.new_col = np.random.randn(self.N) - for i in range(K): - self.df[i] = self.new_col - - def time_frame_insert_100_columns_begin(self): - self.f() + self.df = DataFrame({ + 'a': np.random.randint(0, 100, int(1e6)), + 'b': np.random.randint(0, 100, int(1e6)), + 'c': np.random.randint(0, 100, int(1e6)) + }) - def time_frame_insert_500_columns_end(self): - self.g() + def time_series_describe(self): + self.df['a'].describe() + def time_dataframe_describe(self): + self.df.describe() -#---------------------------------------------------------------------- -# strings methods, #2602 - -class series_string_vector_slice(object): - goal_time = 0.2 - - def setup(self): - self.s = Series((['abcdefg', np.nan] * 500000)) - - def time_series_string_vector_slice(self): - self.s.str[:5] - - -#---------------------------------------------------------------------- -# df.info() and get_dtype_counts() # 2807 - -class frame_get_dtype_counts(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(10, 10000)) - - def time_frame_get_dtype_counts(self): - self.df.get_dtype_counts() - - -class frame_nlargest(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(1000, 3), - columns=list('ABC')) - - def time_frame_nlargest(self): - self.df.nlargest(100, 
'A') +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/gil.py b/asv_bench/benchmarks/gil.py index 1c5e59672cb57..6819a296c81df 100644 --- a/asv_bench/benchmarks/gil.py +++ b/asv_bench/benchmarks/gil.py @@ -1,235 +1,134 @@ -from .pandas_vb_common import * -from pandas.core import common as com - +import numpy as np +import pandas.util.testing as tm +from pandas import DataFrame, Series, read_csv, factorize, date_range +from pandas.core.algorithms import take_1d try: - from cStringIO import StringIO + from pandas import (rolling_median, rolling_mean, rolling_min, rolling_max, + rolling_var, rolling_skew, rolling_kurt, rolling_std) + have_rolling_methods = True except ImportError: - from io import StringIO - + have_rolling_methods = False +try: + from pandas._libs import algos +except ImportError: + from pandas import algos try: from pandas.util.testing import test_parallel - have_real_test_parallel = True except ImportError: have_real_test_parallel = False - def test_parallel(num_threads=1): - def wrapper(fname): return fname - return wrapper +from .pandas_vb_common import BaseIO -class NoGilGroupby(object): - goal_time = 0.2 - def setup(self): - self.N = 1000000 - self.ngroups = 1000 - np.random.seed(1234) - self.df = DataFrame({'key': np.random.randint(0, self.ngroups, size=self.N), 'data': np.random.randn(self.N), }) +class ParallelGroupbyMethods(object): - np.random.seed(1234) - self.size = 2 ** 22 - self.ngroups = 100 - self.data = Series(np.random.randint(0, self.ngroups, size=self.size)) + params = ([2, 4, 8], ['count', 'last', 'max', 'mean', 'min', 'prod', + 'sum', 'var']) + param_names = ['threads', 'method'] - if (not have_real_test_parallel): + def setup(self, threads, method): + if not have_real_test_parallel: raise NotImplementedError + N = 10**6 + ngroups = 10**3 + df = DataFrame({'key': np.random.randint(0, ngroups, size=N), + 'data': np.random.randn(N)}) - @test_parallel(num_threads=2) - def _pg2_count(self): - 
self.df.groupby('key')['data'].count() - - def time_count_2(self): - self._pg2_count() - - @test_parallel(num_threads=2) - def _pg2_last(self): - self.df.groupby('key')['data'].last() - - def time_last_2(self): - self._pg2_last() - - @test_parallel(num_threads=2) - def _pg2_max(self): - self.df.groupby('key')['data'].max() - - def time_max_2(self): - self._pg2_max() - - @test_parallel(num_threads=2) - def _pg2_mean(self): - self.df.groupby('key')['data'].mean() - - def time_mean_2(self): - self._pg2_mean() - - @test_parallel(num_threads=2) - def _pg2_min(self): - self.df.groupby('key')['data'].min() - - def time_min_2(self): - self._pg2_min() - - @test_parallel(num_threads=2) - def _pg2_prod(self): - self.df.groupby('key')['data'].prod() - - def time_prod_2(self): - self._pg2_prod() - - @test_parallel(num_threads=2) - def _pg2_sum(self): - self.df.groupby('key')['data'].sum() - - def time_sum_2(self): - self._pg2_sum() - - @test_parallel(num_threads=4) - def _pg4_sum(self): - self.df.groupby('key')['data'].sum() - - def time_sum_4(self): - self._pg4_sum() - - def time_sum_4_notp(self): - for i in range(4): - self.df.groupby('key')['data'].sum() - - def _f_sum(self): - self.df.groupby('key')['data'].sum() - - @test_parallel(num_threads=8) - def _pg8_sum(self): - self._f_sum() - - def time_sum_8(self): - self._pg8_sum() - - def time_sum_8_notp(self): - for i in range(8): - self._f_sum() - - @test_parallel(num_threads=2) - def _pg2_var(self): - self.df.groupby('key')['data'].var() - - def time_var_2(self): - self._pg2_var() - - # get groups - - def _groups(self): - self.data.groupby(self.data).groups - - @test_parallel(num_threads=2) - def _pg2_groups(self): - self._groups() - - def time_groups_2(self): - self._pg2_groups() - - @test_parallel(num_threads=4) - def _pg4_groups(self): - self._groups() + @test_parallel(num_threads=threads) + def parallel(): + getattr(df.groupby('key')['data'], method)() + self.parallel = parallel - def time_groups_4(self): - 
self._pg4_groups() + def loop(): + getattr(df.groupby('key')['data'], method)() + self.loop = loop - @test_parallel(num_threads=8) - def _pg8_groups(self): - self._groups() + def time_parallel(self, threads, method): + self.parallel() - def time_groups_8(self): - self._pg8_groups() + def time_loop(self, threads, method): + for i in range(threads): + self.loop() +class ParallelGroups(object): -class nogil_take1d_float64(object): - goal_time = 0.2 + params = [2, 4, 8] + param_names = ['threads'] - def setup(self): - self.N = 1000000 - self.ngroups = 1000 - np.random.seed(1234) - self.df = DataFrame({'key': np.random.randint(0, self.ngroups, size=self.N), 'data': np.random.randn(self.N), }) - if (not have_real_test_parallel): + def setup(self, threads): + if not have_real_test_parallel: raise NotImplementedError - self.N = 10000000.0 - self.df = DataFrame({'int64': np.arange(self.N, dtype='int64'), 'float64': np.arange(self.N, dtype='float64'), }) - self.indexer = np.arange(100, (len(self.df) - 100)) + size = 2**22 + ngroups = 10**3 + data = Series(np.random.randint(0, ngroups, size=size)) - def time_nogil_take1d_float64(self): - self.take_1d_pg2_int64() + @test_parallel(num_threads=threads) + def get_groups(): + data.groupby(data).groups + self.get_groups = get_groups - @test_parallel(num_threads=2) - def take_1d_pg2_int64(self): - com.take_1d(self.df.int64.values, self.indexer) + def time_get_groups(self, threads): + self.get_groups() - @test_parallel(num_threads=2) - def take_1d_pg2_float64(self): - com.take_1d(self.df.float64.values, self.indexer) +class ParallelTake1D(object): -class nogil_take1d_int64(object): - goal_time = 0.2 + params = ['int64', 'float64'] + param_names = ['dtype'] - def setup(self): - self.N = 1000000 - self.ngroups = 1000 - np.random.seed(1234) - self.df = DataFrame({'key': np.random.randint(0, self.ngroups, size=self.N), 'data': np.random.randn(self.N), }) - if (not have_real_test_parallel): + def setup(self, dtype): + if not 
have_real_test_parallel: raise NotImplementedError - self.N = 10000000.0 - self.df = DataFrame({'int64': np.arange(self.N, dtype='int64'), 'float64': np.arange(self.N, dtype='float64'), }) - self.indexer = np.arange(100, (len(self.df) - 100)) + N = 10**6 + df = DataFrame({'col': np.arange(N, dtype=dtype)}) + indexer = np.arange(100, len(df) - 100) - def time_nogil_take1d_int64(self): - self.take_1d_pg2_float64() + @test_parallel(num_threads=2) + def parallel_take1d(): + take_1d(df['col'].values, indexer) + self.parallel_take1d = parallel_take1d - @test_parallel(num_threads=2) - def take_1d_pg2_int64(self): - com.take_1d(self.df.int64.values, self.indexer) + def time_take1d(self, dtype): + self.parallel_take1d() - @test_parallel(num_threads=2) - def take_1d_pg2_float64(self): - com.take_1d(self.df.float64.values, self.indexer) +class ParallelKth(object): -class nogil_kth_smallest(object): number = 1 repeat = 5 def setup(self): - if (not have_real_test_parallel): + if not have_real_test_parallel: raise NotImplementedError - np.random.seed(1234) - self.N = 10000000 - self.k = 500000 - self.a = np.random.randn(self.N) - self.b = self.a.copy() - self.kwargs_list = [{'arr': self.a}, {'arr': self.b}] + N = 10**7 + k = 5 * 10**5 + kwargs_list = [{'arr': np.random.randn(N)}, + {'arr': np.random.randn(N)}] - def time_nogil_kth_smallest(self): - @test_parallel(num_threads=2, kwargs_list=self.kwargs_list) - def run(arr): - algos.kth_smallest(arr, self.k) - run() + @test_parallel(num_threads=2, kwargs_list=kwargs_list) + def parallel_kth_smallest(arr): + algos.kth_smallest(arr, k) + self.parallel_kth_smallest = parallel_kth_smallest + def time_kth_smallest(self): + self.parallel_kth_smallest() -class nogil_datetime_fields(object): - goal_time = 0.2 + +class ParallelDatetimeFields(object): def setup(self): - self.N = 100000000 - self.dti = pd.date_range('1900-01-01', periods=self.N, freq='D') - self.period = self.dti.to_period('D') - if (not have_real_test_parallel): + if not 
have_real_test_parallel: raise NotImplementedError + N = 10**6 + self.dti = date_range('1900-01-01', periods=N, freq='T') + self.period = self.dti.to_period('D') def time_datetime_field_year(self): @test_parallel(num_threads=2) @@ -268,149 +167,106 @@ def run(period): run(self.period) -class nogil_rolling_algos_slow(object): - goal_time = 0.2 - - def setup(self): - self.win = 100 - np.random.seed(1234) - self.arr = np.random.rand(100000) - if (not have_real_test_parallel): - raise NotImplementedError - - def time_nogil_rolling_median(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_median(arr, win) - run(self.arr, self.win) - +class ParallelRolling(object): -class nogil_rolling_algos_fast(object): - goal_time = 0.2 + params = ['median', 'mean', 'min', 'max', 'var', 'skew', 'kurt', 'std'] + param_names = ['method'] - def setup(self): - self.win = 100 - np.random.seed(1234) - self.arr = np.random.rand(1000000) - if (not have_real_test_parallel): + def setup(self, method): + if not have_real_test_parallel: + raise NotImplementedError + win = 100 + arr = np.random.rand(100000) + if hasattr(DataFrame, 'rolling'): + df = DataFrame(arr).rolling(win) + + @test_parallel(num_threads=2) + def parallel_rolling(): + getattr(df, method)() + self.parallel_rolling = parallel_rolling + elif have_rolling_methods: + rolling = {'median': rolling_median, + 'mean': rolling_mean, + 'min': rolling_min, + 'max': rolling_max, + 'var': rolling_var, + 'skew': rolling_skew, + 'kurt': rolling_kurt, + 'std': rolling_std} + + @test_parallel(num_threads=2) + def parallel_rolling(): + rolling[method](arr, win) + self.parallel_rolling = parallel_rolling + else: raise NotImplementedError - def time_nogil_rolling_mean(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_mean(arr, win) - run(self.arr, self.win) - - def time_nogil_rolling_min(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_min(arr, win) - run(self.arr, self.win) - - def 
time_nogil_rolling_max(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_max(arr, win) - run(self.arr, self.win) - - def time_nogil_rolling_var(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_var(arr, win) - run(self.arr, self.win) - - def time_nogil_rolling_skew(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_skew(arr, win) - run(self.arr, self.win) - - def time_nogil_rolling_kurt(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_kurt(arr, win) - run(self.arr, self.win) + def time_rolling(self, method): + self.parallel_rolling() - def time_nogil_rolling_std(self): - @test_parallel(num_threads=2) - def run(arr, win): - rolling_std(arr, win) - run(self.arr, self.win) +class ParallelReadCSV(BaseIO): -class nogil_read_csv(object): number = 1 repeat = 5 + params = ['float', 'object', 'datetime'] + param_names = ['dtype'] - def setup(self): - if (not have_real_test_parallel): + def setup(self, dtype): + if not have_real_test_parallel: raise NotImplementedError - # Using the values - self.df = DataFrame(np.random.randn(10000, 50)) - self.df.to_csv('__test__.csv') + rows = 10000 + cols = 50 + data = {'float': DataFrame(np.random.randn(rows, cols)), + 'datetime': DataFrame(np.random.randn(rows, cols), + index=date_range('1/1/2000', + periods=rows)), + 'object': DataFrame('foo', + index=range(rows), + columns=['object{:03d}'.format(i) + for i in range(5)])} + + self.fname = '__test_{}__.csv'.format(dtype) + df = data[dtype] + df.to_csv(self.fname) - self.rng = date_range('1/1/2000', periods=10000) - self.df_date_time = DataFrame(np.random.randn(10000, 50), index=self.rng) - self.df_date_time.to_csv('__test_datetime__.csv') - - self.df_object = DataFrame('foo', index=self.df.index, columns=self.create_cols('object')) - self.df_object.to_csv('__test_object__.csv') - - def create_cols(self, name): - return [('%s%03d' % (name, i)) for i in range(5)] - - @test_parallel(num_threads=2) - def
pg_read_csv(self): - read_csv('__test__.csv', sep=',', header=None, float_precision=None) - - def time_read_csv(self): - self.pg_read_csv() - - @test_parallel(num_threads=2) - def pg_read_csv_object(self): - read_csv('__test_object__.csv', sep=',') - - def time_read_csv_object(self): - self.pg_read_csv_object() + @test_parallel(num_threads=2) + def parallel_read_csv(): + read_csv(self.fname) + self.parallel_read_csv = parallel_read_csv - @test_parallel(num_threads=2) - def pg_read_csv_datetime(self): - read_csv('__test_datetime__.csv', sep=',', header=None) + def time_read_csv(self, dtype): + self.parallel_read_csv() - def time_read_csv_datetime(self): - self.pg_read_csv_datetime() +class ParallelFactorize(object): -class nogil_factorize(object): number = 1 repeat = 5 + params = [2, 4, 8] + param_names = ['threads'] - def setup(self): - if (not have_real_test_parallel): + def setup(self, threads): + if not have_real_test_parallel: raise NotImplementedError - np.random.seed(1234) - self.strings = tm.makeStringIndex(100000) + strings = tm.makeStringIndex(100000) - def factorize_strings(self): - pd.factorize(self.strings) + @test_parallel(num_threads=threads) + def parallel(): + factorize(strings) + self.parallel = parallel - @test_parallel(num_threads=4) - def _pg_factorize_strings_4(self): - self.factorize_strings() + def loop(): + factorize(strings) + self.loop = loop - def time_factorize_strings_4(self): - for i in range(2): - self._pg_factorize_strings_4() + def time_parallel(self, threads): + self.parallel() - @test_parallel(num_threads=2) - def _pg_factorize_strings_2(self): - self.factorize_strings() + def time_loop(self, threads): + for i in range(threads): + self.loop() - def time_factorize_strings_2(self): - for i in range(4): - self._pg_factorize_strings_2() - def time_factorize_strings(self): - for i in range(8): - self.factorize_strings() +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/groupby.py 
b/asv_bench/benchmarks/groupby.py index b8d8e8b7912d7..27d279bb90a31 100644 --- a/asv_bench/benchmarks/groupby.py +++ b/asv_bench/benchmarks/groupby.py @@ -1,129 +1,54 @@ -from .pandas_vb_common import * -from string import ascii_letters, digits +from functools import partial from itertools import product +from string import ascii_letters +import warnings +import numpy as np -class groupby_agg_builtins(object): - goal_time = 0.2 +from pandas import ( + Categorical, DataFrame, MultiIndex, Series, TimeGrouper, Timestamp, + date_range, period_range) +import pandas.util.testing as tm - def setup(self): - np.random.seed(27182) - self.n = 100000 - self.df = DataFrame(np.random.randint(1, (self.n / 100), (self.n, 3)), columns=['jim', 'joe', 'jolie']) - - def time_groupby_agg_builtins1(self): - self.df.groupby('jim').agg([sum, min, max]) - - def time_groupby_agg_builtins2(self): - self.df.groupby(['jim', 'joe']).agg([sum, min, max]) -#---------------------------------------------------------------------- -# dict return values +method_blacklist = { + 'object': {'median', 'prod', 'sem', 'cumsum', 'sum', 'cummin', 'mean', + 'max', 'skew', 'cumprod', 'cummax', 'rank', 'pct_change', 'min', + 'var', 'mad', 'describe', 'std', 'quantile'}, + 'datetime': {'median', 'prod', 'sem', 'cumsum', 'sum', 'mean', 'skew', + 'cumprod', 'cummax', 'pct_change', 'var', 'mad', 'describe', + 'std'} +} -class groupby_apply_dict_return(object): - goal_time = 0.2 +class ApplyDictReturn(object): def setup(self): self.labels = np.arange(1000).repeat(10) - self.data = Series(randn(len(self.labels))) - self.f = (lambda x: {'first': x.values[0], 'last': x.values[(-1)], }) + self.data = Series(np.random.randn(len(self.labels))) def time_groupby_apply_dict_return(self): - self.data.groupby(self.labels).apply(self.f) - - -#---------------------------------------------------------------------- -# groups - -class Groups(object): - goal_time = 0.1 - - size = 2 ** 22 - data = { - 'int64_small': 
Series(np.random.randint(0, 100, size=size)), - 'int64_large' : Series(np.random.randint(0, 10000, size=size)), - 'object_small': Series(tm.makeStringIndex(100).take(np.random.randint(0, 100, size=size))), - 'object_large': Series(tm.makeStringIndex(10000).take(np.random.randint(0, 10000, size=size))) - } - - param_names = ['df'] - params = ['int64_small', 'int64_large', 'object_small', 'object_large'] - - def setup(self, df): - self.df = self.data[df] - - def time_groupby_groups(self, df): - self.df.groupby(self.df).groups - - -#---------------------------------------------------------------------- -# First / last functions - -class FirstLast(object): - goal_time = 0.2 - - param_names = ['dtype'] - params = ['float32', 'float64', 'datetime', 'object'] - - # with datetimes (GH7555) + self.data.groupby(self.labels).apply(lambda x: {'first': x.values[0], + 'last': x.values[-1]}) - def setup(self, dtype): - - if dtype == 'datetime': - self.df = DataFrame( - {'values': date_range('1/1/2011', periods=100000, freq='s'), - 'key': range(100000),}) - elif dtype == 'object': - self.df = DataFrame( - {'values': (['foo'] * 100000), - 'key': range(100000)}) - else: - labels = np.arange(10000).repeat(10) - data = Series(randn(len(labels)), dtype=dtype) - data[::3] = np.nan - data[1::3] = np.nan - labels = labels.take(np.random.permutation(len(labels))) - self.df = DataFrame({'values': data, 'key': labels}) - def time_groupby_first(self, dtype): - self.df.groupby('key').first() - - def time_groupby_last(self, dtype): - self.df.groupby('key').last() - - def time_groupby_nth_any(self, dtype): - self.df.groupby('key').nth(0, dropna='all') - - def time_groupby_nth_none(self, dtype): - self.df.groupby('key').nth(0) +class Apply(object): + def setup_cache(self): + N = 10**4 + labels = np.random.randint(0, 2000, size=N) + labels2 = np.random.randint(0, 3, size=N) + df = DataFrame({'key': labels, + 'key2': labels2, + 'value1': np.random.randn(N), + 'value2': ['foo', 'bar', 'baz', 'qux'] 
* (N // 4) + }) + return df -#---------------------------------------------------------------------- -# DataFrame Apply overhead + def time_scalar_function_multi_col(self, df): + df.groupby(['key', 'key2']).apply(lambda x: 1) -class groupby_frame_apply(object): - goal_time = 0.2 - - def setup(self): - self.N = 10000 - self.labels = np.random.randint(0, 2000, size=self.N) - self.labels2 = np.random.randint(0, 3, size=self.N) - self.df = DataFrame({ - 'key': self.labels, - 'key2': self.labels2, - 'value1': np.random.randn(self.N), - 'value2': (['foo', 'bar', 'baz', 'qux'] * (self.N // 4)), - }) - - @staticmethod - def scalar_function(g): - return 1 - - def time_groupby_frame_apply_scalar_function(self): - self.df.groupby(['key', 'key2']).apply(self.scalar_function) - - def time_groupby_frame_apply_scalar_function_overhead(self): - self.df.groupby('key').apply(self.scalar_function) + def time_scalar_function_single_col(self, df): + df.groupby('key').apply(lambda x: 1) @staticmethod def df_copy_function(g): @@ -131,374 +56,329 @@ def df_copy_function(g): g.name return g.copy() - def time_groupby_frame_df_copy_function(self): - self.df.groupby(['key', 'key2']).apply(self.df_copy_function) - - def time_groupby_frame_apply_df_copy_overhead(self): - self.df.groupby('key').apply(self.df_copy_function) - - -#---------------------------------------------------------------------- -# 2d grouping, aggregate many columns + def time_copy_function_multi_col(self, df): + df.groupby(['key', 'key2']).apply(self.df_copy_function) -class groupby_frame_cython_many_columns(object): - goal_time = 0.2 - - def setup(self): - self.labels = np.random.randint(0, 100, size=1000) - self.df = DataFrame(randn(1000, 1000)) - - def time_sum(self): - self.df.groupby(self.labels).sum() + def time_copy_overhead_single_col(self, df): + df.groupby('key').apply(self.df_copy_function) -#---------------------------------------------------------------------- -# single key, long, integer key - -class 
groupby_frame_singlekey_integer(object): - goal_time = 0.2 - - def setup(self): - self.data = np.random.randn(100000, 1) - self.labels = np.random.randint(0, 1000, size=100000) - self.df = DataFrame(self.data) - - def time_sum(self): - self.df.groupby(self.labels).sum() - - -#---------------------------------------------------------------------- -# DataFrame nth - -class groupby_nth(object): - goal_time = 0.2 +class Groups(object): - def setup(self): - self.df = DataFrame(np.random.randint(1, 100, (10000, 2))) + param_names = ['key'] + params = ['int64_small', 'int64_large', 'object_small', 'object_large'] - def time_groupby_frame_nth_any(self): - self.df.groupby(0).nth(0, dropna='any') + def setup_cache(self): + size = 10**6 + data = {'int64_small': Series(np.random.randint(0, 100, size=size)), + 'int64_large': Series(np.random.randint(0, 10000, size=size)), + 'object_small': Series( + tm.makeStringIndex(100).take( + np.random.randint(0, 100, size=size))), + 'object_large': Series( + tm.makeStringIndex(10000).take( + np.random.randint(0, 10000, size=size)))} + return data - def time_groupby_frame_nth_none(self): - self.df.groupby(0).nth(0) + def setup(self, data, key): + self.ser = data[key] - def time_groupby_series_nth_any(self): - self.df[1].groupby(self.df[0]).nth(0, dropna='any') + def time_series_groups(self, data, key): + self.ser.groupby(self.ser).groups - def time_groupby_series_nth_none(self): - self.df[1].groupby(self.df[0]).nth(0) +class GroupManyLabels(object): -#---------------------------------------------------------------------- -# groupby_indices replacement, chop up Series + params = [1, 1000] + param_names = ['ncols'] -class groupby_indices(object): - goal_time = 0.2 + def setup(self, ncols): + N = 1000 + data = np.random.randn(N, ncols) + self.labels = np.random.randint(0, 100, size=N) + self.df = DataFrame(data) - def setup(self): - try: - self.rng = date_range('1/1/2000', '12/31/2005', freq='H') - (self.year, self.month, self.day) = 
(self.rng.year, self.rng.month, self.rng.day) - except: - self.rng = date_range('1/1/2000', '12/31/2000', offset=datetools.Hour()) - self.year = self.rng.map((lambda x: x.year)) - self.month = self.rng.map((lambda x: x.month)) - self.day = self.rng.map((lambda x: x.day)) - self.ts = Series(np.random.randn(len(self.rng)), index=self.rng) - - def time_groupby_indices(self): - len(self.ts.groupby([self.year, self.month, self.day])) + def time_sum(self, ncols): + self.df.groupby(self.labels).sum() -class groupby_int64_overflow(object): - goal_time = 0.2 +class Nth(object): - def setup(self): - self.arr = np.random.randint(((-1) << 12), (1 << 12), ((1 << 17), 5)) - self.i = np.random.choice(len(self.arr), (len(self.arr) * 5)) - self.arr = np.vstack((self.arr, self.arr[self.i])) - self.i = np.random.permutation(len(self.arr)) - self.arr = self.arr[self.i] - self.df = DataFrame(self.arr, columns=list('abcde')) - (self.df['jim'], self.df['joe']) = (np.random.randn(2, len(self.df)) * 10) + param_names = ['dtype'] + params = ['float32', 'float64', 'datetime', 'object'] - def time_groupby_int64_overflow(self): - self.df.groupby(list('abcde')).max() + def setup(self, dtype): + N = 10**5 + # with datetimes (GH7555) + if dtype == 'datetime': + values = date_range('1/1/2011', periods=N, freq='s') + elif dtype == 'object': + values = ['foo'] * N + else: + values = np.arange(N).astype(dtype) + key = np.arange(N) + self.df = DataFrame({'key': key, 'values': values}) + self.df.iloc[1, 1] = np.nan # insert missing data -#---------------------------------------------------------------------- -# count() speed + def time_frame_nth_any(self, dtype): + self.df.groupby('key').nth(0, dropna='any') -class groupby_multi_count(object): - goal_time = 0.2 + def time_groupby_nth_all(self, dtype): + self.df.groupby('key').nth(0, dropna='all') - def setup(self): - self.n = 10000 - self.offsets = np.random.randint(self.n, size=self.n).astype('timedelta64[ns]') - self.dates = (np.datetime64('now') + 
self.offsets) - self.dates[(np.random.rand(self.n) > 0.5)] = np.datetime64('nat') - self.offsets[(np.random.rand(self.n) > 0.5)] = np.timedelta64('nat') - self.value2 = np.random.randn(self.n) - self.value2[(np.random.rand(self.n) > 0.5)] = np.nan - self.obj = np.random.choice(list('ab'), size=self.n).astype(object) - self.obj[(np.random.randn(self.n) > 0.5)] = np.nan - self.df = DataFrame({'key1': np.random.randint(0, 500, size=self.n), - 'key2': np.random.randint(0, 100, size=self.n), - 'dates': self.dates, - 'value2': self.value2, - 'value3': np.random.randn(self.n), - 'ints': np.random.randint(0, 1000, size=self.n), - 'obj': self.obj, - 'offsets': self.offsets, }) - - def time_groupby_multi_count(self): - self.df.groupby(['key1', 'key2']).count() - - -class groupby_int_count(object): - goal_time = 0.2 + def time_frame_nth(self, dtype): + self.df.groupby('key').nth(0) - def setup(self): - self.n = 10000 - self.df = DataFrame({'key1': randint(0, 500, size=self.n), - 'key2': randint(0, 100, size=self.n), - 'ints': randint(0, 1000, size=self.n), - 'ints2': randint(0, 1000, size=self.n), }) + def time_series_nth_any(self, dtype): + self.df['values'].groupby(self.df['key']).nth(0, dropna='any') - def time_groupby_int_count(self): - self.df.groupby(['key1', 'key2']).count() + def time_series_nth_all(self, dtype): + self.df['values'].groupby(self.df['key']).nth(0, dropna='all') + def time_series_nth(self, dtype): + self.df['values'].groupby(self.df['key']).nth(0) -#---------------------------------------------------------------------- -# nunique() speed -class groupby_nunique(object): +class DateAttributes(object): def setup(self): - self.n = 10000 - self.df = DataFrame({'key1': randint(0, 500, size=self.n), - 'key2': randint(0, 100, size=self.n), - 'ints': randint(0, 1000, size=self.n), - 'ints2': randint(0, 1000, size=self.n), }) - - def time_groupby_nunique(self): - self.df.groupby(['key1', 'key2']).nunique() + rng = date_range('1/1/2000', '12/31/2005', freq='H') + 
self.year, self.month, self.day = rng.year, rng.month, rng.day + self.ts = Series(np.random.randn(len(rng)), index=rng) + def time_len_groupby_object(self): + len(self.ts.groupby([self.year, self.month, self.day])) -#---------------------------------------------------------------------- -# group with different functions per column -class groupby_agg_multi(object): - goal_time = 0.2 +class Int64(object): def setup(self): - self.fac1 = np.array(['A', 'B', 'C'], dtype='O') - self.fac2 = np.array(['one', 'two'], dtype='O') - self.df = DataFrame({'key1': self.fac1.take(np.random.randint(0, 3, size=100000)), 'key2': self.fac2.take(np.random.randint(0, 2, size=100000)), 'value1': np.random.randn(100000), 'value2': np.random.randn(100000), 'value3': np.random.randn(100000), }) - - def time_groupby_multi_different_functions(self): - self.df.groupby(['key1', 'key2']).agg({'value1': 'mean', 'value2': 'var', 'value3': 'sum'}) - - def time_groupby_multi_different_numpy_functions(self): - self.df.groupby(['key1', 'key2']).agg({'value1': np.mean, 'value2': np.var, 'value3': np.sum}) - - -class groupby_multi_index(object): - goal_time = 0.2 + arr = np.random.randint(-1 << 12, 1 << 12, (1 << 17, 5)) + i = np.random.choice(len(arr), len(arr) * 5) + arr = np.vstack((arr, arr[i])) + i = np.random.permutation(len(arr)) + arr = arr[i] + self.cols = list('abcde') + self.df = DataFrame(arr, columns=self.cols) + self.df['jim'], self.df['joe'] = np.random.randn(2, len(self.df)) * 10 + + def time_overflow(self): + self.df.groupby(self.cols).max() + + +class CountMultiDtype(object): + + def setup_cache(self): + n = 10000 + offsets = np.random.randint(n, size=n).astype('timedelta64[ns]') + dates = np.datetime64('now') + offsets + dates[np.random.rand(n) > 0.5] = np.datetime64('nat') + offsets[np.random.rand(n) > 0.5] = np.timedelta64('nat') + value2 = np.random.randn(n) + value2[np.random.rand(n) > 0.5] = np.nan + obj = np.random.choice(list('ab'), size=n).astype(object) + 
obj[np.random.randn(n) > 0.5] = np.nan + df = DataFrame({'key1': np.random.randint(0, 500, size=n), + 'key2': np.random.randint(0, 100, size=n), + 'dates': dates, + 'value2': value2, + 'value3': np.random.randn(n), + 'ints': np.random.randint(0, 1000, size=n), + 'obj': obj, + 'offsets': offsets}) + return df + + def time_multi_count(self, df): + df.groupby(['key1', 'key2']).count() + + +class CountMultiInt(object): + + def setup_cache(self): + n = 10000 + df = DataFrame({'key1': np.random.randint(0, 500, size=n), + 'key2': np.random.randint(0, 100, size=n), + 'ints': np.random.randint(0, 1000, size=n), + 'ints2': np.random.randint(0, 1000, size=n)}) + return df + + def time_multi_int_count(self, df): + df.groupby(['key1', 'key2']).count() + + def time_multi_int_nunique(self, df): + df.groupby(['key1', 'key2']).nunique() + + +class AggFunctions(object): + + def setup_cache(self): + N = 10**5 + fac1 = np.array(['A', 'B', 'C'], dtype='O') + fac2 = np.array(['one', 'two'], dtype='O') + df = DataFrame({'key1': fac1.take(np.random.randint(0, 3, size=N)), + 'key2': fac2.take(np.random.randint(0, 2, size=N)), + 'value1': np.random.randn(N), + 'value2': np.random.randn(N), + 'value3': np.random.randn(N)}) + return df + + def time_different_str_functions(self, df): + df.groupby(['key1', 'key2']).agg({'value1': 'mean', + 'value2': 'var', + 'value3': 'sum'}) + + def time_different_numpy_functions(self, df): + df.groupby(['key1', 'key2']).agg({'value1': np.mean, + 'value2': np.var, + 'value3': np.sum}) + + def time_different_python_functions_multicol(self, df): + df.groupby(['key1', 'key2']).agg([sum, min, max]) + + def time_different_python_functions_singlecol(self, df): + df.groupby('key1').agg([sum, min, max]) + + +class GroupStrings(object): def setup(self): - self.n = (((5 * 7) * 11) * (1 << 9)) - self.alpha = list(map(''.join, product((ascii_letters + digits), repeat=4))) - self.f = (lambda k: np.repeat(np.random.choice(self.alpha, (self.n // k)), k)) - self.df = 
DataFrame({'a': self.f(11), 'b': self.f(7), 'c': self.f(5), 'd': self.f(1), }) + n = 2 * 10**5 + alpha = list(map(''.join, product(ascii_letters, repeat=4))) + data = np.random.choice(alpha, (n // 5, 4), replace=False) + data = np.repeat(data, 5, axis=0) + self.df = DataFrame(data, columns=list('abcd')) self.df['joe'] = (np.random.randn(len(self.df)) * 10).round(3) - self.i = np.random.permutation(len(self.df)) - self.df = self.df.iloc[self.i].reset_index(drop=True).copy() + self.df = self.df.sample(frac=1).reset_index(drop=True) - def time_groupby_multi_index(self): + def time_multi_columns(self): self.df.groupby(list('abcd')).max() -class groupby_multi(object): - goal_time = 0.2 - - def setup(self): - self.N = 100000 - self.ngroups = 100 - self.df = DataFrame({'key1': self.get_test_data(ngroups=self.ngroups), 'key2': self.get_test_data(ngroups=self.ngroups), 'data1': np.random.randn(self.N), 'data2': np.random.randn(self.N), }) - self.simple_series = Series(np.random.randn(self.N)) - self.key1 = self.df['key1'] - - def get_test_data(self, ngroups=100, n=100000): - self.unique_groups = range(self.ngroups) - self.arr = np.asarray(np.tile(self.unique_groups, (n / self.ngroups)), dtype=object) - if (len(self.arr) < n): - self.arr = np.asarray((list(self.arr) + self.unique_groups[:(n - len(self.arr))]), dtype=object) - random.shuffle(self.arr) - return self.arr - - def f(self): - self.df.groupby(['key1', 'key2']).agg((lambda x: x.values.sum())) +class MultiColumn(object): - def time_groupby_multi_cython(self): - self.df.groupby(['key1', 'key2']).sum() + def setup_cache(self): + N = 10**5 + key1 = np.tile(np.arange(100, dtype=object), 1000) + key2 = key1.copy() + np.random.shuffle(key1) + np.random.shuffle(key2) + df = DataFrame({'key1': key1, + 'key2': key2, + 'data1': np.random.randn(N), + 'data2': np.random.randn(N)}) + return df - def time_groupby_multi_python(self): - self.df.groupby(['key1', 'key2'])['data1'].agg((lambda x: x.values.sum())) + def 
time_lambda_sum(self, df): + df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum()) - def time_groupby_multi_series_op(self): - self.df.groupby(['key1', 'key2'])['data1'].agg(np.std) + def time_cython_sum(self, df): + df.groupby(['key1', 'key2']).sum() - def time_groupby_series_simple_cython(self): - self.simple_series.groupby(self.key1).sum() + def time_col_select_lambda_sum(self, df): + df.groupby(['key1', 'key2'])['data1'].agg(lambda x: x.values.sum()) - def time_groupby_series_simple_rank(self): - self.df.groupby('key1').rank(pct=True) + def time_col_select_numpy_sum(self, df): + df.groupby(['key1', 'key2'])['data1'].agg(np.sum) -#---------------------------------------------------------------------- -# size() speed - -class groupby_size(object): - goal_time = 0.2 +class Size(object): def setup(self): - self.n = 100000 - self.offsets = np.random.randint(self.n, size=self.n).astype('timedelta64[ns]') - self.dates = (np.datetime64('now') + self.offsets) - self.df = DataFrame({'key1': np.random.randint(0, 500, size=self.n), 'key2': np.random.randint(0, 100, size=self.n), 'value1': np.random.randn(self.n), 'value2': np.random.randn(self.n), 'value3': np.random.randn(self.n), 'dates': self.dates, }) - - def time_groupby_multi_size(self): + n = 10**5 + offsets = np.random.randint(n, size=n).astype('timedelta64[ns]') + dates = np.datetime64('now') + offsets + self.df = DataFrame({'key1': np.random.randint(0, 500, size=n), + 'key2': np.random.randint(0, 100, size=n), + 'value1': np.random.randn(n), + 'value2': np.random.randn(n), + 'value3': np.random.randn(n), + 'dates': dates}) + self.draws = Series(np.random.randn(n)) + labels = Series(['foo', 'bar', 'baz', 'qux'] * (n // 4)) + self.cats = labels.astype('category') + + def time_multi_size(self): self.df.groupby(['key1', 'key2']).size() - def time_groupby_dt_size(self): - self.df.groupby(['dates']).size() + def time_dt_timegrouper_size(self): + with warnings.catch_warnings(record=True): + 
self.df.groupby(TimeGrouper(key='dates', freq='M')).size() - def time_groupby_dt_timegrouper_size(self): - self.df.groupby(TimeGrouper(key='dates', freq='M')).size() + def time_category_size(self): + self.draws.groupby(self.cats).size() -#---------------------------------------------------------------------- -# groupby with a variable value for ngroups +class GroupByMethods(object): -class GroupBySuite(object): - goal_time = 0.2 + param_names = ['dtype', 'method', 'application'] + params = [['int', 'float', 'object', 'datetime'], + ['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin', + 'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head', + 'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique', + 'pct_change', 'prod', 'quantile', 'rank', 'sem', 'shift', + 'size', 'skew', 'std', 'sum', 'tail', 'unique', 'value_counts', + 'var'], + ['direct', 'transformation']] - param_names = ['dtype', 'ngroups'] - params = [['int', 'float'], [100, 10000]] - - def setup(self, dtype, ngroups): - np.random.seed(1234) + def setup(self, dtype, method, application): + if method in method_blacklist.get(dtype, {}): + raise NotImplementedError # skip benchmark + ngroups = 1000 size = ngroups * 2 rng = np.arange(ngroups) values = rng.take(np.random.randint(0, ngroups, size=size)) if dtype == 'int': key = np.random.randint(0, size, size=size) - else: + elif dtype == 'float': key = np.concatenate([np.random.random(ngroups) * 0.1, np.random.random(ngroups) * 10.0]) + elif dtype == 'object': + key = ['foo'] * size + elif dtype == 'datetime': + key = date_range('1/1/2011', periods=size, freq='s') - self.df = DataFrame({'values': values, - 'key': key}) - - def time_all(self, dtype, ngroups): - self.df.groupby('key')['values'].all() - - def time_any(self, dtype, ngroups): - self.df.groupby('key')['values'].any() - - def time_count(self, dtype, ngroups): - self.df.groupby('key')['values'].count() - - def time_cumcount(self, dtype, ngroups): - 
self.df.groupby('key')['values'].cumcount() - - def time_cummax(self, dtype, ngroups): - self.df.groupby('key')['values'].cummax() - - def time_cummin(self, dtype, ngroups): - self.df.groupby('key')['values'].cummin() - - def time_cumprod(self, dtype, ngroups): - self.df.groupby('key')['values'].cumprod() - - def time_cumsum(self, dtype, ngroups): - self.df.groupby('key')['values'].cumsum() - - def time_describe(self, dtype, ngroups): - self.df.groupby('key')['values'].describe() - - def time_diff(self, dtype, ngroups): - self.df.groupby('key')['values'].diff() - - def time_first(self, dtype, ngroups): - self.df.groupby('key')['values'].first() - - def time_head(self, dtype, ngroups): - self.df.groupby('key')['values'].head() - - def time_last(self, dtype, ngroups): - self.df.groupby('key')['values'].last() - - def time_mad(self, dtype, ngroups): - self.df.groupby('key')['values'].mad() - - def time_max(self, dtype, ngroups): - self.df.groupby('key')['values'].max() - - def time_mean(self, dtype, ngroups): - self.df.groupby('key')['values'].mean() - - def time_median(self, dtype, ngroups): - self.df.groupby('key')['values'].median() - - def time_min(self, dtype, ngroups): - self.df.groupby('key')['values'].min() - - def time_nunique(self, dtype, ngroups): - self.df.groupby('key')['values'].nunique() - - def time_pct_change(self, dtype, ngroups): - self.df.groupby('key')['values'].pct_change() - - def time_prod(self, dtype, ngroups): - self.df.groupby('key')['values'].prod() - - def time_rank(self, dtype, ngroups): - self.df.groupby('key')['values'].rank() - - def time_sem(self, dtype, ngroups): - self.df.groupby('key')['values'].sem() + df = DataFrame({'values': values, 'key': key}) - def time_size(self, dtype, ngroups): - self.df.groupby('key')['values'].size() + if application == 'transformation': + if method == 'describe': + raise NotImplementedError - def time_skew(self, dtype, ngroups): - self.df.groupby('key')['values'].skew() + self.as_group_method = lambda:
df.groupby( + 'key')['values'].transform(method) + self.as_field_method = lambda: df.groupby( + 'values')['key'].transform(method) + else: + self.as_group_method = getattr(df.groupby('key')['values'], method) + self.as_field_method = getattr(df.groupby('values')['key'], method) - def time_std(self, dtype, ngroups): - self.df.groupby('key')['values'].std() + def time_dtype_as_group(self, dtype, method, application): + self.as_group_method() - def time_sum(self, dtype, ngroups): - self.df.groupby('key')['values'].sum() + def time_dtype_as_field(self, dtype, method, application): + self.as_field_method() - def time_tail(self, dtype, ngroups): - self.df.groupby('key')['values'].tail() - def time_unique(self, dtype, ngroups): - self.df.groupby('key')['values'].unique() +class RankWithTies(object): + # GH 21237 + param_names = ['dtype', 'tie_method'] + params = [['float64', 'float32', 'int64', 'datetime64'], + ['first', 'average', 'dense', 'min', 'max']] - def time_value_counts(self, dtype, ngroups): - self.df.groupby('key')['values'].value_counts() + def setup(self, dtype, tie_method): + N = 10**4 + if dtype == 'datetime64': + data = np.array([Timestamp("2011/01/01")] * N, dtype=dtype) + else: + data = np.array([1] * N, dtype=dtype) + self.df = DataFrame({'values': data, 'key': ['foo'] * N}) - def time_var(self, dtype, ngroups): - self.df.groupby('key')['values'].var() + def time_rank_ties(self, dtype, tie_method): + self.df.groupby('key').rank(method=tie_method) -class groupby_float32(object): +class Float32(object): # GH 13335 - goal_time = 0.2 - def setup(self): tmp1 = (np.random.random(10000) * 0.1).astype(np.float32) tmp2 = (np.random.random(10000) * 10.0).astype(np.float32) @@ -506,27 +386,26 @@ def setup(self): arr = np.repeat(tmp, 10) self.df = DataFrame(dict(a=arr, b=arr)) - def time_groupby_sum(self): + def time_sum(self): self.df.groupby(['a'])['b'].sum() -class groupby_categorical(object): - goal_time = 0.2 +class Categories(object): def setup(self): - N = 
100000 + N = 10**5 arr = np.random.random(N) - - self.df = DataFrame(dict( - a=Categorical(np.random.randint(10000, size=N)), - b=arr)) - self.df_ordered = DataFrame(dict( - a=Categorical(np.random.randint(10000, size=N), ordered=True), - b=arr)) - self.df_extra_cat = DataFrame(dict( - a=Categorical(np.random.randint(100, size=N), - categories=np.arange(10000)), - b=arr)) + data = {'a': Categorical(np.random.randint(10000, size=N)), + 'b': arr} + self.df = DataFrame(data) + data = {'a': Categorical(np.random.randint(10000, size=N), + ordered=True), + 'b': arr} + self.df_ordered = DataFrame(data) + data = {'a': Categorical(np.random.randint(100, size=N), + categories=np.arange(10000)), + 'b': arr} + self.df_extra_cat = DataFrame(data) def time_groupby_sort(self): self.df.groupby('a')['b'].count() @@ -547,130 +426,64 @@ def time_groupby_extra_cat_nosort(self): self.df_extra_cat.groupby('a', sort=False)['b'].count() -class groupby_period(object): +class Datelike(object): # GH 14338 - goal_time = 0.2 - - def make_grouper(self, N): - return pd.period_range('1900-01-01', freq='D', periods=N) - - def setup(self): - N = 10000 - self.grouper = self.make_grouper(N) - self.df = pd.DataFrame(np.random.randn(N, 2)) - - def time_groupby_sum(self): + params = ['period_range', 'date_range', 'date_range_tz'] + param_names = ['grouper'] + + def setup(self, grouper): + N = 10**4 + rng_map = {'period_range': period_range, + 'date_range': date_range, + 'date_range_tz': partial(date_range, tz='US/Central')} + self.grouper = rng_map[grouper]('1900-01-01', freq='D', periods=N) + self.df = DataFrame(np.random.randn(10**4, 2)) + + def time_sum(self, grouper): self.df.groupby(self.grouper).sum() -class groupby_datetime(groupby_period): - def make_grouper(self, N): - return pd.date_range('1900-01-01', freq='D', periods=N) - - -class groupby_datetimetz(groupby_period): - def make_grouper(self, N): - return pd.date_range('1900-01-01', freq='D', periods=N, - tz='US/Central') - 
-#---------------------------------------------------------------------- -# Series.value_counts - -class series_value_counts(object): - goal_time = 0.2 - - def setup(self): - self.s = Series(np.random.randint(0, 1000, size=100000)) - self.s2 = self.s.astype(float) - - self.K = 1000 - self.N = 100000 - self.uniques = tm.makeStringIndex(self.K).values - self.s3 = Series(np.tile(self.uniques, (self.N // self.K))) - - def time_value_counts_int64(self): - self.s.value_counts() - - def time_value_counts_float64(self): - self.s2.value_counts() - - def time_value_counts_strings(self): - self.s.value_counts() - - -#---------------------------------------------------------------------- -# pivot_table - -class groupby_pivot_table(object): - goal_time = 0.2 - - def setup(self): - self.fac1 = np.array(['A', 'B', 'C'], dtype='O') - self.fac2 = np.array(['one', 'two'], dtype='O') - self.ind1 = np.random.randint(0, 3, size=100000) - self.ind2 = np.random.randint(0, 2, size=100000) - self.df = DataFrame({'key1': self.fac1.take(self.ind1), 'key2': self.fac2.take(self.ind2), 'key3': self.fac2.take(self.ind2), 'value1': np.random.randn(100000), 'value2': np.random.randn(100000), 'value3': np.random.randn(100000), }) - - def time_groupby_pivot_table(self): - self.df.pivot_table(index='key1', columns=['key2', 'key3']) - - -#---------------------------------------------------------------------- -# Sum booleans #2692 - -class groupby_sum_booleans(object): - goal_time = 0.2 - +class SumBools(object): + # GH 2692 def setup(self): - self.N = 500 - self.df = DataFrame({'ii': range(self.N), 'bb': [True for x in range(self.N)], }) + N = 500 + self.df = DataFrame({'ii': range(N), + 'bb': [True] * N}) def time_groupby_sum_booleans(self): self.df.groupby('ii').sum() -#---------------------------------------------------------------------- -# multi-indexed group sum #9049 - -class groupby_sum_multiindex(object): - goal_time = 0.2 +class SumMultiLevel(object): + # GH 9049 + timeout = 120.0 def 
setup(self): - self.N = 50 - self.df = DataFrame({'A': (list(range(self.N)) * 2), 'B': list(range((self.N * 2))), 'C': 1, }).set_index(['A', 'B']) + N = 50 + self.df = DataFrame({'A': list(range(N)) * 2, + 'B': range(N * 2), + 'C': 1}).set_index(['A', 'B']) def time_groupby_sum_multiindex(self): self.df.groupby(level=[0, 1]).sum() -#------------------------------------------------------------------------------- -# Transform testing - class Transform(object): - goal_time = 0.2 def setup(self): n1 = 400 n2 = 250 - - index = MultiIndex( - levels=[np.arange(n1), pd.util.testing.makeStringIndex(n2)], - labels=[[i for i in range(n1) for _ in range(n2)], - (list(range(n2)) * n1)], - names=['lev1', 'lev2']) - - data = DataFrame(np.random.randn(n1 * n2, 3), - index=index, columns=['col1', 'col20', 'col3']) - step = int((n1 * n2 * 0.1)) - for col in range(len(data.columns)): - idx = col - while (idx < len(data)): - data.set_value(data.index[idx], data.columns[col], np.nan) - idx += step + index = MultiIndex(levels=[np.arange(n1), tm.makeStringIndex(n2)], + codes=[np.repeat(range(n1), n2).tolist(), + list(range(n2)) * n1], + names=['lev1', 'lev2']) + arr = np.random.randn(n1 * n2, 3) + arr[::10000, 0] = np.nan + arr[1::10000, 1] = np.nan + arr[2::10000, 2] = np.nan + data = DataFrame(arr, index=index, columns=['col1', 'col20', 'col3']) self.df = data - self.f_fillna = (lambda x: x.fillna(method='pad')) - np.random.seed(2718281) n = 20000 self.df1 = DataFrame(np.random.randint(1, n, (n, 3)), columns=['jim', 'joe', 'jolie']) @@ -682,10 +495,10 @@ def setup(self): self.df4 = self.df3.copy() self.df4['jim'] = self.df4['joe'] - def time_transform_func(self): - self.df.groupby(level='lev2').transform(self.f_fillna) + def time_transform_lambda_max(self): + self.df.groupby(level='lev1').transform(lambda x: max(x)) - def time_transform_ufunc(self): + def time_transform_ufunc_max(self): self.df.groupby(level='lev1').transform(np.max) def time_transform_multi_key1(self): @@ -701,63 
+514,30 @@ def time_transform_multi_key4(self): self.df4.groupby(['jim', 'joe'])['jolie'].transform('max') - - -np.random.seed(0) -N = 120000 -N_TRANSITIONS = 1400 -transition_points = np.random.permutation(np.arange(N))[:N_TRANSITIONS] -transition_points.sort() -transitions = np.zeros((N,), dtype=np.bool) -transitions[transition_points] = True -g = transitions.cumsum() -df = DataFrame({'signal': np.random.rand(N), }) - - - - - -class groupby_transform_series(object): - goal_time = 0.2 +class TransformBools(object): def setup(self): - np.random.seed(0) N = 120000 transition_points = np.sort(np.random.choice(np.arange(N), 1400)) - transitions = np.zeros((N,), dtype=np.bool) + transitions = np.zeros(N, dtype=np.bool) transitions[transition_points] = True self.g = transitions.cumsum() self.df = DataFrame({'signal': np.random.rand(N)}) - def time_groupby_transform_series(self): + def time_transform_mean(self): self.df['signal'].groupby(self.g).transform(np.mean) -class groupby_transform_series2(object): - goal_time = 0.2 - +class TransformNaN(object): + # GH 12737 def setup(self): - np.random.seed(0) - self.df = DataFrame({'key': (np.arange(100000) // 3), - 'val': np.random.randn(100000)}) - - self.df_nans = pd.DataFrame({'key': np.repeat(np.arange(1000), 10), - 'B': np.nan, - 'C': np.nan}) - self.df_nans.ix[4::10, 'B':'C'] = 5 + self.df_nans = DataFrame({'key': np.repeat(np.arange(1000), 10), + 'B': np.nan, + 'C': np.nan}) + self.df_nans.loc[4::10, 'B':'C'] = 5 - def time_transform_series2(self): - self.df.groupby('key')['val'].transform(np.mean) - - def time_cumprod(self): - self.df.groupby('key').cumprod() - - def time_cumsum(self): - self.df.groupby('key').cumsum() + def time_first(self): + self.df_nans.groupby('key').transform('first') - def time_shift(self): - self.df.groupby('key').shift() - def time_transform_dataframe(self): - # GH 12737 - self.df_nans.groupby('key').transform('first') +from .pandas_vb_common import setup # noqa: F401 diff --git 
a/asv_bench/benchmarks/hdfstore_bench.py b/asv_bench/benchmarks/hdfstore_bench.py deleted file mode 100644 index 78de5267a2969..0000000000000 --- a/asv_bench/benchmarks/hdfstore_bench.py +++ /dev/null @@ -1,122 +0,0 @@ -from .pandas_vb_common import * -import os - - -class HDF5(object): - goal_time = 0.2 - - def setup(self): - self.index = tm.makeStringIndex(25000) - self.df = DataFrame({'float1': randn(25000), 'float2': randn(25000),}, - index=self.index) - - self.df_mixed = DataFrame( - {'float1': randn(25000), 'float2': randn(25000), - 'string1': (['foo'] * 25000), - 'bool1': ([True] * 25000), - 'int1': np.random.randint(0, 250000, size=25000),}, - index=self.index) - - self.df_wide = DataFrame(np.random.randn(25000, 100)) - - self.df2 = DataFrame({'float1': randn(25000), 'float2': randn(25000)}, - index=date_range('1/1/2000', periods=25000)) - self.df_wide2 = DataFrame(np.random.randn(25000, 100), - index=date_range('1/1/2000', periods=25000)) - - self.df_dc = DataFrame(np.random.randn(10000, 10), - columns=[('C%03d' % i) for i in range(10)]) - - self.f = '__test__.h5' - self.remove(self.f) - - self.store = HDFStore(self.f) - self.store.put('df1', self.df) - self.store.put('df_mixed', self.df_mixed) - - self.store.append('df5', self.df_mixed) - self.store.append('df7', self.df) - - self.store.append('df9', self.df_wide) - - self.store.append('df11', self.df_wide2) - self.store.append('df12', self.df2) - - def teardown(self): - self.store.close() - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - def time_read_store(self): - self.store.get('df1') - - def time_read_store_mixed(self): - self.store.get('df_mixed') - - def time_write_store(self): - self.store.put('df2', self.df) - - def time_write_store_mixed(self): - self.store.put('df_mixed2', self.df_mixed) - - def time_read_store_table_mixed(self): - self.store.select('df5') - - def time_write_store_table_mixed(self): - self.store.append('df6', self.df_mixed) - - def 
time_read_store_table(self): - self.store.select('df7') - - def time_write_store_table(self): - self.store.append('df8', self.df) - - def time_read_store_table_wide(self): - self.store.select('df9') - - def time_write_store_table_wide(self): - self.store.append('df10', self.df_wide) - - def time_write_store_table_dc(self): - self.store.append('df15', self.df, data_columns=True) - - def time_query_store_table_wide(self): - self.store.select('df11', [('index', '>', self.df_wide2.index[10000]), - ('index', '<', self.df_wide2.index[15000])]) - - def time_query_store_table(self): - self.store.select('df12', [('index', '>', self.df2.index[10000]), - ('index', '<', self.df2.index[15000])]) - - -class HDF5Panel(object): - goal_time = 0.2 - - def setup(self): - self.f = '__test__.h5' - self.p = Panel(randn(20, 1000, 25), - items=[('Item%03d' % i) for i in range(20)], - major_axis=date_range('1/1/2000', periods=1000), - minor_axis=[('E%03d' % i) for i in range(25)]) - self.remove(self.f) - self.store = HDFStore(self.f) - self.store.append('p1', self.p) - - def teardown(self): - self.store.close() - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - - def time_read_store_table_panel(self): - self.store.select('p1') - - def time_write_store_table_panel(self): - self.store.append('p2', self.p) diff --git a/asv_bench/benchmarks/index_object.py b/asv_bench/benchmarks/index_object.py index 3fb53ce9b3c98..bbe164d4858ab 100644 --- a/asv_bench/benchmarks/index_object.py +++ b/asv_bench/benchmarks/index_object.py @@ -1,201 +1,184 @@ -from .pandas_vb_common import * +import numpy as np +import pandas.util.testing as tm +from pandas import (Series, date_range, DatetimeIndex, Index, RangeIndex, + Float64Index) class SetOperations(object): - goal_time = 0.2 - def setup(self): - self.rng = date_range('1/1/2000', periods=10000, freq='T') - self.rng2 = self.rng[:(-1)] + params = (['datetime', 'date_string', 'int', 'strings'], + ['intersection', 'union', 
'symmetric_difference']) + param_names = ['dtype', 'method'] - # object index with datetime values - if (self.rng.dtype == object): - self.idx_rng = self.rng.view(Index) - else: - self.idx_rng = self.rng.asobject - self.idx_rng2 = self.idx_rng[:(-1)] + def setup(self, dtype, method): + N = 10**5 + dates_left = date_range('1/1/2000', periods=N, freq='T') + fmt = '%Y-%m-%d %H:%M:%S' + date_str_left = Index(dates_left.strftime(fmt)) + int_left = Index(np.arange(N)) + str_left = tm.makeStringIndex(N) + data = {'datetime': {'left': dates_left, 'right': dates_left[:-1]}, + 'date_string': {'left': date_str_left, + 'right': date_str_left[:-1]}, + 'int': {'left': int_left, 'right': int_left[:-1]}, + 'strings': {'left': str_left, 'right': str_left[:-1]}} + self.left = data[dtype]['left'] + self.right = data[dtype]['right'] - # other datetime - N = 100000 - A = N - 20000 - B = N + 20000 - self.dtidx1 = DatetimeIndex(range(N)) - self.dtidx2 = DatetimeIndex(range(A, B)) - self.dtidx3 = DatetimeIndex(range(N, B)) - - # integer - self.N = 1000000 - self.options = np.arange(self.N) - self.left = Index( - self.options.take(np.random.permutation(self.N)[:(self.N // 2)])) - self.right = Index( - self.options.take(np.random.permutation(self.N)[:(self.N // 2)])) - - # strings - N = 10000 - strs = tm.rands_array(10, N) - self.leftstr = Index(strs[:N * 2 // 3]) - self.rightstr = Index(strs[N // 3:]) + def time_operation(self, dtype, method): + getattr(self.left, method)(self.right) - def time_datetime_intersection(self): - self.rng.intersection(self.rng2) - def time_datetime_union(self): - self.rng.union(self.rng2) +class SetDisjoint(object): - def time_datetime_difference(self): - self.dtidx1.difference(self.dtidx2) + def setup(self): + N = 10**5 + B = N + 20000 + self.datetime_left = DatetimeIndex(range(N)) + self.datetime_right = DatetimeIndex(range(N, B)) def time_datetime_difference_disjoint(self): - self.dtidx1.difference(self.dtidx3) - - def 
time_datetime_symmetric_difference(self): - self.dtidx1.symmetric_difference(self.dtidx2) - - def time_index_datetime_intersection(self): - self.idx_rng.intersection(self.idx_rng2) - - def time_index_datetime_union(self): - self.idx_rng.union(self.idx_rng2) - - def time_int64_intersection(self): - self.left.intersection(self.right) - - def time_int64_union(self): - self.left.union(self.right) - - def time_int64_difference(self): - self.left.difference(self.right) - - def time_int64_symmetric_difference(self): - self.left.symmetric_difference(self.right) - - def time_str_difference(self): - self.leftstr.difference(self.rightstr) - - def time_str_symmetric_difference(self): - self.leftstr.symmetric_difference(self.rightstr) + self.datetime_left.difference(self.datetime_right) class Datetime(object): - goal_time = 0.2 def setup(self): - self.dr = pd.date_range('20000101', freq='D', periods=10000) + self.dr = date_range('20000101', freq='D', periods=10000) def time_is_dates_only(self): self.dr._is_dates_only -class Float64(object): - goal_time = 0.2 - - def setup(self): - self.idx = tm.makeFloatIndex(1000000) - self.mask = ((np.arange(self.idx.size) % 3) == 0) - self.series_mask = Series(self.mask) - - self.baseidx = np.arange(1000000.0) +class Ops(object): - def time_boolean_indexer(self): - self.idx[self.mask] + sample_time = 0.2 + params = ['float', 'int'] + param_names = ['dtype'] - def time_boolean_series_indexer(self): - self.idx[self.series_mask] + def setup(self, dtype): + N = 10**6 + indexes = {'int': 'makeIntIndex', 'float': 'makeFloatIndex'} + self.index = getattr(tm, indexes[dtype])(N) - def time_construct(self): - Index(self.baseidx) + def time_add(self, dtype): + self.index + 2 - def time_div(self): - (self.idx / 2) + def time_subtract(self, dtype): + self.index - 2 - def time_get(self): - self.idx[1] + def time_multiply(self, dtype): + self.index * 2 - def time_mul(self): - (self.idx * 2) + def time_divide(self, dtype): + self.index / 2 - def 
time_slice_indexer_basic(self): - self.idx[:(-1)] - - def time_slice_indexer_even(self): - self.idx[::2] + def time_modulo(self, dtype): + self.index % 2 -class StringIndex(object): - goal_time = 0.2 +class Range(object): def setup(self): - self.idx = tm.makeStringIndex(1000000) - self.mask = ((np.arange(1000000) % 3) == 0) - self.series_mask = Series(self.mask) + self.idx_inc = RangeIndex(start=0, stop=10**7, step=3) + self.idx_dec = RangeIndex(start=10**7, stop=-1, step=-3) - def time_boolean_indexer(self): - self.idx[self.mask] + def time_max(self): + self.idx_inc.max() - def time_boolean_series_indexer(self): - self.idx[self.series_mask] + def time_max_trivial(self): + self.idx_dec.max() - def time_slice_indexer_basic(self): - self.idx[:(-1)] + def time_min(self): + self.idx_dec.min() - def time_slice_indexer_even(self): - self.idx[::2] + def time_min_trivial(self): + self.idx_inc.min() -class Multi1(object): - goal_time = 0.2 +class IndexAppend(object): def setup(self): - (n, k) = (200, 5000) - self.levels = [np.arange(n), tm.makeStringIndex(n).values, (1000 + np.arange(n))] - self.labels = [np.random.choice(n, (k * n)) for lev in self.levels] - self.mi = MultiIndex(levels=self.levels, labels=self.labels) - - self.iterables = [tm.makeStringIndex(10000), range(20)] - - def time_duplicated(self): - self.mi.duplicated() - - def time_from_product(self): - MultiIndex.from_product(self.iterables) + N = 10000 + self.range_idx = RangeIndex(0, 100) + self.int_idx = self.range_idx.astype(int) + self.obj_idx = self.int_idx.astype(str) + self.range_idxs = [] + self.int_idxs = [] + self.object_idxs = [] + for i in range(1, N): + r_idx = RangeIndex(i * 100, (i + 1) * 100) + self.range_idxs.append(r_idx) + i_idx = r_idx.astype(int) + self.int_idxs.append(i_idx) + o_idx = i_idx.astype(str) + self.object_idxs.append(o_idx) + + def time_append_range_list(self): + self.range_idx.append(self.range_idxs) + + def time_append_int_list(self): + self.int_idx.append(self.int_idxs) + + 
def time_append_obj_list(self): + self.obj_idx.append(self.object_idxs) + + +class Indexing(object): + + params = ['String', 'Float', 'Int'] + param_names = ['dtype'] + + def setup(self, dtype): + N = 10**6 + self.idx = getattr(tm, 'make{}Index'.format(dtype))(N) + self.array_mask = (np.arange(N) % 3) == 0 + self.series_mask = Series(self.array_mask) + self.sorted = self.idx.sort_values() + half = N // 2 + self.non_unique = self.idx[:half].append(self.idx[:half]) + self.non_unique_sorted = (self.sorted[:half].append(self.sorted[:half]) + .sort_values()) + self.key = self.sorted[N // 4] + + def time_boolean_array(self, dtype): + self.idx[self.array_mask] + + def time_boolean_series(self, dtype): + self.idx[self.series_mask] -class Multi2(object): - goal_time = 0.2 + def time_get(self, dtype): + self.idx[1] - def setup(self): - self.n = ((((3 * 5) * 7) * 11) * (1 << 10)) - (low, high) = (((-1) << 12), (1 << 12)) - self.f = (lambda k: np.repeat(np.random.randint(low, high, (self.n // k)), k)) - self.i = np.random.permutation(self.n) - self.mi = MultiIndex.from_arrays([self.f(11), self.f(7), self.f(5), self.f(3), self.f(1)])[self.i] + def time_slice(self, dtype): + self.idx[:-1] - self.a = np.repeat(np.arange(100), 1000) - self.b = np.tile(np.arange(1000), 100) - self.midx2 = MultiIndex.from_arrays([self.a, self.b]) - self.midx2 = self.midx2.take(np.random.permutation(np.arange(100000))) + def time_slice_step(self, dtype): + self.idx[::2] - def time_sortlevel_int64(self): - self.mi.sortlevel() + def time_get_loc(self, dtype): + self.idx.get_loc(self.key) - def time_sortlevel_zero(self): - self.midx2.sortlevel(0) + def time_get_loc_sorted(self, dtype): + self.sorted.get_loc(self.key) - def time_sortlevel_one(self): - self.midx2.sortlevel(1) + def time_get_loc_non_unique(self, dtype): + self.non_unique.get_loc(self.key) + def time_get_loc_non_unique_sorted(self, dtype): + self.non_unique_sorted.get_loc(self.key) -class Multi3(object): - goal_time = 0.2 +class 
Float64IndexMethod(object): + # GH 13166 def setup(self): - self.level1 = range(1000) - self.level2 = date_range(start='1/1/2012', periods=100) - self.mi = MultiIndex.from_product([self.level1, self.level2]) + N = 100000 + a = np.arange(N) + self.ind = Float64Index(a * 4.8000000418824129e-08) + + def time_get_loc(self): + self.ind.get_loc(0) - def time_datetime_level_values_full(self): - self.mi.copy().values - def time_datetime_level_values_sliced(self): - self.mi[:10].values +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/indexing.py b/asv_bench/benchmarks/indexing.py index d938cc6a6dc4d..57ba9cd80e55c 100644 --- a/asv_bench/benchmarks/indexing.py +++ b/asv_bench/benchmarks/indexing.py @@ -1,237 +1,350 @@ -from .pandas_vb_common import * -try: - import pandas.computation.expressions as expr -except: - expr = None +import warnings +import numpy as np +import pandas.util.testing as tm +from pandas import (Series, DataFrame, Panel, MultiIndex, + Int64Index, UInt64Index, Float64Index, + IntervalIndex, CategoricalIndex, + IndexSlice, concat, date_range) -class Int64Indexing(object): - goal_time = 0.2 - def setup(self): - self.s = Series(np.random.rand(1000000)) - - def time_getitem_scalar(self): - self.s[800000] +class NumericSeriesIndexing(object): - def time_getitem_slice(self): - self.s[:800000] + params = [ + (Int64Index, UInt64Index, Float64Index), + ('unique_monotonic_inc', 'nonunique_monotonic_inc'), + ] + param_names = ['index_dtype', 'index_structure'] - def time_getitem_list_like(self): - self.s[[800000]] + def setup(self, index, index_structure): + N = 10**6 + indices = { + 'unique_monotonic_inc': index(range(N)), + 'nonunique_monotonic_inc': index( + list(range(55)) + [54] + list(range(55, N - 1))), + } + self.data = Series(np.random.rand(N), index=indices[index_structure]) + self.array = np.arange(10000) + self.array_list = self.array.tolist() - def time_getitem_array(self): - self.s[np.arange(10000)] + def 
time_getitem_scalar(self, index, index_structure): + self.data[800000] - def time_iloc_array(self): - self.s.iloc[np.arange(10000)] + def time_getitem_slice(self, index, index_structure): + self.data[:800000] - def time_iloc_list_like(self): - self.s.iloc[[800000]] + def time_getitem_list_like(self, index, index_structure): + self.data[[800000]] - def time_iloc_scalar(self): - self.s.iloc[800000] + def time_getitem_array(self, index, index_structure): + self.data[self.array] - def time_iloc_slice(self): - self.s.iloc[:800000] + def time_getitem_lists(self, index, index_structure): + self.data[self.array_list] - def time_ix_array(self): - self.s.ix[np.arange(10000)] + def time_iloc_array(self, index, index_structure): + self.data.iloc[self.array] - def time_ix_list_like(self): - self.s.ix[[800000]] + def time_iloc_list_like(self, index, index_structure): + self.data.iloc[[800000]] - def time_ix_scalar(self): - self.s.ix[800000] + def time_iloc_scalar(self, index, index_structure): + self.data.iloc[800000] - def time_ix_slice(self): - self.s.ix[:800000] + def time_iloc_slice(self, index, index_structure): + self.data.iloc[:800000] - def time_loc_array(self): - self.s.loc[np.arange(10000)] + def time_ix_array(self, index, index_structure): + self.data.ix[self.array] - def time_loc_list_like(self): - self.s.loc[[800000]] + def time_ix_list_like(self, index, index_structure): + self.data.ix[[800000]] - def time_loc_scalar(self): - self.s.loc[800000] + def time_ix_scalar(self, index, index_structure): + self.data.ix[800000] - def time_loc_slice(self): - self.s.loc[:800000] + def time_ix_slice(self, index, index_structure): + self.data.ix[:800000] + def time_loc_array(self, index, index_structure): + self.data.loc[self.array] -class StringIndexing(object): - goal_time = 0.2 + def time_loc_list_like(self, index, index_structure): + self.data.loc[[800000]] - def setup(self): - self.index = tm.makeStringIndex(1000000) - self.s = Series(np.random.rand(1000000), 
index=self.index) - self.lbl = self.s.index[800000] + def time_loc_scalar(self, index, index_structure): + self.data.loc[800000] - def time_getitem_label_slice(self): - self.s[:self.lbl] + def time_loc_slice(self, index, index_structure): + self.data.loc[:800000] - def time_getitem_pos_slice(self): - self.s[:800000] - def time_get_value(self): - self.s.get_value(self.lbl) +class NonNumericSeriesIndexing(object): + params = [ + ('string', 'datetime'), + ('unique_monotonic_inc', 'nonunique_monotonic_inc'), + ] + param_names = ['index_dtype', 'index_structure'] -class DatetimeIndexing(object): - goal_time = 0.2 + def setup(self, index, index_structure): + N = 10**6 + indexes = {'string': tm.makeStringIndex(N), + 'datetime': date_range('1900', periods=N, freq='s')} + index = indexes[index] + if index_structure == 'nonunique_monotonic_inc': + index = index.insert(item=index[2], loc=2)[:-1] + self.s = Series(np.random.rand(N), index=index) + self.lbl = index[80000] - def setup(self): - tm.N = 1000 - self.ts = tm.makeTimeSeries() - self.dt = self.ts.index[500] + def time_getitem_label_slice(self, index, index_structure): + self.s[:self.lbl] - def time_getitem_scalar(self): - self.ts[self.dt] + def time_getitem_pos_slice(self, index, index_structure): + self.s[:80000] + def time_get_value(self, index, index_structure): + with warnings.catch_warnings(record=True): + self.s.get_value(self.lbl) -class DataFrameIndexing(object): - goal_time = 0.2 + def time_getitem_scalar(self, index, index_structure): + self.s[self.lbl] - def setup(self): - self.index = tm.makeStringIndex(1000) - self.columns = tm.makeStringIndex(30) - self.df = DataFrame(np.random.randn(1000, 30), index=self.index, - columns=self.columns) - self.idx = self.index[100] - self.col = self.columns[10] + def time_getitem_list_like(self, index, index_structure): + self.s[[self.lbl]] - self.df2 = DataFrame(np.random.randn(10000, 4), - columns=['A', 'B', 'C', 'D']) - self.indexer = (self.df2['B'] > 0) - 
self.obj_indexer = self.indexer.astype('O') - # duptes - self.idx_dupe = (np.array(range(30)) * 99) - self.df3 = DataFrame({'A': ([0.1] * 1000), 'B': ([1] * 1000),}) - self.df3 = concat([self.df3, (2 * self.df3), (3 * self.df3)]) +class DataFrameStringIndexing(object): - self.df_big = DataFrame(dict(A=(['foo'] * 1000000))) + def setup(self): + index = tm.makeStringIndex(1000) + columns = tm.makeStringIndex(30) + self.df = DataFrame(np.random.randn(1000, 30), index=index, + columns=columns) + self.idx_scalar = index[100] + self.col_scalar = columns[10] + self.bool_indexer = self.df[self.col_scalar] > 0 + self.bool_obj_indexer = self.bool_indexer.astype(object) def time_get_value(self): - self.df.get_value(self.idx, self.col) + with warnings.catch_warnings(record=True): + self.df.get_value(self.idx_scalar, self.col_scalar) + + def time_ix(self): + self.df.ix[self.idx_scalar, self.col_scalar] - def time_get_value_ix(self): - self.df.ix[(self.idx, self.col)] + def time_loc(self): + self.df.loc[self.idx_scalar, self.col_scalar] def time_getitem_scalar(self): - self.df[self.col][self.idx] + self.df[self.col_scalar][self.idx_scalar] def time_boolean_rows(self): - self.df2[self.indexer] + self.df[self.bool_indexer] def time_boolean_rows_object(self): - self.df2[self.obj_indexer] + self.df[self.bool_obj_indexer] + + +class DataFrameNumericIndexing(object): + + def setup(self): + self.idx_dupe = np.array(range(30)) * 99 + self.df = DataFrame(np.random.randn(10000, 5)) + self.df_dup = concat([self.df, 2 * self.df, 3 * self.df]) + self.bool_indexer = [True] * 5000 + [False] * 5000 def time_iloc_dups(self): - self.df3.iloc[self.idx_dupe] + self.df_dup.iloc[self.idx_dupe] def time_loc_dups(self): - self.df3.loc[self.idx_dupe] + self.df_dup.loc[self.idx_dupe] - def time_iloc_big(self): - self.df_big.iloc[:100, 0] + def time_iloc(self): + self.df.iloc[:100, 0] + def time_loc(self): + self.df.loc[:100, 0] -class IndexingMethods(object): - # GH 13166 - goal_time = 0.2 + def 
time_bool_indexer(self): + self.df[self.bool_indexer] - def setup(self): - a = np.arange(100000) - self.ind = pd.Float64Index(a * 4.8000000418824129e-08) - self.s = Series(np.random.rand(100000)) - self.ts = Series(np.random.rand(100000), - index=date_range('2011-01-01', freq='S', periods=100000)) - self.indexer = ([True, False, True, True, False] * 20000) +class Take(object): - def time_get_loc_float(self): - self.ind.get_loc(0) + params = ['int', 'datetime'] + param_names = ['index'] - def time_take_dtindex(self): - self.ts.take(self.indexer) + def setup(self, index): + N = 100000 + indexes = {'int': Int64Index(np.arange(N)), + 'datetime': date_range('2011-01-01', freq='S', periods=N)} + index = indexes[index] + self.s = Series(np.random.rand(N), index=index) + self.indexer = [True, False, True, True, False] * 20000 - def time_take_intindex(self): + def time_take(self, index): self.s.take(self.indexer) class MultiIndexing(object): - goal_time = 0.2 def setup(self): - self.mi = MultiIndex.from_tuples([(x, y) for x in range(1000) for y in range(1000)]) - self.s = Series(np.random.randn(1000000), index=self.mi) + mi = MultiIndex.from_product([range(1000), range(1000)]) + self.s = Series(np.random.randn(1000000), index=mi) self.df = DataFrame(self.s) - # slicers - np.random.seed(1234) - self.idx = pd.IndexSlice - self.n = 100000 - self.mdt = pandas.DataFrame() - self.mdt['A'] = np.random.choice(range(10000, 45000, 1000), self.n) - self.mdt['B'] = np.random.choice(range(10, 400), self.n) - self.mdt['C'] = np.random.choice(range(1, 150), self.n) - self.mdt['D'] = np.random.choice(range(10000, 45000), self.n) - self.mdt['x'] = np.random.choice(range(400), self.n) - self.mdt['y'] = np.random.choice(range(25), self.n) - self.test_A = 25000 - self.test_B = 25 - self.test_C = 40 - self.test_D = 35000 - self.eps_A = 5000 - self.eps_B = 5 - self.eps_C = 5 - self.eps_D = 5000 - self.mdt2 = self.mdt.set_index(['A', 'B', 'C', 'D']).sortlevel() - self.miint = 
MultiIndex.from_product( - [np.arange(1000), - np.arange(1000)], names=['one', 'two']) - - import string - self.mistring = MultiIndex.from_product( - [np.arange(1000), - np.arange(20), list(string.ascii_letters)], - names=['one', 'two', 'three']) - - def time_series_xs_mi_ix(self): + n = 100000 + self.mdt = DataFrame({'A': np.random.choice(range(10000, 45000, 1000), + n), + 'B': np.random.choice(range(10, 400), n), + 'C': np.random.choice(range(1, 150), n), + 'D': np.random.choice(range(10000, 45000), n), + 'x': np.random.choice(range(400), n), + 'y': np.random.choice(range(25), n)}) + self.idx = IndexSlice[20000:30000, 20:30, 35:45, 30000:40000] + self.mdt = self.mdt.set_index(['A', 'B', 'C', 'D']).sort_index() + + def time_series_ix(self): self.s.ix[999] - def time_frame_xs_mi_ix(self): + def time_frame_ix(self): self.df.ix[999] - def time_multiindex_slicers(self): - self.mdt2.loc[self.idx[ - (self.test_A - self.eps_A):(self.test_A + self.eps_A), - (self.test_B - self.eps_B):(self.test_B + self.eps_B), - (self.test_C - self.eps_C):(self.test_C + self.eps_C), - (self.test_D - self.eps_D):(self.test_D + self.eps_D)], :] + def time_index_slice(self): + self.mdt.loc[self.idx, :] + + +class IntervalIndexing(object): + + def setup_cache(self): + idx = IntervalIndex.from_breaks(np.arange(1000001)) + monotonic = Series(np.arange(1000000), index=idx) + return monotonic + + def time_getitem_scalar(self, monotonic): + monotonic[80000] + + def time_loc_scalar(self, monotonic): + monotonic.loc[80000] + + def time_getitem_list(self, monotonic): + monotonic[80000:] + + def time_loc_list(self, monotonic): + monotonic.loc[80000:] + + +class CategoricalIndexIndexing(object): + + params = ['monotonic_incr', 'monotonic_decr', 'non_monotonic'] + param_names = ['index'] + + def setup(self, index): + N = 10**5 + values = list('a' * N + 'b' * N + 'c' * N) + indices = { + 'monotonic_incr': CategoricalIndex(values), + 'monotonic_decr': CategoricalIndex(reversed(values)), + 
'non_monotonic': CategoricalIndex(list('abc' * N))} + self.data = indices[index] + + self.int_scalar = 10000 + self.int_list = list(range(10000)) + + self.cat_scalar = 'b' + self.cat_list = ['a', 'c'] - def time_multiindex_get_indexer(self): - self.miint.get_indexer( - np.array([(0, 10), (0, 11), (0, 12), - (0, 13), (0, 14), (0, 15), - (0, 16), (0, 17), (0, 18), - (0, 19)], dtype=object)) + def time_getitem_scalar(self, index): + self.data[self.int_scalar] - def time_multiindex_string_get_loc(self): - self.mistring.get_loc((999, 19, 'Z')) + def time_getitem_slice(self, index): + self.data[:self.int_scalar] - def time_is_monotonic(self): - self.miint.is_monotonic + def time_getitem_list_like(self, index): + self.data[[self.int_scalar]] + + def time_getitem_list(self, index): + self.data[self.int_list] + + def time_getitem_bool_array(self, index): + self.data[self.data == self.cat_scalar] + + def time_get_loc_scalar(self, index): + self.data.get_loc(self.cat_scalar) + + def time_get_indexer_list(self, index): + self.data.get_indexer(self.cat_list) class PanelIndexing(object): - goal_time = 0.2 def setup(self): - self.p = Panel(np.random.randn(100, 100, 100)) - self.inds = range(0, 100, 10) + with warnings.catch_warnings(record=True): + self.p = Panel(np.random.randn(100, 100, 100)) + self.inds = range(0, 100, 10) def time_subset(self): - self.p.ix[(self.inds, self.inds, self.inds)] + with warnings.catch_warnings(record=True): + self.p.ix[(self.inds, self.inds, self.inds)] + + +class MethodLookup(object): + + def setup_cache(self): + s = Series() + return s + + def time_lookup_iloc(self, s): + s.iloc + + def time_lookup_ix(self, s): + s.ix + + def time_lookup_loc(self, s): + s.loc + + +class GetItemSingleColumn(object): + + def setup(self): + self.df_string_col = DataFrame(np.random.randn(3000, 1), columns=['A']) + self.df_int_col = DataFrame(np.random.randn(3000, 1)) + + def time_frame_getitem_single_column_label(self): + self.df_string_col['A'] + + def 
time_frame_getitem_single_column_int(self): + self.df_int_col[0] + + +class AssignTimeseriesIndex(object): + + def setup(self): + N = 100000 + idx = date_range('1/1/2000', periods=N, freq='H') + self.df = DataFrame(np.random.randn(N, 1), columns=['A'], index=idx) + + def time_frame_assign_timeseries_index(self): + self.df['date'] = self.df.index + + +class InsertColumns(object): + + def setup(self): + self.N = 10**3 + self.df = DataFrame(index=range(self.N)) + + def time_insert(self): + np.random.seed(1234) + for i in range(100): + self.df.insert(0, i, np.random.randn(self.N), + allow_duplicates=True) + + def time_assign_with_setitem(self): + np.random.seed(1234) + for i in range(100): + self.df[i] = np.random.randn(self.N) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/indexing_engines.py b/asv_bench/benchmarks/indexing_engines.py new file mode 100644 index 0000000000000..f3d063ee31bc8 --- /dev/null +++ b/asv_bench/benchmarks/indexing_engines.py @@ -0,0 +1,64 @@ +import numpy as np + +from pandas._libs import index as libindex + + +def _get_numeric_engines(): + engine_names = [ + ('Int64Engine', np.int64), ('Int32Engine', np.int32), + ('Int16Engine', np.int16), ('Int8Engine', np.int8), + ('UInt64Engine', np.uint64), ('UInt32Engine', np.uint32), + ('UInt16engine', np.uint16), ('UInt8Engine', np.uint8), + ('Float64Engine', np.float64), ('Float32Engine', np.float32), + ] + return [(getattr(libindex, engine_name), dtype) + for engine_name, dtype in engine_names + if hasattr(libindex, engine_name)] + + +class NumericEngineIndexing(object): + + params = [_get_numeric_engines(), + ['monotonic_incr', 'monotonic_decr', 'non_monotonic'], + ] + param_names = ['engine_and_dtype', 'index_type'] + + def setup(self, engine_and_dtype, index_type): + engine, dtype = engine_and_dtype + N = 10**5 + values = list([1] * N + [2] * N + [3] * N) + arr = { + 'monotonic_incr': np.array(values, dtype=dtype), + 'monotonic_decr': 
np.array(list(reversed(values)), + dtype=dtype), + 'non_monotonic': np.array([1, 2, 3] * N, dtype=dtype), + }[index_type] + + self.data = engine(lambda: arr, len(arr)) + # code belows avoids populating the mapping etc. while timing. + self.data.get_loc(2) + + def time_get_loc(self, engine_and_dtype, index_type): + self.data.get_loc(2) + + +class ObjectEngineIndexing(object): + + params = [('monotonic_incr', 'monotonic_decr', 'non_monotonic')] + param_names = ['index_type'] + + def setup(self, index_type): + N = 10**5 + values = list('a' * N + 'b' * N + 'c' * N) + arr = { + 'monotonic_incr': np.array(values, dtype=object), + 'monotonic_decr': np.array(list(reversed(values)), dtype=object), + 'non_monotonic': np.array(list('abc') * N, dtype=object), + }[index_type] + + self.data = libindex.ObjectEngine(lambda: arr, len(arr)) + # code belows avoids populating the mapping etc. while timing. + self.data.get_loc('b') + + def time_get_loc(self, index_type): + self.data.get_loc('b') diff --git a/asv_bench/benchmarks/inference.py b/asv_bench/benchmarks/inference.py index 3635438a7f76b..423bd02b93596 100644 --- a/asv_bench/benchmarks/inference.py +++ b/asv_bench/benchmarks/inference.py @@ -1,77 +1,76 @@ -from .pandas_vb_common import * -import pandas as pd +import numpy as np +import pandas.util.testing as tm +from pandas import DataFrame, Series, to_numeric +from .pandas_vb_common import numeric_dtypes, lib -class DtypeInfer(object): - goal_time = 0.2 +class NumericInferOps(object): # from GH 7332 + params = numeric_dtypes + param_names = ['dtype'] - def setup(self): - self.N = 500000 - self.df_int64 = DataFrame(dict(A=np.arange(self.N, dtype='int64'), - B=np.arange(self.N, dtype='int64'))) - self.df_int32 = DataFrame(dict(A=np.arange(self.N, dtype='int32'), - B=np.arange(self.N, dtype='int32'))) - self.df_uint32 = DataFrame(dict(A=np.arange(self.N, dtype='uint32'), - B=np.arange(self.N, dtype='uint32'))) - self.df_float64 = DataFrame(dict(A=np.arange(self.N, 
dtype='float64'), - B=np.arange(self.N, dtype='float64'))) - self.df_float32 = DataFrame(dict(A=np.arange(self.N, dtype='float32'), - B=np.arange(self.N, dtype='float32'))) - self.df_datetime64 = DataFrame(dict(A=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'), - B=pd.to_datetime(np.arange(self.N, dtype='int64'), unit='ms'))) - self.df_timedelta64 = DataFrame(dict(A=(self.df_datetime64['A'] - self.df_datetime64['B']), - B=self.df_datetime64['B'])) + def setup(self, dtype): + N = 5 * 10**5 + self.df = DataFrame({'A': np.arange(N).astype(dtype), + 'B': np.arange(N).astype(dtype)}) - def time_int64(self): - (self.df_int64['A'] + self.df_int64['B']) + def time_add(self, dtype): + self.df['A'] + self.df['B'] - def time_int32(self): - (self.df_int32['A'] + self.df_int32['B']) + def time_subtract(self, dtype): + self.df['A'] - self.df['B'] - def time_uint32(self): - (self.df_uint32['A'] + self.df_uint32['B']) + def time_multiply(self, dtype): + self.df['A'] * self.df['B'] - def time_float64(self): - (self.df_float64['A'] + self.df_float64['B']) + def time_divide(self, dtype): + self.df['A'] / self.df['B'] - def time_float32(self): - (self.df_float32['A'] + self.df_float32['B']) + def time_modulo(self, dtype): + self.df['A'] % self.df['B'] - def time_datetime64(self): - (self.df_datetime64['A'] - self.df_datetime64['B']) - def time_timedelta64_1(self): - (self.df_timedelta64['A'] + self.df_timedelta64['B']) +class DateInferOps(object): + # from GH 7332 + def setup_cache(self): + N = 5 * 10**5 + df = DataFrame({'datetime64': np.arange(N).astype('datetime64[ms]')}) + df['timedelta'] = df['datetime64'] - df['datetime64'] + return df - def time_timedelta64_2(self): - (self.df_timedelta64['A'] + self.df_timedelta64['A']) + def time_subtract_datetimes(self, df): + df['datetime64'] - df['datetime64'] + def time_timedelta_plus_datetime(self, df): + df['timedelta'] + df['datetime64'] -class to_numeric(object): - goal_time = 0.2 + def time_add_timedeltas(self, df): + 
df['timedelta'] + df['timedelta'] - def setup(self): - self.n = 10000 - self.float = Series(np.random.randn(self.n * 100)) - self.numstr = self.float.astype('str') - self.str = Series(tm.makeStringIndex(self.n)) - def time_from_float(self): - pd.to_numeric(self.float) +class ToNumeric(object): + + params = ['ignore', 'coerce'] + param_names = ['errors'] + + def setup(self, errors): + N = 10000 + self.float = Series(np.random.randn(N)) + self.numstr = self.float.astype('str') + self.str = Series(tm.makeStringIndex(N)) - def time_from_numeric_str(self): - pd.to_numeric(self.numstr) + def time_from_float(self, errors): + to_numeric(self.float, errors=errors) - def time_from_str_ignore(self): - pd.to_numeric(self.str, errors='ignore') + def time_from_numeric_str(self, errors): + to_numeric(self.numstr, errors=errors) - def time_from_str_coerce(self): - pd.to_numeric(self.str, errors='coerce') + def time_from_str(self, errors): + to_numeric(self.str, errors=errors) -class to_numeric_downcast(object): +class ToNumericDowncast(object): param_names = ['dtype', 'downcast'] params = [['string-float', 'string-int', 'string-nint', 'datetime64', @@ -81,37 +80,33 @@ class to_numeric_downcast(object): N = 500000 N2 = int(N / 2) - data_dict = { - 'string-int': (['1'] * N2) + ([2] * N2), - 'string-nint': (['-1'] * N2) + ([2] * N2), - 'datetime64': np.repeat(np.array(['1970-01-01', '1970-01-02'], - dtype='datetime64[D]'), N), - 'string-float': (['1.1'] * N2) + ([2] * N2), - 'int-list': ([1] * N2) + ([2] * N2), - 'int32': np.repeat(np.int32(1), N) - } + data_dict = {'string-int': ['1'] * N2 + [2] * N2, + 'string-nint': ['-1'] * N2 + [2] * N2, + 'datetime64': np.repeat(np.array(['1970-01-01', '1970-01-02'], + dtype='datetime64[D]'), N), + 'string-float': ['1.1'] * N2 + [2] * N2, + 'int-list': [1] * N2 + [2] * N2, + 'int32': np.repeat(np.int32(1), N)} def setup(self, dtype, downcast): self.data = self.data_dict[dtype] def time_downcast(self, dtype, downcast): - pd.to_numeric(self.data, 
downcast=downcast) + to_numeric(self.data, downcast=downcast) class MaybeConvertNumeric(object): - def setup(self): - n = 1000000 - arr = np.repeat([2**63], n) - arr = arr + np.arange(n).astype('uint64') - arr = np.array([arr[i] if i%2 == 0 else - str(arr[i]) for i in range(n)], - dtype=object) - - arr[-1] = -1 - self.data = arr - self.na_values = set() - - def time_convert(self): - pd.lib.maybe_convert_numeric(self.data, self.na_values, - coerce_numeric=False) + def setup_cache(self): + N = 10**6 + arr = np.repeat([2**63], N) + np.arange(N).astype('uint64') + data = arr.astype(object) + data[1::2] = arr[1::2].astype(str) + data[-1] = -1 + return data + + def time_convert(self, data): + lib.maybe_convert_numeric(data, set(), coerce_numeric=False) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/doc/sphinxext/ipython_sphinxext/__init__.py b/asv_bench/benchmarks/io/__init__.py similarity index 100% rename from doc/sphinxext/ipython_sphinxext/__init__.py rename to asv_bench/benchmarks/io/__init__.py diff --git a/asv_bench/benchmarks/io/csv.py b/asv_bench/benchmarks/io/csv.py new file mode 100644 index 0000000000000..d42a15d61fb0d --- /dev/null +++ b/asv_bench/benchmarks/io/csv.py @@ -0,0 +1,236 @@ +import random +import string + +import numpy as np +import pandas.util.testing as tm +from pandas import DataFrame, Categorical, date_range, read_csv +from pandas.compat import cStringIO as StringIO + +from ..pandas_vb_common import BaseIO + + +class ToCSV(BaseIO): + + fname = '__test__.csv' + params = ['wide', 'long', 'mixed'] + param_names = ['kind'] + + def setup(self, kind): + wide_frame = DataFrame(np.random.randn(3000, 30)) + long_frame = DataFrame({'A': np.arange(50000), + 'B': np.arange(50000) + 1., + 'C': np.arange(50000) + 2., + 'D': np.arange(50000) + 3.}) + mixed_frame = DataFrame({'float': np.random.randn(5000), + 'int': np.random.randn(5000).astype(int), + 'bool': (np.arange(5000) % 2) == 0, + 'datetime': date_range('2001', + freq='s', + 
periods=5000),
+                                 'object': ['foo'] * 5000})
+        mixed_frame.loc[30:500, 'float'] = np.nan
+        data = {'wide': wide_frame,
+                'long': long_frame,
+                'mixed': mixed_frame}
+        self.df = data[kind]
+
+    def time_frame(self, kind):
+        self.df.to_csv(self.fname)
+
+
+class ToCSVDatetime(BaseIO):
+
+    fname = '__test__.csv'
+
+    def setup(self):
+        rng = date_range('1/1/2000', periods=1000)
+        self.data = DataFrame(rng, index=rng)
+
+    def time_frame_date_formatting(self):
+        self.data.to_csv(self.fname, date_format='%Y%m%d')
+
+
+class StringIORewind(object):
+
+    def data(self, stringio_object):
+        stringio_object.seek(0)
+        return stringio_object
+
+
+class ReadCSVDInferDatetimeFormat(StringIORewind):
+
+    params = ([True, False], ['custom', 'iso8601', 'ymd'])
+    param_names = ['infer_datetime_format', 'format']
+
+    def setup(self, infer_datetime_format, format):
+        rng = date_range('1/1/2000', periods=1000)
+        formats = {'custom': '%m/%d/%Y %H:%M:%S.%f',
+                   'iso8601': '%Y-%m-%d %H:%M:%S',
+                   'ymd': '%Y%m%d'}
+        dt_format = formats[format]
+        self.StringIO_input = StringIO('\n'.join(
+                                       rng.strftime(dt_format).tolist()))
+
+    def time_read_csv(self, infer_datetime_format, format):
+        read_csv(self.data(self.StringIO_input),
+                 header=None, names=['foo'], parse_dates=['foo'],
+                 infer_datetime_format=infer_datetime_format)
+
+
+class ReadCSVSkipRows(BaseIO):
+
+    fname = '__test__.csv'
+    params = [None, 10000]
+    param_names = ['skiprows']
+
+    def setup(self, skiprows):
+        N = 20000
+        index = tm.makeStringIndex(N)
+        df = DataFrame({'float1': np.random.randn(N),
+                        'float2': np.random.randn(N),
+                        'string1': ['foo'] * N,
+                        'bool1': [True] * N,
+                        'int1': np.random.randint(0, N, size=N)},
+                       index=index)
+        df.to_csv(self.fname)
+
+    def time_skipprows(self, skiprows):
+        read_csv(self.fname, skiprows=skiprows)
+
+
+class ReadUint64Integers(StringIORewind):
+
+    def setup(self):
+        self.na_values = [2**63 + 500]
+        arr = np.arange(10000).astype('uint64') + 2**63
+        self.data1 = StringIO('\n'.join(arr.astype(str).tolist()))
+        arr = arr.astype(object)
+        arr[500] = -1
+        self.data2 = StringIO('\n'.join(arr.astype(str).tolist()))
+
+    def time_read_uint64(self):
+        read_csv(self.data(self.data1), header=None, names=['foo'])
+
+    def time_read_uint64_neg_values(self):
+        read_csv(self.data(self.data2), header=None, names=['foo'])
+
+    def time_read_uint64_na_values(self):
+        read_csv(self.data(self.data1), header=None, names=['foo'],
+                 na_values=self.na_values)
+
+
+class ReadCSVThousands(BaseIO):
+
+    fname = '__test__.csv'
+    params = ([',', '|'], [None, ','])
+    param_names = ['sep', 'thousands']
+
+    def setup(self, sep, thousands):
+        N = 10000
+        K = 8
+        data = np.random.randn(N, K) * np.random.randint(100, 10000, (N, K))
+        df = DataFrame(data)
+        if thousands is not None:
+            fmt = ':{}'.format(thousands)
+            fmt = '{' + fmt + '}'
+            df = df.applymap(lambda x: fmt.format(x))
+        df.to_csv(self.fname, sep=sep)
+
+    def time_thousands(self, sep, thousands):
+        read_csv(self.fname, sep=sep, thousands=thousands)
+
+
+class ReadCSVComment(StringIORewind):
+
+    def setup(self):
+        data = ['A,B,C'] + (['1,2,3 # comment'] * 100000)
+        self.StringIO_input = StringIO('\n'.join(data))
+
+    def time_comment(self):
+        read_csv(self.data(self.StringIO_input), comment='#',
+                 header=None, names=list('abc'))
+
+
+class ReadCSVFloatPrecision(StringIORewind):
+
+    params = ([',', ';'], ['.', '_'], [None, 'high', 'round_trip'])
+    param_names = ['sep', 'decimal', 'float_precision']
+
+    def setup(self, sep, decimal, float_precision):
+        floats = [''.join(random.choice(string.digits) for _ in range(28))
+                  for _ in range(15)]
+        rows = sep.join(['0{}'.format(decimal) + '{}'] * 3) + '\n'
+        data = rows * 5
+        data = data.format(*floats) * 200  # 1000 x 3 strings csv
+        self.StringIO_input = StringIO(data)
+
+    def time_read_csv(self, sep, decimal, float_precision):
+        read_csv(self.data(self.StringIO_input), sep=sep, header=None,
+                 names=list('abc'), float_precision=float_precision)
+
+    def time_read_csv_python_engine(self, sep, decimal, float_precision):
+        read_csv(self.data(self.StringIO_input), sep=sep, header=None,
+                 engine='python', float_precision=None, names=list('abc'))
+
+
+class ReadCSVCategorical(BaseIO):
+
+    fname = '__test__.csv'
+
+    def setup(self):
+        N = 100000
+        group1 = ['aaaaaaaa', 'bbbbbbb', 'cccccccc', 'dddddddd', 'eeeeeeee']
+        df = DataFrame(np.random.choice(group1, (N, 3)), columns=list('abc'))
+        df.to_csv(self.fname, index=False)
+
+    def time_convert_post(self):
+        read_csv(self.fname).apply(Categorical)
+
+    def time_convert_direct(self):
+        read_csv(self.fname, dtype='category')
+
+
+class ReadCSVParseDates(StringIORewind):
+
+    def setup(self):
+        data = """{},19:00:00,18:56:00,0.8100,2.8100,7.2000,0.0000,280.0000\n
+                  {},20:00:00,19:56:00,0.0100,2.2100,7.2000,0.0000,260.0000\n
+                  {},21:00:00,20:56:00,-0.5900,2.2100,5.7000,0.0000,280.0000\n
+                  {},21:00:00,21:18:00,-0.9900,2.0100,3.6000,0.0000,270.0000\n
+                  {},22:00:00,21:56:00,-0.5900,1.7100,5.1000,0.0000,290.0000\n
+               """
+        two_cols = ['KORD,19990127'] * 5
+        data = data.format(*two_cols)
+        self.StringIO_input = StringIO(data)
+
+    def time_multiple_date(self):
+        read_csv(self.data(self.StringIO_input), sep=',', header=None,
+                 names=list(string.digits[:9]),
+                 parse_dates=[[1, 2], [1, 3]])
+
+    def time_baseline(self):
+        read_csv(self.data(self.StringIO_input), sep=',', header=None,
+                 parse_dates=[1],
+                 names=list(string.digits[:9]))
+
+
+class ReadCSVMemoryGrowth(BaseIO):
+
+    chunksize = 20
+    num_rows = 1000
+    fname = "__test__.csv"
+
+    def setup(self):
+        with open(self.fname, "w") as f:
+            for i in range(self.num_rows):
+                f.write("{i}\n".format(i=i))
+
+    def mem_parser_chunks(self):
+        # see gh-24805.
+ result = read_csv(self.fname, chunksize=self.chunksize) + + for _ in result: + pass + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/excel.py b/asv_bench/benchmarks/io/excel.py new file mode 100644 index 0000000000000..1bee864fbcf2d --- /dev/null +++ b/asv_bench/benchmarks/io/excel.py @@ -0,0 +1,36 @@ +import numpy as np +from pandas import DataFrame, date_range, ExcelWriter, read_excel +from pandas.compat import BytesIO +import pandas.util.testing as tm + + +class Excel(object): + + params = ['openpyxl', 'xlsxwriter', 'xlwt'] + param_names = ['engine'] + + def setup(self, engine): + N = 2000 + C = 5 + self.df = DataFrame(np.random.randn(N, C), + columns=['float{}'.format(i) for i in range(C)], + index=date_range('20000101', periods=N, freq='H')) + self.df['object'] = tm.makeStringIndex(N) + self.bio_read = BytesIO() + self.writer_read = ExcelWriter(self.bio_read, engine=engine) + self.df.to_excel(self.writer_read, sheet_name='Sheet1') + self.writer_read.save() + self.bio_read.seek(0) + + def time_read_excel(self, engine): + read_excel(self.bio_read) + + def time_write_excel(self, engine): + bio_write = BytesIO() + bio_write.seek(0) + writer_write = ExcelWriter(bio_write, engine=engine) + self.df.to_excel(writer_write, sheet_name='Sheet1') + writer_write.save() + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/hdf.py b/asv_bench/benchmarks/io/hdf.py new file mode 100644 index 0000000000000..a5dc28eb9508c --- /dev/null +++ b/asv_bench/benchmarks/io/hdf.py @@ -0,0 +1,122 @@ +import numpy as np +from pandas import DataFrame, date_range, HDFStore, read_hdf +import pandas.util.testing as tm + +from ..pandas_vb_common import BaseIO + + +class HDFStoreDataFrame(BaseIO): + + def setup(self): + N = 25000 + index = tm.makeStringIndex(N) + self.df = DataFrame({'float1': np.random.randn(N), + 'float2': np.random.randn(N)}, + index=index) + self.df_mixed = DataFrame({'float1': 
np.random.randn(N), + 'float2': np.random.randn(N), + 'string1': ['foo'] * N, + 'bool1': [True] * N, + 'int1': np.random.randint(0, N, size=N)}, + index=index) + self.df_wide = DataFrame(np.random.randn(N, 100)) + self.start_wide = self.df_wide.index[10000] + self.stop_wide = self.df_wide.index[15000] + self.df2 = DataFrame({'float1': np.random.randn(N), + 'float2': np.random.randn(N)}, + index=date_range('1/1/2000', periods=N)) + self.start = self.df2.index[10000] + self.stop = self.df2.index[15000] + self.df_wide2 = DataFrame(np.random.randn(N, 100), + index=date_range('1/1/2000', periods=N)) + self.df_dc = DataFrame(np.random.randn(N, 10), + columns=['C%03d' % i for i in range(10)]) + + self.fname = '__test__.h5' + + self.store = HDFStore(self.fname) + self.store.put('fixed', self.df) + self.store.put('fixed_mixed', self.df_mixed) + self.store.append('table', self.df2) + self.store.append('table_mixed', self.df_mixed) + self.store.append('table_wide', self.df_wide) + self.store.append('table_wide2', self.df_wide2) + + def teardown(self): + self.store.close() + self.remove(self.fname) + + def time_read_store(self): + self.store.get('fixed') + + def time_read_store_mixed(self): + self.store.get('fixed_mixed') + + def time_write_store(self): + self.store.put('fixed_write', self.df) + + def time_write_store_mixed(self): + self.store.put('fixed_mixed_write', self.df_mixed) + + def time_read_store_table_mixed(self): + self.store.select('table_mixed') + + def time_write_store_table_mixed(self): + self.store.append('table_mixed_write', self.df_mixed) + + def time_read_store_table(self): + self.store.select('table') + + def time_write_store_table(self): + self.store.append('table_write', self.df) + + def time_read_store_table_wide(self): + self.store.select('table_wide') + + def time_write_store_table_wide(self): + self.store.append('table_wide_write', self.df_wide) + + def time_write_store_table_dc(self): + self.store.append('table_dc_write', self.df_dc, 
data_columns=True) + + def time_query_store_table_wide(self): + self.store.select('table_wide', where="index > self.start_wide and " + "index < self.stop_wide") + + def time_query_store_table(self): + self.store.select('table', where="index > self.start and " + "index < self.stop") + + def time_store_repr(self): + repr(self.store) + + def time_store_str(self): + str(self.store) + + def time_store_info(self): + self.store.info() + + +class HDF(BaseIO): + + params = ['table', 'fixed'] + param_names = ['format'] + + def setup(self, format): + self.fname = '__test__.h5' + N = 100000 + C = 5 + self.df = DataFrame(np.random.randn(N, C), + columns=['float{}'.format(i) for i in range(C)], + index=date_range('20000101', periods=N, freq='H')) + self.df['object'] = tm.makeStringIndex(N) + self.df.to_hdf(self.fname, 'df', format=format) + + def time_read_hdf(self, format): + read_hdf(self.fname, 'df') + + def time_write_hdf(self, format): + self.df.to_hdf(self.fname, 'df', format=format) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/json.py b/asv_bench/benchmarks/io/json.py new file mode 100644 index 0000000000000..ec2ddc11b7c1d --- /dev/null +++ b/asv_bench/benchmarks/io/json.py @@ -0,0 +1,127 @@ +import numpy as np +import pandas.util.testing as tm +from pandas import DataFrame, date_range, timedelta_range, concat, read_json + +from ..pandas_vb_common import BaseIO + + +class ReadJSON(BaseIO): + + fname = "__test__.json" + params = (['split', 'index', 'records'], ['int', 'datetime']) + param_names = ['orient', 'index'] + + def setup(self, orient, index): + N = 100000 + indexes = {'int': np.arange(N), + 'datetime': date_range('20000101', periods=N, freq='H')} + df = DataFrame(np.random.randn(N, 5), + columns=['float_{}'.format(i) for i in range(5)], + index=indexes[index]) + df.to_json(self.fname, orient=orient) + + def time_read_json(self, orient, index): + read_json(self.fname, orient=orient) + + +class ReadJSONLines(BaseIO): + 
+ fname = "__test_lines__.json" + params = ['int', 'datetime'] + param_names = ['index'] + + def setup(self, index): + N = 100000 + indexes = {'int': np.arange(N), + 'datetime': date_range('20000101', periods=N, freq='H')} + df = DataFrame(np.random.randn(N, 5), + columns=['float_{}'.format(i) for i in range(5)], + index=indexes[index]) + df.to_json(self.fname, orient='records', lines=True) + + def time_read_json_lines(self, index): + read_json(self.fname, orient='records', lines=True) + + def time_read_json_lines_concat(self, index): + concat(read_json(self.fname, orient='records', lines=True, + chunksize=25000)) + + def peakmem_read_json_lines(self, index): + read_json(self.fname, orient='records', lines=True) + + def peakmem_read_json_lines_concat(self, index): + concat(read_json(self.fname, orient='records', lines=True, + chunksize=25000)) + + +class ToJSON(BaseIO): + + fname = "__test__.json" + params = ['split', 'columns', 'index'] + param_names = ['orient'] + + def setup(self, lines_orient): + N = 10**5 + ncols = 5 + index = date_range('20000101', periods=N, freq='H') + timedeltas = timedelta_range(start=1, periods=N, freq='s') + datetimes = date_range(start=1, periods=N, freq='s') + ints = np.random.randint(100000000, size=N) + floats = np.random.randn(N) + strings = tm.makeStringIndex(N) + self.df = DataFrame(np.random.randn(N, ncols), index=np.arange(N)) + self.df_date_idx = DataFrame(np.random.randn(N, ncols), index=index) + self.df_td_int_ts = DataFrame({'td_1': timedeltas, + 'td_2': timedeltas, + 'int_1': ints, + 'int_2': ints, + 'ts_1': datetimes, + 'ts_2': datetimes}, + index=index) + self.df_int_floats = DataFrame({'int_1': ints, + 'int_2': ints, + 'int_3': ints, + 'float_1': floats, + 'float_2': floats, + 'float_3': floats}, + index=index) + self.df_int_float_str = DataFrame({'int_1': ints, + 'int_2': ints, + 'float_1': floats, + 'float_2': floats, + 'str_1': strings, + 'str_2': strings}, + index=index) + + def time_floats_with_int_index(self, 
orient): + self.df.to_json(self.fname, orient=orient) + + def time_floats_with_dt_index(self, orient): + self.df_date_idx.to_json(self.fname, orient=orient) + + def time_delta_int_tstamp(self, orient): + self.df_td_int_ts.to_json(self.fname, orient=orient) + + def time_float_int(self, orient): + self.df_int_floats.to_json(self.fname, orient=orient) + + def time_float_int_str(self, orient): + self.df_int_float_str.to_json(self.fname, orient=orient) + + def time_floats_with_int_idex_lines(self, orient): + self.df.to_json(self.fname, orient='records', lines=True) + + def time_floats_with_dt_index_lines(self, orient): + self.df_date_idx.to_json(self.fname, orient='records', lines=True) + + def time_delta_int_tstamp_lines(self, orient): + self.df_td_int_ts.to_json(self.fname, orient='records', lines=True) + + def time_float_int_lines(self, orient): + self.df_int_floats.to_json(self.fname, orient='records', lines=True) + + def time_float_int_str_lines(self, orient): + self.df_int_float_str.to_json(self.fname, orient='records', lines=True) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/msgpack.py b/asv_bench/benchmarks/io/msgpack.py new file mode 100644 index 0000000000000..dc2642d920fd0 --- /dev/null +++ b/asv_bench/benchmarks/io/msgpack.py @@ -0,0 +1,27 @@ +import numpy as np +from pandas import DataFrame, date_range, read_msgpack +import pandas.util.testing as tm + +from ..pandas_vb_common import BaseIO + + +class MSGPack(BaseIO): + + def setup(self): + self.fname = '__test__.msg' + N = 100000 + C = 5 + self.df = DataFrame(np.random.randn(N, C), + columns=['float{}'.format(i) for i in range(C)], + index=date_range('20000101', periods=N, freq='H')) + self.df['object'] = tm.makeStringIndex(N) + self.df.to_msgpack(self.fname) + + def time_read_msgpack(self): + read_msgpack(self.fname) + + def time_write_msgpack(self): + self.df.to_msgpack(self.fname) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git 
a/asv_bench/benchmarks/io/pickle.py b/asv_bench/benchmarks/io/pickle.py new file mode 100644 index 0000000000000..74a58bbb946aa --- /dev/null +++ b/asv_bench/benchmarks/io/pickle.py @@ -0,0 +1,27 @@ +import numpy as np +from pandas import DataFrame, date_range, read_pickle +import pandas.util.testing as tm + +from ..pandas_vb_common import BaseIO + + +class Pickle(BaseIO): + + def setup(self): + self.fname = '__test__.pkl' + N = 100000 + C = 5 + self.df = DataFrame(np.random.randn(N, C), + columns=['float{}'.format(i) for i in range(C)], + index=date_range('20000101', periods=N, freq='H')) + self.df['object'] = tm.makeStringIndex(N) + self.df.to_pickle(self.fname) + + def time_read_pickle(self): + read_pickle(self.fname) + + def time_write_pickle(self): + self.df.to_pickle(self.fname) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/sas.py b/asv_bench/benchmarks/io/sas.py new file mode 100644 index 0000000000000..2783f42cad895 --- /dev/null +++ b/asv_bench/benchmarks/io/sas.py @@ -0,0 +1,20 @@ +import os + +from pandas import read_sas + + +class SAS(object): + + params = ['sas7bdat', 'xport'] + param_names = ['format'] + + def setup(self, format): + # Read files that are located in 'pandas/io/tests/sas/data' + files = {'sas7bdat': 'test1.sas7bdat', 'xport': 'paxraw_d_short.xpt'} + file = files[format] + paths = [os.path.dirname(__file__), '..', '..', '..', 'pandas', + 'tests', 'io', 'sas', 'data', file] + self.f = os.path.join(*paths) + + def time_read_msgpack(self, format): + read_sas(self.f, format=format) diff --git a/asv_bench/benchmarks/io/sql.py b/asv_bench/benchmarks/io/sql.py new file mode 100644 index 0000000000000..075d3bdda5ed9 --- /dev/null +++ b/asv_bench/benchmarks/io/sql.py @@ -0,0 +1,127 @@ +import sqlite3 + +import numpy as np +import pandas.util.testing as tm +from pandas import DataFrame, date_range, read_sql_query, read_sql_table +from sqlalchemy import create_engine + + +class SQL(object): + + params = 
['sqlalchemy', 'sqlite'] + param_names = ['connection'] + + def setup(self, connection): + N = 10000 + con = {'sqlalchemy': create_engine('sqlite:///:memory:'), + 'sqlite': sqlite3.connect(':memory:')} + self.table_name = 'test_type' + self.query_all = 'SELECT * FROM {}'.format(self.table_name) + self.con = con[connection] + self.df = DataFrame({'float': np.random.randn(N), + 'float_with_nan': np.random.randn(N), + 'string': ['foo'] * N, + 'bool': [True] * N, + 'int': np.random.randint(0, N, size=N), + 'datetime': date_range('2000-01-01', + periods=N, + freq='s')}, + index=tm.makeStringIndex(N)) + self.df.loc[1000:3000, 'float_with_nan'] = np.nan + self.df['datetime_string'] = self.df['datetime'].astype(str) + self.df.to_sql(self.table_name, self.con, if_exists='replace') + + def time_to_sql_dataframe(self, connection): + self.df.to_sql('test1', self.con, if_exists='replace') + + def time_read_sql_query(self, connection): + read_sql_query(self.query_all, self.con) + + +class WriteSQLDtypes(object): + + params = (['sqlalchemy', 'sqlite'], + ['float', 'float_with_nan', 'string', 'bool', 'int', 'datetime']) + param_names = ['connection', 'dtype'] + + def setup(self, connection, dtype): + N = 10000 + con = {'sqlalchemy': create_engine('sqlite:///:memory:'), + 'sqlite': sqlite3.connect(':memory:')} + self.table_name = 'test_type' + self.query_col = 'SELECT {} FROM {}'.format(dtype, self.table_name) + self.con = con[connection] + self.df = DataFrame({'float': np.random.randn(N), + 'float_with_nan': np.random.randn(N), + 'string': ['foo'] * N, + 'bool': [True] * N, + 'int': np.random.randint(0, N, size=N), + 'datetime': date_range('2000-01-01', + periods=N, + freq='s')}, + index=tm.makeStringIndex(N)) + self.df.loc[1000:3000, 'float_with_nan'] = np.nan + self.df['datetime_string'] = self.df['datetime'].astype(str) + self.df.to_sql(self.table_name, self.con, if_exists='replace') + + def time_to_sql_dataframe_column(self, connection, dtype): + 
self.df[[dtype]].to_sql('test1', self.con, if_exists='replace') + + def time_read_sql_query_select_column(self, connection, dtype): + read_sql_query(self.query_col, self.con) + + +class ReadSQLTable(object): + + def setup(self): + N = 10000 + self.table_name = 'test' + self.con = create_engine('sqlite:///:memory:') + self.df = DataFrame({'float': np.random.randn(N), + 'float_with_nan': np.random.randn(N), + 'string': ['foo'] * N, + 'bool': [True] * N, + 'int': np.random.randint(0, N, size=N), + 'datetime': date_range('2000-01-01', + periods=N, + freq='s')}, + index=tm.makeStringIndex(N)) + self.df.loc[1000:3000, 'float_with_nan'] = np.nan + self.df['datetime_string'] = self.df['datetime'].astype(str) + self.df.to_sql(self.table_name, self.con, if_exists='replace') + + def time_read_sql_table_all(self): + read_sql_table(self.table_name, self.con) + + def time_read_sql_table_parse_dates(self): + read_sql_table(self.table_name, self.con, columns=['datetime_string'], + parse_dates=['datetime_string']) + + +class ReadSQLTableDtypes(object): + + params = ['float', 'float_with_nan', 'string', 'bool', 'int', 'datetime'] + param_names = ['dtype'] + + def setup(self, dtype): + N = 10000 + self.table_name = 'test' + self.con = create_engine('sqlite:///:memory:') + self.df = DataFrame({'float': np.random.randn(N), + 'float_with_nan': np.random.randn(N), + 'string': ['foo'] * N, + 'bool': [True] * N, + 'int': np.random.randint(0, N, size=N), + 'datetime': date_range('2000-01-01', + periods=N, + freq='s')}, + index=tm.makeStringIndex(N)) + self.df.loc[1000:3000, 'float_with_nan'] = np.nan + self.df['datetime_string'] = self.df['datetime'].astype(str) + self.df.to_sql(self.table_name, self.con, if_exists='replace') + + def time_read_sql_table_column(self, dtype): + read_sql_table(self.table_name, self.con, columns=[dtype]) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io/stata.py b/asv_bench/benchmarks/io/stata.py new file mode 100644 
index 0000000000000..a7f854a853f50 --- /dev/null +++ b/asv_bench/benchmarks/io/stata.py @@ -0,0 +1,39 @@ +import numpy as np +from pandas import DataFrame, date_range, read_stata +import pandas.util.testing as tm + +from ..pandas_vb_common import BaseIO + + +class Stata(BaseIO): + + params = ['tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'] + param_names = ['convert_dates'] + + def setup(self, convert_dates): + self.fname = '__test__.dta' + N = 100000 + C = 5 + self.df = DataFrame(np.random.randn(N, C), + columns=['float{}'.format(i) for i in range(C)], + index=date_range('20000101', periods=N, freq='H')) + self.df['object'] = tm.makeStringIndex(N) + self.df['int8_'] = np.random.randint(np.iinfo(np.int8).min, + np.iinfo(np.int8).max - 27, N) + self.df['int16_'] = np.random.randint(np.iinfo(np.int16).min, + np.iinfo(np.int16).max - 27, N) + self.df['int32_'] = np.random.randint(np.iinfo(np.int32).min, + np.iinfo(np.int32).max - 27, N) + self.df['float32_'] = np.array(np.random.randn(N), + dtype=np.float32) + self.convert_dates = {'index': convert_dates} + self.df.to_stata(self.fname, self.convert_dates) + + def time_read_stata(self, convert_dates): + read_stata(self.fname) + + def time_write_stata(self, convert_dates): + self.df.to_stata(self.fname, self.convert_dates) + + +from ..pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/io_bench.py b/asv_bench/benchmarks/io_bench.py deleted file mode 100644 index 52064d2cdb8a2..0000000000000 --- a/asv_bench/benchmarks/io_bench.py +++ /dev/null @@ -1,194 +0,0 @@ -from .pandas_vb_common import * -from pandas import concat, Timestamp, compat -try: - from StringIO import StringIO -except ImportError: - from io import StringIO -import timeit - - -class frame_to_csv(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(3000, 30)) - - def time_frame_to_csv(self): - self.df.to_csv('__test__.csv') - - -class frame_to_csv2(object): - goal_time = 0.2 - - def setup(self): - self.df = 
DataFrame({'A': range(50000), }) - self.df['B'] = (self.df.A + 1.0) - self.df['C'] = (self.df.A + 2.0) - self.df['D'] = (self.df.A + 3.0) - - def time_frame_to_csv2(self): - self.df.to_csv('__test__.csv') - - -class frame_to_csv_date_formatting(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range('1/1/2000', periods=1000) - self.data = DataFrame(self.rng, index=self.rng) - - def time_frame_to_csv_date_formatting(self): - self.data.to_csv('__test__.csv', date_format='%Y%m%d') - - -class frame_to_csv_mixed(object): - goal_time = 0.2 - - def setup(self): - self.df_float = DataFrame(np.random.randn(5000, 5), dtype='float64', columns=self.create_cols('float')) - self.df_int = DataFrame(np.random.randn(5000, 5), dtype='int64', columns=self.create_cols('int')) - self.df_bool = DataFrame(True, index=self.df_float.index, columns=self.create_cols('bool')) - self.df_object = DataFrame('foo', index=self.df_float.index, columns=self.create_cols('object')) - self.df_dt = DataFrame(Timestamp('20010101'), index=self.df_float.index, columns=self.create_cols('date')) - self.df_float.ix[30:500, 1:3] = np.nan - self.df = concat([self.df_float, self.df_int, self.df_bool, self.df_object, self.df_dt], axis=1) - - def time_frame_to_csv_mixed(self): - self.df.to_csv('__test__.csv') - - def create_cols(self, name): - return [('%s%03d' % (name, i)) for i in range(5)] - - -class read_csv_infer_datetime_format_custom(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range('1/1/2000', periods=1000) - self.data = '\n'.join(self.rng.map((lambda x: x.strftime('%m/%d/%Y %H:%M:%S.%f')))) - - def time_read_csv_infer_datetime_format_custom(self): - read_csv(StringIO(self.data), header=None, names=['foo'], parse_dates=['foo'], infer_datetime_format=True) - - -class read_csv_infer_datetime_format_iso8601(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range('1/1/2000', periods=1000) - self.data = '\n'.join(self.rng.map((lambda x: x.strftime('%Y-%m-%d 
%H:%M:%S')))) - - def time_read_csv_infer_datetime_format_iso8601(self): - read_csv(StringIO(self.data), header=None, names=['foo'], parse_dates=['foo'], infer_datetime_format=True) - - -class read_csv_infer_datetime_format_ymd(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range('1/1/2000', periods=1000) - self.data = '\n'.join(self.rng.map((lambda x: x.strftime('%Y%m%d')))) - - def time_read_csv_infer_datetime_format_ymd(self): - read_csv(StringIO(self.data), header=None, names=['foo'], parse_dates=['foo'], infer_datetime_format=True) - - -class read_csv_skiprows(object): - goal_time = 0.2 - - def setup(self): - self.index = tm.makeStringIndex(20000) - self.df = DataFrame({'float1': randn(20000), 'float2': randn(20000), 'string1': (['foo'] * 20000), 'bool1': ([True] * 20000), 'int1': np.random.randint(0, 200000, size=20000), }, index=self.index) - self.df.to_csv('__test__.csv') - - def time_read_csv_skiprows(self): - read_csv('__test__.csv', skiprows=10000) - - -class read_csv_standard(object): - goal_time = 0.2 - - def setup(self): - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) - self.df.to_csv('__test__.csv') - - def time_read_csv_standard(self): - read_csv('__test__.csv') - - -class read_parse_dates_iso8601(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range('1/1/2000', periods=1000) - self.data = '\n'.join(self.rng.map((lambda x: x.strftime('%Y-%m-%d %H:%M:%S')))) - - def time_read_parse_dates_iso8601(self): - read_csv(StringIO(self.data), header=None, names=['foo'], parse_dates=['foo']) - - -class read_uint64_integers(object): - goal_time = 0.2 - - def setup(self): - self.na_values = [2**63 + 500] - - self.arr1 = np.arange(10000).astype('uint64') + 2**63 - self.data1 = '\n'.join(map(lambda x: str(x), self.arr1)) - - self.arr2 = 
self.arr1.copy().astype(object) - self.arr2[500] = -1 - self.data2 = '\n'.join(map(lambda x: str(x), self.arr2)) - - def time_read_uint64(self): - read_csv(StringIO(self.data1), header=None) - - def time_read_uint64_neg_values(self): - read_csv(StringIO(self.data2), header=None) - - def time_read_uint64_na_values(self): - read_csv(StringIO(self.data1), header=None, na_values=self.na_values) - - -class write_csv_standard(object): - goal_time = 0.2 - - def setup(self): - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) - - def time_write_csv_standard(self): - self.df.to_csv('__test__.csv') - - -class read_csv_from_s3(object): - # Make sure that we can read part of a file from S3 without - # needing to download the entire thing. Use the timeit.default_timer - # to measure wall time instead of CPU time -- we want to see - # how long it takes to download the data. - timer = timeit.default_timer - params = ([None, "gzip", "bz2"], ["python", "c"]) - param_names = ["compression", "engine"] - - def setup(self, compression, engine): - if compression == "bz2" and engine == "c" and compat.PY2: - # The Python 2 C parser can't read bz2 from open files. - raise NotImplementedError - try: - import s3fs - except ImportError: - # Skip these benchmarks if `boto` is not installed. - raise NotImplementedError - - self.big_fname = "s3://pandas-test/large_random.csv" - - def time_read_nrows(self, compression, engine): - # Read a small number of rows from a huge (100,000 x 50) table. 
- ext = "" - if compression == "gzip": - ext = ".gz" - elif compression == "bz2": - ext = ".bz2" - pd.read_csv(self.big_fname + ext, nrows=10, - compression=compression, engine=engine) diff --git a/asv_bench/benchmarks/io_sql.py b/asv_bench/benchmarks/io_sql.py deleted file mode 100644 index ec855e5d33525..0000000000000 --- a/asv_bench/benchmarks/io_sql.py +++ /dev/null @@ -1,105 +0,0 @@ -import sqlalchemy -from .pandas_vb_common import * -import sqlite3 -from sqlalchemy import create_engine - - -#------------------------------------------------------------------------------- -# to_sql - -class WriteSQL(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) - - def time_fallback(self): - self.df.to_sql('test1', self.con, if_exists='replace') - - def time_sqlalchemy(self): - self.df.to_sql('test1', self.engine, if_exists='replace') - - -#------------------------------------------------------------------------------- -# read_sql - -class ReadSQL(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.index = tm.makeStringIndex(10000) - self.df = DataFrame({'float1': randn(10000), 'float2': randn(10000), 'string1': (['foo'] * 10000), 'bool1': ([True] * 10000), 'int1': np.random.randint(0, 100000, size=10000), }, index=self.index) - self.df.to_sql('test2', self.engine, if_exists='replace') - self.df.to_sql('test2', self.con, if_exists='replace') - - def time_read_query_fallback(self): - read_sql_query('SELECT * FROM test2', self.con) - - def time_read_query_sqlalchemy(self): - read_sql_query('SELECT * FROM test2', self.engine) - - def 
time_read_table_sqlalchemy(self): - read_sql_table('test2', self.engine) - - -#------------------------------------------------------------------------------- -# type specific write - -class WriteSQLTypes(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'string': (['foo'] * 10000), 'bool': ([True] * 10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df.loc[1000:3000, 'float'] = np.nan - - def time_string_fallback(self): - self.df[['string']].to_sql('test_string', self.con, if_exists='replace') - - def time_string_sqlalchemy(self): - self.df[['string']].to_sql('test_string', self.engine, if_exists='replace') - - def time_float_fallback(self): - self.df[['float']].to_sql('test_float', self.con, if_exists='replace') - - def time_float_sqlalchemy(self): - self.df[['float']].to_sql('test_float', self.engine, if_exists='replace') - - def time_datetime_sqlalchemy(self): - self.df[['datetime']].to_sql('test_datetime', self.engine, if_exists='replace') - - -#------------------------------------------------------------------------------- -# type specific read - -class ReadSQLTypes(object): - goal_time = 0.2 - - def setup(self): - self.engine = create_engine('sqlite:///:memory:') - self.con = sqlite3.connect(':memory:') - self.df = DataFrame({'float': randn(10000), 'datetime': date_range('2000-01-01', periods=10000, freq='s'), }) - self.df['datetime_string'] = self.df['datetime'].map(str) - self.df.to_sql('test_type', self.engine, if_exists='replace') - self.df[['float', 'datetime_string']].to_sql('test_type', self.con, if_exists='replace') - - def time_datetime_read_and_parse_sqlalchemy(self): - read_sql_table('test_type', self.engine, columns=['datetime_string'], parse_dates=['datetime_string']) - - def time_datetime_read_as_native_sqlalchemy(self): - read_sql_table('test_type', self.engine, 
columns=['datetime']) - - def time_float_read_query_fallback(self): - read_sql_query('SELECT float FROM test_type', self.con) - - def time_float_read_query_sqlalchemy(self): - read_sql_query('SELECT float FROM test_type', self.engine) - - def time_float_read_table_sqlalchemy(self): - read_sql_table('test_type', self.engine, columns=['float']) diff --git a/asv_bench/benchmarks/join_merge.py b/asv_bench/benchmarks/join_merge.py index 776316343e009..6da8287a06d80 100644 --- a/asv_bench/benchmarks/join_merge.py +++ b/asv_bench/benchmarks/join_merge.py @@ -1,4 +1,10 @@ -from .pandas_vb_common import * +import warnings +import string + +import numpy as np +import pandas.util.testing as tm +from pandas import (DataFrame, Series, Panel, MultiIndex, + date_range, concat, merge, merge_asof) try: from pandas import merge_ordered @@ -6,25 +12,18 @@ from pandas import ordered_merge as merge_ordered -# ---------------------------------------------------------------------- -# Append - class Append(object): - goal_time = 0.2 def setup(self): - self.df1 = pd.DataFrame(np.random.randn(10000, 4), - columns=['A', 'B', 'C', 'D']) + self.df1 = DataFrame(np.random.randn(10000, 4), + columns=['A', 'B', 'C', 'D']) self.df2 = self.df1.copy() self.df2.index = np.arange(10000, 20000) self.mdf1 = self.df1.copy() self.mdf1['obj1'] = 'bar' self.mdf1['obj2'] = 'bar' self.mdf1['int1'] = 5 - try: - self.mdf1.consolidate(inplace=True) - except: - pass + self.mdf1 = self.mdf1._consolidate() self.mdf2 = self.mdf1.copy() self.mdf2.index = self.df2.index @@ -35,237 +34,220 @@ def time_append_mixed(self): self.mdf1.append(self.mdf2) -# ---------------------------------------------------------------------- -# Concat - class Concat(object): - goal_time = 0.2 - def setup(self): - self.n = 1000 - self.indices = tm.makeStringIndex(1000) - self.s = Series(self.n, index=self.indices) - self.pieces = [self.s[i:(- i)] for i in range(1, 10)] - self.pieces = (self.pieces * 50) + params = [0, 1] + param_names = 
['axis'] - self.df_small = pd.DataFrame(randn(5, 4)) + def setup(self, axis): + N = 1000 + s = Series(N, index=tm.makeStringIndex(N)) + self.series = [s[i:- i] for i in range(1, 10)] * 50 + self.small_frames = [DataFrame(np.random.randn(5, 4))] * 1000 + df = DataFrame({'A': range(N)}, + index=date_range('20130101', periods=N, freq='s')) + self.empty_left = [DataFrame(), df] + self.empty_right = [df, DataFrame()] + self.mixed_ndims = [df, df.head(N // 2)] - # empty - self.df = pd.DataFrame(dict(A=range(10000)), index=date_range('20130101', periods=10000, freq='s')) - self.empty = pd.DataFrame() + def time_concat_series(self, axis): + concat(self.series, axis=axis, sort=False) - def time_concat_series_axis1(self): - concat(self.pieces, axis=1) + def time_concat_small_frames(self, axis): + concat(self.small_frames, axis=axis) - def time_concat_small_frames(self): - concat(([self.df_small] * 1000)) + def time_concat_empty_right(self, axis): + concat(self.empty_right, axis=axis) - def time_concat_empty_frames1(self): - concat([self.df, self.empty]) + def time_concat_empty_left(self, axis): + concat(self.empty_left, axis=axis) - def time_concat_empty_frames2(self): - concat([self.empty, self.df]) + def time_concat_mixed_ndims(self, axis): + concat(self.mixed_ndims, axis=axis) class ConcatPanels(object): - goal_time = 0.2 - def setup(self): - dataset = np.zeros((10000, 200, 2), dtype=np.float32) - self.panels_f = [pd.Panel(np.copy(dataset, order='F')) - for i in range(20)] - self.panels_c = [pd.Panel(np.copy(dataset, order='C')) - for i in range(20)] + params = ([0, 1, 2], [True, False]) + param_names = ['axis', 'ignore_index'] - def time_c_ordered_axis0(self): - concat(self.panels_c, axis=0, ignore_index=True) + def setup(self, axis, ignore_index): + with warnings.catch_warnings(record=True): + panel_c = Panel(np.zeros((10000, 200, 2), + dtype=np.float32, + order='C')) + self.panels_c = [panel_c] * 20 + panel_f = Panel(np.zeros((10000, 200, 2), + dtype=np.float32, + 
order='F')) + self.panels_f = [panel_f] * 20 - def time_f_ordered_axis0(self): - concat(self.panels_f, axis=0, ignore_index=True) + def time_c_ordered(self, axis, ignore_index): + with warnings.catch_warnings(record=True): + concat(self.panels_c, axis=axis, ignore_index=ignore_index) - def time_c_ordered_axis1(self): - concat(self.panels_c, axis=1, ignore_index=True) + def time_f_ordered(self, axis, ignore_index): + with warnings.catch_warnings(record=True): + concat(self.panels_f, axis=axis, ignore_index=ignore_index) - def time_f_ordered_axis1(self): - concat(self.panels_f, axis=1, ignore_index=True) - def time_c_ordered_axis2(self): - concat(self.panels_c, axis=2, ignore_index=True) +class ConcatDataFrames(object): - def time_f_ordered_axis2(self): - concat(self.panels_f, axis=2, ignore_index=True) + params = ([0, 1], [True, False]) + param_names = ['axis', 'ignore_index'] + def setup(self, axis, ignore_index): + frame_c = DataFrame(np.zeros((10000, 200), + dtype=np.float32, order='C')) + self.frame_c = [frame_c] * 20 + frame_f = DataFrame(np.zeros((10000, 200), + dtype=np.float32, order='F')) + self.frame_f = [frame_f] * 20 -class ConcatFrames(object): - goal_time = 0.2 + def time_c_ordered(self, axis, ignore_index): + concat(self.frame_c, axis=axis, ignore_index=ignore_index) - def setup(self): - dataset = np.zeros((10000, 200), dtype=np.float32) + def time_f_ordered(self, axis, ignore_index): + concat(self.frame_f, axis=axis, ignore_index=ignore_index) - self.frames_f = [pd.DataFrame(np.copy(dataset, order='F')) - for i in range(20)] - self.frames_c = [pd.DataFrame(np.copy(dataset, order='C')) - for i in range(20)] - def time_c_ordered_axis0(self): - concat(self.frames_c, axis=0, ignore_index=True) +class Join(object): - def time_f_ordered_axis0(self): - concat(self.frames_f, axis=0, ignore_index=True) + params = [True, False] + param_names = ['sort'] - def time_c_ordered_axis1(self): - concat(self.frames_c, axis=1, ignore_index=True) + def setup(self, sort): 
+ level1 = tm.makeStringIndex(10).values + level2 = tm.makeStringIndex(1000).values + codes1 = np.arange(10).repeat(1000) + codes2 = np.tile(np.arange(1000), 10) + index2 = MultiIndex(levels=[level1, level2], + codes=[codes1, codes2]) + self.df_multi = DataFrame(np.random.randn(len(index2), 4), + index=index2, + columns=['A', 'B', 'C', 'D']) - def time_f_ordered_axis1(self): - concat(self.frames_f, axis=1, ignore_index=True) + self.key1 = np.tile(level1.take(codes1), 10) + self.key2 = np.tile(level2.take(codes2), 10) + self.df = DataFrame({'data1': np.random.randn(100000), + 'data2': np.random.randn(100000), + 'key1': self.key1, + 'key2': self.key2}) + self.df_key1 = DataFrame(np.random.randn(len(level1), 4), + index=level1, + columns=['A', 'B', 'C', 'D']) + self.df_key2 = DataFrame(np.random.randn(len(level2), 4), + index=level2, + columns=['A', 'B', 'C', 'D']) -# ---------------------------------------------------------------------- -# Joins + shuf = np.arange(100000) + np.random.shuffle(shuf) + self.df_shuf = self.df.reindex(self.df.index[shuf]) -class Join(object): - goal_time = 0.2 + def time_join_dataframe_index_multi(self, sort): + self.df.join(self.df_multi, on=['key1', 'key2'], sort=sort) - def setup(self): - self.level1 = tm.makeStringIndex(10).values - self.level2 = tm.makeStringIndex(1000).values - self.label1 = np.arange(10).repeat(1000) - self.label2 = np.tile(np.arange(1000), 10) - self.key1 = np.tile(self.level1.take(self.label1), 10) - self.key2 = np.tile(self.level2.take(self.label2), 10) - self.shuf = np.arange(100000) - random.shuffle(self.shuf) - try: - self.index2 = MultiIndex(levels=[self.level1, self.level2], - labels=[self.label1, self.label2]) - self.index3 = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], - labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - self.df_multi = DataFrame(np.random.randn(len(self.index2), 4), - index=self.index2, - 
columns=['A', 'B', 'C', 'D']) - except: - pass - self.df = pd.DataFrame({'data1': np.random.randn(100000), - 'data2': np.random.randn(100000), - 'key1': self.key1, - 'key2': self.key2}) - self.df_key1 = pd.DataFrame(np.random.randn(len(self.level1), 4), - index=self.level1, - columns=['A', 'B', 'C', 'D']) - self.df_key2 = pd.DataFrame(np.random.randn(len(self.level2), 4), - index=self.level2, - columns=['A', 'B', 'C', 'D']) - self.df_shuf = self.df.reindex(self.df.index[self.shuf]) - - def time_join_dataframe_index_multi(self): - self.df.join(self.df_multi, on=['key1', 'key2']) - - def time_join_dataframe_index_single_key_bigger(self): - self.df.join(self.df_key2, on='key2') - - def time_join_dataframe_index_single_key_bigger_sort(self): - self.df_shuf.join(self.df_key2, on='key2', sort=True) - - def time_join_dataframe_index_single_key_small(self): - self.df.join(self.df_key1, on='key1') + def time_join_dataframe_index_single_key_bigger(self, sort): + self.df.join(self.df_key2, on='key2', sort=sort) + + def time_join_dataframe_index_single_key_small(self, sort): + self.df.join(self.df_key1, on='key1', sort=sort) + + def time_join_dataframe_index_shuffle_key_bigger_sort(self, sort): + self.df_shuf.join(self.df_key2, on='key2', sort=sort) class JoinIndex(object): - goal_time = 0.2 def setup(self): - np.random.seed(2718281) - self.n = 50000 - self.left = pd.DataFrame(np.random.randint(1, (self.n / 500), (self.n, 2)), columns=['jim', 'joe']) - self.right = pd.DataFrame(np.random.randint(1, (self.n / 500), (self.n, 2)), columns=['jolie', 'jolia']).set_index('jolie') + N = 50000 + self.left = DataFrame(np.random.randint(1, N / 500, (N, 2)), + columns=['jim', 'joe']) + self.right = DataFrame(np.random.randint(1, N / 500, (N, 2)), + columns=['jolie', 'jolia']).set_index('jolie') def time_left_outer_join_index(self): self.left.join(self.right, on='jim') -class join_non_unique_equal(object): +class JoinNonUnique(object): # outer join of non-unique # GH 6329 - - goal_time = 
0.2 - def setup(self): - self.date_index = date_range('01-Jan-2013', '23-Jan-2013', freq='T') - self.daily_dates = self.date_index.to_period('D').to_timestamp('S', 'S') - self.fracofday = (self.date_index.view(np.ndarray) - self.daily_dates.view(np.ndarray)) - self.fracofday = (self.fracofday.astype('timedelta64[ns]').astype(np.float64) / 86400000000000.0) - self.fracofday = Series(self.fracofday, self.daily_dates) - self.index = date_range(self.date_index.min().to_period('A').to_timestamp('D', 'S'), self.date_index.max().to_period('A').to_timestamp('D', 'E'), freq='D') - self.temp = Series(1.0, self.index) + date_index = date_range('01-Jan-2013', '23-Jan-2013', freq='T') + daily_dates = date_index.to_period('D').to_timestamp('S', 'S') + self.fracofday = date_index.values - daily_dates.values + self.fracofday = self.fracofday.astype('timedelta64[ns]') + self.fracofday = self.fracofday.astype(np.float64) / 86400000000000.0 + self.fracofday = Series(self.fracofday, daily_dates) + index = date_range(date_index.min(), date_index.max(), freq='D') + self.temp = Series(1.0, index)[self.fracofday.index] def time_join_non_unique_equal(self): - (self.fracofday * self.temp[self.fracofday.index]) + self.fracofday * self.temp -# ---------------------------------------------------------------------- -# Merges - class Merge(object): - goal_time = 0.2 - def setup(self): - self.N = 10000 - self.indices = tm.makeStringIndex(self.N).values - self.indices2 = tm.makeStringIndex(self.N).values - self.key = np.tile(self.indices[:8000], 10) - self.key2 = np.tile(self.indices2[:8000], 10) - self.left = pd.DataFrame({'key': self.key, 'key2': self.key2, - 'value': np.random.randn(80000)}) - self.right = pd.DataFrame({'key': self.indices[2000:], - 'key2': self.indices2[2000:], - 'value2': np.random.randn(8000)}) - - self.df = pd.DataFrame({'key1': np.tile(np.arange(500).repeat(10), 2), - 'key2': np.tile(np.arange(250).repeat(10), 4), - 'value': np.random.randn(10000)}) - self.df2 = 
pd.DataFrame({'key1': np.arange(500), 'value2': randn(500)}) + params = [True, False] + param_names = ['sort'] + + def setup(self, sort): + N = 10000 + indices = tm.makeStringIndex(N).values + indices2 = tm.makeStringIndex(N).values + key = np.tile(indices[:8000], 10) + key2 = np.tile(indices2[:8000], 10) + self.left = DataFrame({'key': key, 'key2': key2, + 'value': np.random.randn(80000)}) + self.right = DataFrame({'key': indices[2000:], + 'key2': indices2[2000:], + 'value2': np.random.randn(8000)}) + + self.df = DataFrame({'key1': np.tile(np.arange(500).repeat(10), 2), + 'key2': np.tile(np.arange(250).repeat(10), 4), + 'value': np.random.randn(10000)}) + self.df2 = DataFrame({'key1': np.arange(500), + 'value2': np.random.randn(500)}) self.df3 = self.df[:5000] - def time_merge_2intkey_nosort(self): - merge(self.left, self.right, sort=False) + def time_merge_2intkey(self, sort): + merge(self.left, self.right, sort=sort) - def time_merge_2intkey_sort(self): - merge(self.left, self.right, sort=True) + def time_merge_dataframe_integer_2key(self, sort): + merge(self.df, self.df3, sort=sort) - def time_merge_dataframe_integer_2key(self): - merge(self.df, self.df3) + def time_merge_dataframe_integer_key(self, sort): + merge(self.df, self.df2, on='key1', sort=sort) - def time_merge_dataframe_integer_key(self): - merge(self.df, self.df2, on='key1') +class I8Merge(object): -class i8merge(object): - goal_time = 0.2 + params = ['inner', 'outer', 'left', 'right'] + param_names = ['how'] - def setup(self): - (low, high, n) = (((-1) << 10), (1 << 10), (1 << 20)) - self.left = pd.DataFrame(np.random.randint(low, high, (n, 7)), - columns=list('ABCDEFG')) + def setup(self, how): + low, high, n = -1000, 1000, 10**6 + self.left = DataFrame(np.random.randint(low, high, (n, 7)), + columns=list('ABCDEFG')) self.left['left'] = self.left.sum(axis=1) - self.i = np.random.permutation(len(self.left)) - self.right = self.left.iloc[self.i].copy() - self.right.columns = 
(self.right.columns[:(-1)].tolist() + ['right']) - self.right.index = np.arange(len(self.right)) - self.right['right'] *= (-1) + self.right = self.left.sample(frac=1).rename({'left': 'right'}, axis=1) + self.right = self.right.reset_index(drop=True) + self.right['right'] *= -1 - def time_i8merge(self): - merge(self.left, self.right, how='outer') + def time_i8merge(self, how): + merge(self.left, self.right, how=how) class MergeCategoricals(object): - goal_time = 0.2 def setup(self): - self.left_object = pd.DataFrame( + self.left_object = DataFrame( {'X': np.random.choice(range(0, 10), size=(10000,)), 'Y': np.random.choice(['one', 'two', 'three'], size=(10000,))}) - self.right_object = pd.DataFrame( + self.right_object = DataFrame( {'X': np.random.choice(range(0, 10), size=(10000,)), 'Z': np.random.choice(['jjj', 'kkk', 'sss'], size=(10000,))}) @@ -281,103 +263,91 @@ def time_merge_cat(self): merge(self.left_cat, self.right_cat, on='X') -# ---------------------------------------------------------------------- -# Ordered merge - class MergeOrdered(object): def setup(self): - groups = tm.makeStringIndex(10).values - - self.left = pd.DataFrame({'group': groups.repeat(5000), - 'key' : np.tile(np.arange(0, 10000, 2), 10), - 'lvalue': np.random.randn(50000)}) - - self.right = pd.DataFrame({'key' : np.arange(10000), - 'rvalue' : np.random.randn(10000)}) + self.left = DataFrame({'group': groups.repeat(5000), + 'key': np.tile(np.arange(0, 10000, 2), 10), + 'lvalue': np.random.randn(50000)}) + self.right = DataFrame({'key': np.arange(10000), + 'rvalue': np.random.randn(10000)}) def time_merge_ordered(self): merge_ordered(self.left, self.right, on='key', left_by='group') -# ---------------------------------------------------------------------- -# asof merge - class MergeAsof(object): + params = [['backward', 'forward', 'nearest']] + param_names = ['direction'] - def setup(self): - import string - np.random.seed(0) + def setup(self, direction): one_count = 200000 two_count = 
1000000 - self.df1 = pd.DataFrame( + df1 = DataFrame( {'time': np.random.randint(0, one_count / 20, one_count), - 'key': np.random.choice(list(string.uppercase), one_count), + 'key': np.random.choice(list(string.ascii_uppercase), one_count), 'key2': np.random.randint(0, 25, one_count), 'value1': np.random.randn(one_count)}) - self.df2 = pd.DataFrame( + df2 = DataFrame( {'time': np.random.randint(0, two_count / 20, two_count), - 'key': np.random.choice(list(string.uppercase), two_count), + 'key': np.random.choice(list(string.ascii_uppercase), two_count), 'key2': np.random.randint(0, 25, two_count), 'value2': np.random.randn(two_count)}) - self.df1 = self.df1.sort_values('time') - self.df2 = self.df2.sort_values('time') + df1 = df1.sort_values('time') + df2 = df2.sort_values('time') - self.df1['time32'] = np.int32(self.df1.time) - self.df2['time32'] = np.int32(self.df2.time) + df1['time32'] = np.int32(df1.time) + df2['time32'] = np.int32(df2.time) - self.df1a = self.df1[['time', 'value1']] - self.df2a = self.df2[['time', 'value2']] - self.df1b = self.df1[['time', 'key', 'value1']] - self.df2b = self.df2[['time', 'key', 'value2']] - self.df1c = self.df1[['time', 'key2', 'value1']] - self.df2c = self.df2[['time', 'key2', 'value2']] - self.df1d = self.df1[['time32', 'value1']] - self.df2d = self.df2[['time32', 'value2']] - self.df1e = self.df1[['time', 'key', 'key2', 'value1']] - self.df2e = self.df2[['time', 'key', 'key2', 'value2']] + self.df1a = df1[['time', 'value1']] + self.df2a = df2[['time', 'value2']] + self.df1b = df1[['time', 'key', 'value1']] + self.df2b = df2[['time', 'key', 'value2']] + self.df1c = df1[['time', 'key2', 'value1']] + self.df2c = df2[['time', 'key2', 'value2']] + self.df1d = df1[['time32', 'value1']] + self.df2d = df2[['time32', 'value2']] + self.df1e = df1[['time', 'key', 'key2', 'value1']] + self.df2e = df2[['time', 'key', 'key2', 'value2']] - def time_noby(self): - merge_asof(self.df1a, self.df2a, on='time') + def time_on_int(self, 
direction): + merge_asof(self.df1a, self.df2a, on='time', direction=direction) - def time_by_object(self): - merge_asof(self.df1b, self.df2b, on='time', by='key') + def time_on_int32(self, direction): + merge_asof(self.df1d, self.df2d, on='time32', direction=direction) - def time_by_int(self): - merge_asof(self.df1c, self.df2c, on='time', by='key2') + def time_by_object(self, direction): + merge_asof(self.df1b, self.df2b, on='time', by='key', + direction=direction) - def time_on_int32(self): - merge_asof(self.df1d, self.df2d, on='time32') + def time_by_int(self, direction): + merge_asof(self.df1c, self.df2c, on='time', by='key2', + direction=direction) - def time_multiby(self): - merge_asof(self.df1e, self.df2e, on='time', by=['key', 'key2']) + def time_multiby(self, direction): + merge_asof(self.df1e, self.df2e, on='time', by=['key', 'key2'], + direction=direction) -# ---------------------------------------------------------------------- -# data alignment - class Align(object): - goal_time = 0.2 def setup(self): - self.n = 1000000 - self.sz = 500000 - self.rng = np.arange(0, 10000000000000, 10000000) - self.stamps = (np.datetime64(datetime.now()).view('i8') + self.rng) - self.idx1 = np.sort(self.sample(self.stamps, self.sz)) - self.idx2 = np.sort(self.sample(self.stamps, self.sz)) - self.ts1 = Series(np.random.randn(self.sz), self.idx1) - self.ts2 = Series(np.random.randn(self.sz), self.idx2) - - def sample(self, values, k): - self.sampler = np.random.permutation(len(values)) - return values.take(self.sampler[:k]) + size = 5 * 10**5 + rng = np.arange(0, 10**13, 10**7) + stamps = np.datetime64('now').view('i8') + rng + idx1 = np.sort(np.random.choice(stamps, size, replace=False)) + idx2 = np.sort(np.random.choice(stamps, size, replace=False)) + self.ts1 = Series(np.random.randn(size), idx1) + self.ts2 = Series(np.random.randn(size), idx2) def time_series_align_int64_index(self): - (self.ts1 + self.ts2) + self.ts1 + self.ts2 def 
time_series_align_left_monotonic(self): self.ts1.align(self.ts2, join='left') + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/multiindex_object.py b/asv_bench/benchmarks/multiindex_object.py new file mode 100644 index 0000000000000..adc6730dcd946 --- /dev/null +++ b/asv_bench/benchmarks/multiindex_object.py @@ -0,0 +1,129 @@ +import string + +import numpy as np +import pandas.util.testing as tm +from pandas import date_range, MultiIndex + + +class GetLoc(object): + + def setup(self): + self.mi_large = MultiIndex.from_product( + [np.arange(1000), np.arange(20), list(string.ascii_letters)], + names=['one', 'two', 'three']) + self.mi_med = MultiIndex.from_product( + [np.arange(1000), np.arange(10), list('A')], + names=['one', 'two', 'three']) + self.mi_small = MultiIndex.from_product( + [np.arange(100), list('A'), list('A')], + names=['one', 'two', 'three']) + + def time_large_get_loc(self): + self.mi_large.get_loc((999, 19, 'Z')) + + def time_large_get_loc_warm(self): + for _ in range(1000): + self.mi_large.get_loc((999, 19, 'Z')) + + def time_med_get_loc(self): + self.mi_med.get_loc((999, 9, 'A')) + + def time_med_get_loc_warm(self): + for _ in range(1000): + self.mi_med.get_loc((999, 9, 'A')) + + def time_string_get_loc(self): + self.mi_small.get_loc((99, 'A', 'A')) + + def time_small_get_loc_warm(self): + for _ in range(1000): + self.mi_small.get_loc((99, 'A', 'A')) + + +class Duplicates(object): + + def setup(self): + size = 65536 + arrays = [np.random.randint(0, 8192, size), + np.random.randint(0, 1024, size)] + mask = np.random.rand(size) < 0.1 + self.mi_unused_levels = MultiIndex.from_arrays(arrays) + self.mi_unused_levels = self.mi_unused_levels[mask] + + def time_remove_unused_levels(self): + self.mi_unused_levels.remove_unused_levels() + + +class Integer(object): + + def setup(self): + self.mi_int = MultiIndex.from_product([np.arange(1000), + np.arange(1000)], + names=['one', 'two']) + self.obj_index = np.array([(0, 
10), (0, 11), (0, 12), + (0, 13), (0, 14), (0, 15), + (0, 16), (0, 17), (0, 18), + (0, 19)], dtype=object) + + def time_get_indexer(self): + self.mi_int.get_indexer(self.obj_index) + + def time_is_monotonic(self): + self.mi_int.is_monotonic + + +class Duplicated(object): + + def setup(self): + n, k = 200, 5000 + levels = [np.arange(n), + tm.makeStringIndex(n).values, + 1000 + np.arange(n)] + codes = [np.random.choice(n, (k * n)) for lev in levels] + self.mi = MultiIndex(levels=levels, codes=codes) + + def time_duplicated(self): + self.mi.duplicated() + + +class Sortlevel(object): + + def setup(self): + n = 1182720 + low, high = -4096, 4096 + arrs = [np.repeat(np.random.randint(low, high, (n // k)), k) + for k in [11, 7, 5, 3, 1]] + self.mi_int = MultiIndex.from_arrays(arrs)[np.random.permutation(n)] + + a = np.repeat(np.arange(100), 1000) + b = np.tile(np.arange(1000), 100) + self.mi = MultiIndex.from_arrays([a, b]) + self.mi = self.mi.take(np.random.permutation(np.arange(100000))) + + def time_sortlevel_int64(self): + self.mi_int.sortlevel() + + def time_sortlevel_zero(self): + self.mi.sortlevel(0) + + def time_sortlevel_one(self): + self.mi.sortlevel(1) + + +class Values(object): + + def setup_cache(self): + + level1 = range(1000) + level2 = date_range(start='1/1/2012', periods=100) + mi = MultiIndex.from_product([level1, level2]) + return mi + + def time_datetime_level_values_copy(self, mi): + mi.copy().values + + def time_datetime_level_values_sliced(self, mi): + mi[:10].values + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/offset.py b/asv_bench/benchmarks/offset.py new file mode 100644 index 0000000000000..4570e73cccc71 --- /dev/null +++ b/asv_bench/benchmarks/offset.py @@ -0,0 +1,118 @@ +# -*- coding: utf-8 -*- +import warnings +from datetime import datetime + +import numpy as np +import pandas as pd +try: + import pandas.tseries.holiday # noqa +except ImportError: + pass + +hcal = 
pd.tseries.holiday.USFederalHolidayCalendar() +# These offsets currently raise a NotImplementedError with .apply_index() +non_apply = [pd.offsets.Day(), + pd.offsets.BYearEnd(), + pd.offsets.BYearBegin(), + pd.offsets.BQuarterEnd(), + pd.offsets.BQuarterBegin(), + pd.offsets.BMonthEnd(), + pd.offsets.BMonthBegin(), + pd.offsets.CustomBusinessDay(), + pd.offsets.CustomBusinessDay(calendar=hcal), + pd.offsets.CustomBusinessMonthBegin(calendar=hcal), + pd.offsets.CustomBusinessMonthEnd(calendar=hcal), + pd.offsets.CustomBusinessMonthEnd(calendar=hcal)] +other_offsets = [pd.offsets.YearEnd(), pd.offsets.YearBegin(), + pd.offsets.QuarterEnd(), pd.offsets.QuarterBegin(), + pd.offsets.MonthEnd(), pd.offsets.MonthBegin(), + pd.offsets.DateOffset(months=2, days=2), + pd.offsets.BusinessDay(), pd.offsets.SemiMonthEnd(), + pd.offsets.SemiMonthBegin()] +offsets = non_apply + other_offsets + + +class ApplyIndex(object): + + params = other_offsets + param_names = ['offset'] + + def setup(self, offset): + N = 10000 + self.rng = pd.date_range(start='1/1/2000', periods=N, freq='T') + + def time_apply_index(self, offset): + offset.apply_index(self.rng) + + +class OnOffset(object): + + params = offsets + param_names = ['offset'] + + def setup(self, offset): + self.dates = [datetime(2016, m, d) + for m in [10, 11, 12] + for d in [1, 2, 3, 28, 29, 30, 31] + if not (m == 11 and d == 31)] + + def time_on_offset(self, offset): + for date in self.dates: + offset.onOffset(date) + + +class OffsetSeriesArithmetic(object): + + params = offsets + param_names = ['offset'] + + def setup(self, offset): + N = 1000 + rng = pd.date_range(start='1/1/2000', periods=N, freq='T') + self.data = pd.Series(rng) + + def time_add_offset(self, offset): + with warnings.catch_warnings(record=True): + self.data + offset + + +class OffsetDatetimeIndexArithmetic(object): + + params = offsets + param_names = ['offset'] + + def setup(self, offset): + N = 1000 + self.data = pd.date_range(start='1/1/2000', periods=N, 
freq='T') + + def time_add_offset(self, offset): + with warnings.catch_warnings(record=True): + self.data + offset + + +class OffestDatetimeArithmetic(object): + + params = offsets + param_names = ['offset'] + + def setup(self, offset): + self.date = datetime(2011, 1, 1) + self.dt64 = np.datetime64('2011-01-01 09:00Z') + + def time_apply(self, offset): + offset.apply(self.date) + + def time_apply_np_dt64(self, offset): + offset.apply(self.dt64) + + def time_add(self, offset): + self.date + offset + + def time_add_10(self, offset): + self.date + (10 * offset) + + def time_subtract(self, offset): + self.date - offset + + def time_subtract_10(self, offset): + self.date - (10 * offset) diff --git a/asv_bench/benchmarks/packers.py b/asv_bench/benchmarks/packers.py deleted file mode 100644 index cd43e305ead8f..0000000000000 --- a/asv_bench/benchmarks/packers.py +++ /dev/null @@ -1,316 +0,0 @@ -from .pandas_vb_common import * -from numpy.random import randint -import pandas as pd -from collections import OrderedDict -from pandas.compat import BytesIO -import sqlite3 -import os -from sqlalchemy import create_engine -import numpy as np -from random import randrange - -class _Packers(object): - goal_time = 0.2 - - def _setup(self): - self.f = '__test__.msg' - self.N = 100000 - self.C = 5 - self.index = date_range('20000101', periods=self.N, freq='H') - self.df = DataFrame(dict([('float{0}'.format(i), randn(self.N)) for i in range(self.C)]), index=self.index) - self.df2 = self.df.copy() - self.df2['object'] = [('%08x' % randrange((16 ** 8))) for _ in range(self.N)] - self.remove(self.f) - - def remove(self, f): - try: - os.remove(self.f) - except: - pass - -class Packers(_Packers): - goal_time = 0.2 - - def setup(self): - self._setup() - self.df.to_csv(self.f) - - def time_packers_read_csv(self): - pd.read_csv(self.f) - -class packers_read_excel(_Packers): - goal_time = 0.2 - - def setup(self): - self._setup() - self.bio = BytesIO() - self.writer = 
pd.io.excel.ExcelWriter(self.bio, engine='xlsxwriter') - self.df[:2000].to_excel(self.writer) - self.writer.save() - - def time_packers_read_excel(self): - self.bio.seek(0) - pd.read_excel(self.bio) - - -class packers_read_hdf_store(_Packers): - goal_time = 0.2 - - def setup(self): - self._setup() - self.df2.to_hdf(self.f, 'df') - - def time_packers_read_hdf_store(self): - pd.read_hdf(self.f, 'df') - - -class packers_read_hdf_table(_Packers): - - def setup(self): - self._setup() - self.df2.to_hdf(self.f, 'df', format='table') - - def time_packers_read_hdf_table(self): - pd.read_hdf(self.f, 'df') - - -class packers_read_json(_Packers): - - def setup(self): - self._setup() - self.df.to_json(self.f, orient='split') - self.df.index = np.arange(self.N) - - def time_packers_read_json(self): - pd.read_json(self.f, orient='split') - - -class packers_read_json_date_index(_Packers): - - def setup(self): - self._setup() - self.remove(self.f) - self.df.to_json(self.f, orient='split') - - def time_packers_read_json_date_index(self): - pd.read_json(self.f, orient='split') - - -class packers_read_pack(_Packers): - - def setup(self): - self._setup() - self.df2.to_msgpack(self.f) - - def time_packers_read_pack(self): - pd.read_msgpack(self.f) - - -class packers_read_pickle(_Packers): - - def setup(self): - self._setup() - self.df2.to_pickle(self.f) - - def time_packers_read_pickle(self): - pd.read_pickle(self.f) - -class packers_read_sql(_Packers): - - def setup(self): - self._setup() - self.engine = create_engine('sqlite:///:memory:') - self.df2.to_sql('table', self.engine, if_exists='replace') - - def time_packers_read_sql(self): - pd.read_sql_table('table', self.engine) - - -class packers_read_stata(_Packers): - - def setup(self): - self._setup() - self.df.to_stata(self.f, {'index': 'tc', }) - - def time_packers_read_stata(self): - pd.read_stata(self.f) - - -class packers_read_stata_with_validation(_Packers): - - def setup(self): - self._setup() - self.df['int8_'] = 
[randint(np.iinfo(np.int8).min, (np.iinfo(np.int8).max - 27)) for _ in range(self.N)] - self.df['int16_'] = [randint(np.iinfo(np.int16).min, (np.iinfo(np.int16).max - 27)) for _ in range(self.N)] - self.df['int32_'] = [randint(np.iinfo(np.int32).min, (np.iinfo(np.int32).max - 27)) for _ in range(self.N)] - self.df['float32_'] = np.array(randn(self.N), dtype=np.float32) - self.df.to_stata(self.f, {'index': 'tc', }) - - def time_packers_read_stata_with_validation(self): - pd.read_stata(self.f) - - -class packers_read_sas(_Packers): - - def setup(self): - self.f = os.path.join(os.path.dirname(__file__), '..', '..', - 'pandas', 'io', 'tests', 'sas', 'data', - 'test1.sas7bdat') - self.f2 = os.path.join(os.path.dirname(__file__), '..', '..', - 'pandas', 'io', 'tests', 'sas', 'data', - 'paxraw_d_short.xpt') - - def time_read_sas7bdat(self): - pd.read_sas(self.f, format='sas7bdat') - - def time_read_xport(self): - pd.read_sas(self.f, format='xport') - - -class CSV(_Packers): - - def setup(self): - self._setup() - - def time_write_csv(self): - self.df.to_csv(self.f) - - def teardown(self): - self.remove(self.f) - - -class Excel(_Packers): - - def setup(self): - self._setup() - self.bio = BytesIO() - - def time_write_excel_openpyxl(self): - self.bio.seek(0) - self.writer = pd.io.excel.ExcelWriter(self.bio, engine='openpyxl') - self.df[:2000].to_excel(self.writer) - self.writer.save() - - def time_write_excel_xlsxwriter(self): - self.bio.seek(0) - self.writer = pd.io.excel.ExcelWriter(self.bio, engine='xlsxwriter') - self.df[:2000].to_excel(self.writer) - self.writer.save() - - def time_write_excel_xlwt(self): - self.bio.seek(0) - self.writer = pd.io.excel.ExcelWriter(self.bio, engine='xlwt') - self.df[:2000].to_excel(self.writer) - self.writer.save() - - -class HDF(_Packers): - - def setup(self): - self._setup() - - def time_write_hdf_store(self): - self.df2.to_hdf(self.f, 'df') - - def time_write_hdf_table(self): - self.df2.to_hdf(self.f, 'df', table=True) - - def 
teardown(self): - self.remove(self.f) - -class JSON(_Packers): - - def setup(self): - self._setup() - self.df_date = self.df.copy() - self.df.index = np.arange(self.N) - self.cols = [(lambda i: ('{0}_timedelta'.format(i), [pd.Timedelta(('%d seconds' % randrange(1000000.0))) for _ in range(self.N)])), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N))), (lambda i: ('{0}_timestamp'.format(i), [pd.Timestamp((1418842918083256000 + randrange(1000000000.0, 1e+18, 200))) for _ in range(self.N)]))] - self.df_mixed = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) - - self.cols = [(lambda i: ('{0}_float'.format(i), randn(self.N))), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N)))] - self.df_mixed2 = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) - - self.cols = [(lambda i: ('{0}_float'.format(i), randn(self.N))), (lambda i: ('{0}_int'.format(i), randint(100000000.0, size=self.N))), (lambda i: ('{0}_str'.format(i), [('%08x' % randrange((16 ** 8))) for _ in range(self.N)]))] - self.df_mixed3 = DataFrame(OrderedDict([self.cols[(i % len(self.cols))](i) for i in range(self.C)]), index=self.index) - - def time_write_json(self): - self.df.to_json(self.f, orient='split') - - def time_write_json_T(self): - self.df.to_json(self.f, orient='columns') - - def time_write_json_date_index(self): - self.df_date.to_json(self.f, orient='split') - - def time_write_json_mixed_delta_int_tstamp(self): - self.df_mixed.to_json(self.f, orient='split') - - def time_write_json_mixed_float_int(self): - self.df_mixed2.to_json(self.f, orient='index') - - def time_write_json_mixed_float_int_T(self): - self.df_mixed2.to_json(self.f, orient='columns') - - def time_write_json_mixed_float_int_str(self): - self.df_mixed3.to_json(self.f, orient='split') - - def time_write_json_lines(self): - self.df.to_json(self.f, orient="records", lines=True) - - def teardown(self): 
-        self.remove(self.f)
-
-
-class MsgPack(_Packers):
-
-    def setup(self):
-        self._setup()
-
-    def time_write_msgpack(self):
-        self.df2.to_msgpack(self.f)
-
-    def teardown(self):
-        self.remove(self.f)
-
-
-class Pickle(_Packers):
-
-    def setup(self):
-        self._setup()
-
-    def time_write_pickle(self):
-        self.df2.to_pickle(self.f)
-
-    def teardown(self):
-        self.remove(self.f)
-
-
-class SQL(_Packers):
-
-    def setup(self):
-        self._setup()
-        self.engine = create_engine('sqlite:///:memory:')
-
-    def time_write_sql(self):
-        self.df2.to_sql('table', self.engine, if_exists='replace')
-
-
-class STATA(_Packers):
-
-    def setup(self):
-        self._setup()
-
-        self.df3=self.df.copy()
-        self.df3['int8_'] = [randint(np.iinfo(np.int8).min, (np.iinfo(np.int8).max - 27)) for _ in range(self.N)]
-        self.df3['int16_'] = [randint(np.iinfo(np.int16).min, (np.iinfo(np.int16).max - 27)) for _ in range(self.N)]
-        self.df3['int32_'] = [randint(np.iinfo(np.int32).min, (np.iinfo(np.int32).max - 27)) for _ in range(self.N)]
-        self.df3['float32_'] = np.array(randn(self.N), dtype=np.float32)
-
-    def time_write_stata(self):
-        self.df.to_stata(self.f, {'index': 'tc', })
-
-    def time_write_stata_with_validation(self):
-        self.df3.to_stata(self.f, {'index': 'tc', })
-
-    def teardown(self):
-        self.remove(self.f)
diff --git a/asv_bench/benchmarks/pandas_vb_common.py b/asv_bench/benchmarks/pandas_vb_common.py
index 56ccc94c414fb..d479952cbfbf6 100644
--- a/asv_bench/benchmarks/pandas_vb_common.py
+++ b/asv_bench/benchmarks/pandas_vb_common.py
@@ -1,37 +1,55 @@
-from pandas import *
-import pandas as pd
-from datetime import timedelta
-from numpy.random import randn
-from numpy.random import randint
-from numpy.random import permutation
-import pandas.util.testing as tm
-import random
-import numpy as np
-import threading
+import os
 from importlib import import_module
 
-try:
-    from pandas.compat import range
-except ImportError:
-    pass
-
-np.random.seed(1234)
+import numpy as np
+import pandas as pd
 
-# try em until it works!
-for imp in ['pandas_tseries', 'pandas.lib', 'pandas._libs.lib']:
+# Compatibility import for lib
+for imp in ['pandas._libs.lib', 'pandas.lib']:
     try:
         lib = import_module(imp)
         break
-    except:
+    except (ImportError, TypeError, ValueError):
         pass
 
+numeric_dtypes = [np.int64, np.int32, np.uint32, np.uint64, np.float32,
+                  np.float64, np.int16, np.int8, np.uint16, np.uint8]
+datetime_dtypes = [np.datetime64, np.timedelta64]
+string_dtypes = [np.object]
 try:
-    Panel = Panel
-except Exception:
-    Panel = WidePanel
+    extension_dtypes = [pd.Int8Dtype, pd.Int16Dtype,
+                        pd.Int32Dtype, pd.Int64Dtype,
+                        pd.UInt8Dtype, pd.UInt16Dtype,
+                        pd.UInt32Dtype, pd.UInt64Dtype,
+                        pd.CategoricalDtype,
+                        pd.IntervalDtype,
+                        pd.DatetimeTZDtype('ns', 'UTC'),
+                        pd.PeriodDtype('D')]
+except AttributeError:
+    extension_dtypes = []
 
-# didn't add to namespace until later
-try:
-    from pandas.core.index import MultiIndex
-except ImportError:
-    pass
+
+def setup(*args, **kwargs):
+    # This function just needs to be imported into each benchmark file to
+    # set up the random seed before each function.
+    # http://asv.readthedocs.io/en/latest/writing_benchmarks.html
+    np.random.seed(1234)
+
+
+class BaseIO(object):
+    """
+    Base class for IO benchmarks
+    """
+    fname = None
+
+    def remove(self, f):
+        """Remove created files"""
+        try:
+            os.remove(f)
+        except OSError:
+            # On Windows, attempting to remove a file that is in use
+            # causes an exception to be raised
+            pass
+
+    def teardown(self, *args, **kwargs):
+        self.remove(self.fname)
diff --git a/asv_bench/benchmarks/panel_ctor.py b/asv_bench/benchmarks/panel_ctor.py
index faedce6c574ec..627705284481b 100644
--- a/asv_bench/benchmarks/panel_ctor.py
+++ b/asv_bench/benchmarks/panel_ctor.py
@@ -1,64 +1,55 @@
-from .pandas_vb_common import *
+import warnings
+from datetime import datetime, timedelta
 
+from pandas import DataFrame, Panel, date_range
 
-class Constructors1(object):
-    goal_time = 0.2
 
+class DifferentIndexes(object):
     def setup(self):
         self.data_frames = {}
-        self.start = datetime(1990, 1, 1)
-        self.end = datetime(2012, 1, 1)
+        start = datetime(1990, 1, 1)
+        end = datetime(2012, 1, 1)
         for x in range(100):
-            self.end += timedelta(days=1)
-            self.dr = np.asarray(date_range(self.start, self.end))
-            self.df = DataFrame({'a': ([0] * len(self.dr)), 'b': ([1] * len(self.dr)), 'c': ([2] * len(self.dr)), }, index=self.dr)
-            self.data_frames[x] = self.df
+            end += timedelta(days=1)
+            idx = date_range(start, end)
+            df = DataFrame({'a': 0, 'b': 1, 'c': 2}, index=idx)
+            self.data_frames[x] = df
 
-    def time_panel_from_dict_all_different_indexes(self):
-        Panel.from_dict(self.data_frames)
+    def time_from_dict(self):
+        with warnings.catch_warnings(record=True):
+            Panel.from_dict(self.data_frames)
 
 
-class Constructors2(object):
-    goal_time = 0.2
+class SameIndexes(object):
 
     def setup(self):
-        self.data_frames = {}
-        for x in range(100):
-            self.dr = np.asarray(DatetimeIndex(start=datetime(1990, 1, 1), end=datetime(2012, 1, 1), freq=datetools.Day(1)))
-            self.df = DataFrame({'a': ([0] * len(self.dr)), 'b': ([1] * len(self.dr)), 'c': ([2] * len(self.dr)), }, index=self.dr)
-            self.data_frames[x] = self.df
-
-    def time_panel_from_dict_equiv_indexes(self):
-        Panel.from_dict(self.data_frames)
-
-
-class Constructors3(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.dr = np.asarray(DatetimeIndex(start=datetime(1990, 1, 1), end=datetime(2012, 1, 1), freq=datetools.Day(1)))
-        self.data_frames = {}
-        for x in range(100):
-            self.df = DataFrame({'a': ([0] * len(self.dr)), 'b': ([1] * len(self.dr)), 'c': ([2] * len(self.dr)), }, index=self.dr)
-            self.data_frames[x] = self.df
+        idx = date_range(start=datetime(1990, 1, 1),
+                         end=datetime(2012, 1, 1),
+                         freq='D')
+        df = DataFrame({'a': 0, 'b': 1, 'c': 2}, index=idx)
+        self.data_frames = dict(enumerate([df] * 100))
 
-    def time_panel_from_dict_same_index(self):
-        Panel.from_dict(self.data_frames)
+    def time_from_dict(self):
+        with warnings.catch_warnings(record=True):
+            Panel.from_dict(self.data_frames)
 
 
-class Constructors4(object):
-    goal_time = 0.2
+class TwoIndexes(object):
 
     def setup(self):
-        self.data_frames = {}
-        self.start = datetime(1990, 1, 1)
-        self.end = datetime(2012, 1, 1)
-        for x in range(100):
-            if (x == 50):
-                self.end += timedelta(days=1)
-            self.dr = np.asarray(date_range(self.start, self.end))
-            self.df = DataFrame({'a': ([0] * len(self.dr)), 'b': ([1] * len(self.dr)), 'c': ([2] * len(self.dr)), }, index=self.dr)
-            self.data_frames[x] = self.df
-
-    def time_panel_from_dict_two_different_indexes(self):
-        Panel.from_dict(self.data_frames)
+        start = datetime(1990, 1, 1)
+        end = datetime(2012, 1, 1)
+        df1 = DataFrame({'a': 0, 'b': 1, 'c': 2},
+                        index=date_range(start=start, end=end, freq='D'))
+        end += timedelta(days=1)
+        df2 = DataFrame({'a': 0, 'b': 1, 'c': 2},
+                        index=date_range(start=start, end=end, freq='D'))
+        dfs = [df1] * 50 + [df2] * 50
+        self.data_frames = dict(enumerate(dfs))
+
+    def time_from_dict(self):
+        with warnings.catch_warnings(record=True):
+            Panel.from_dict(self.data_frames)
+
+
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/panel_methods.py b/asv_bench/benchmarks/panel_methods.py
index 6609305502011..a4c12c082236e 100644
--- a/asv_bench/benchmarks/panel_methods.py
+++ b/asv_bench/benchmarks/panel_methods.py
@@ -1,24 +1,25 @@
-from .pandas_vb_common import *
+import warnings
+
+import numpy as np
+from pandas import Panel
 
 
 class PanelMethods(object):
-    goal_time = 0.2
 
-    def setup(self):
-        self.index = date_range(start='2000', freq='D', periods=1000)
-        self.panel = Panel(np.random.randn(100, len(self.index), 1000))
+    params = ['items', 'major', 'minor']
+    param_names = ['axis']
 
-    def time_pct_change_items(self):
-        self.panel.pct_change(1, axis='items')
+    def setup(self, axis):
+        with warnings.catch_warnings(record=True):
+            self.panel = Panel(np.random.randn(100, 1000, 100))
 
-    def time_pct_change_major(self):
-        self.panel.pct_change(1, axis='major')
+    def time_pct_change(self, axis):
+        with warnings.catch_warnings(record=True):
+            self.panel.pct_change(1, axis=axis)
 
-    def time_pct_change_minor(self):
-        self.panel.pct_change(1, axis='minor')
+    def time_shift(self, axis):
+        with warnings.catch_warnings(record=True):
+            self.panel.shift(1, axis=axis)
 
-    def time_shift(self):
-        self.panel.shift(1)
 
-    def time_shift_minor(self):
-        self.panel.shift(1, axis='minor')
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/parser_vb.py b/asv_bench/benchmarks/parser_vb.py
deleted file mode 100644
index 32bf7e50d1a89..0000000000000
--- a/asv_bench/benchmarks/parser_vb.py
+++ /dev/null
@@ -1,121 +0,0 @@
-from .pandas_vb_common import *
-import os
-from pandas import read_csv
-try:
-    from cStringIO import StringIO
-except ImportError:
-    from io import StringIO
-
-
-class read_csv1(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 10000
-        self.K = 8
-        self.df = DataFrame((np.random.randn(self.N, self.K) * np.random.randint(100, 10000, (self.N, self.K))))
-        self.df.to_csv('test.csv', sep='|')
-
-        self.format = (lambda x: '{:,}'.format(x))
-        self.df2 = self.df.applymap(self.format)
-        self.df2.to_csv('test2.csv', sep='|')
-
-    def time_sep(self):
-        read_csv('test.csv', sep='|')
-
-    def time_thousands(self):
-        read_csv('test.csv', sep='|', thousands=',')
-
-    def teardown(self):
-        os.remove('test.csv')
-        os.remove('test2.csv')
-
-
-class read_csv2(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = ['A,B,C']
-        self.data = (self.data + (['1,2,3 # comment'] * 100000))
-        self.data = '\n'.join(self.data)
-
-    def time_comment(self):
-        read_csv(StringIO(self.data), comment='#')
-
-
-class read_csv3(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.data = """0.1213700904466425978256438611,0.0525708283766902484401839501,0.4174092731488769913994474336\n
-0.4096341697147408700274695547,0.1587830198973579909349496119,0.1292545832485494372576795285\n
-0.8323255650024565799327547210,0.9694902427379478160318626578,0.6295047811546814475747169126\n
-0.4679375305798131323697930383,0.2963942381834381301075609371,0.5268936082160610157032465394\n
-0.6685382761849776311890991564,0.6721207066140679753374342908,0.6519975277021627935170045020\n"""
-        self.data2 = self.data.replace(',', ';').replace('.', ',')
-        self.data = (self.data * 200)
-        self.data2 = (self.data2 * 200)
-
-    def time_default_converter(self):
-        read_csv(StringIO(self.data), sep=',', header=None,
-                 float_precision=None)
-
-    def time_default_converter_with_decimal(self):
-        read_csv(StringIO(self.data2), sep=';', header=None,
-                 float_precision=None, decimal=',')
-
-    def time_default_converter_python_engine(self):
-        read_csv(StringIO(self.data), sep=',', header=None,
-                 float_precision=None, engine='python')
-
-    def time_default_converter_with_decimal_python_engine(self):
-        read_csv(StringIO(self.data2), sep=';', header=None,
-                 float_precision=None, decimal=',', engine='python')
-
-    def time_precise_converter(self):
-        read_csv(StringIO(self.data), sep=',', header=None,
-                 float_precision='high')
-
-    def time_roundtrip_converter(self):
-        read_csv(StringIO(self.data), sep=',', header=None,
-                 float_precision='round_trip')
-
-
-class read_csv_categorical(object):
-    goal_time = 0.2
-
-    def setup(self):
-        N = 100000
-        group1 = ['aaaaaaaa', 'bbbbbbb', 'cccccccc', 'dddddddd', 'eeeeeeee']
-        df = DataFrame({'a': np.random.choice(group1, N).astype('object'),
-                        'b': np.random.choice(group1, N).astype('object'),
-                        'c': np.random.choice(group1, N).astype('object')})
-        df.to_csv('strings.csv', index=False)
-
-    def time_convert_post(self):
-        read_csv('strings.csv').apply(pd.Categorical)
-
-    def time_convert_direct(self):
-        read_csv('strings.csv', dtype='category')
-
-    def teardown(self):
-        os.remove('strings.csv')
-
-
-class read_csv_dateparsing(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.N = 10000
-        self.K = 8
-        self.data = 'KORD,19990127, 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000\n KORD,19990127, 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000\n KORD,19990127, 21:00:00, 20:56:00, -0.5900, 2.2100, 5.7000, 0.0000, 280.0000\n KORD,19990127, 21:00:00, 21:18:00, -0.9900, 2.0100, 3.6000, 0.0000, 270.0000\n KORD,19990127, 22:00:00, 21:56:00, -0.5900, 1.7100, 5.1000, 0.0000, 290.0000\n '
-        self.data = (self.data * 200)
-        self.data2 = 'KORD,19990127 19:00:00, 18:56:00, 0.8100, 2.8100, 7.2000, 0.0000, 280.0000\n KORD,19990127 20:00:00, 19:56:00, 0.0100, 2.2100, 7.2000, 0.0000, 260.0000\n KORD,19990127 21:00:00, 20:56:00, -0.5900, 2.2100, 5.7000, 0.0000, 280.0000\n KORD,19990127 21:00:00, 21:18:00, -0.9900, 2.0100, 3.6000, 0.0000, 270.0000\n KORD,19990127 22:00:00, 21:56:00, -0.5900, 1.7100, 5.1000, 0.0000, 290.0000\n '
-        self.data2 = (self.data2 * 200)
-
-    def time_multiple_date(self):
-        read_csv(StringIO(self.data), sep=',', header=None,
-                 parse_dates=[[1, 2], [1, 3]])
-
-    def time_baseline(self):
-        read_csv(StringIO(self.data2), sep=',', header=None, parse_dates=[1])
diff --git a/asv_bench/benchmarks/period.py b/asv_bench/benchmarks/period.py
index f9837191a7bae..6d2c7156a0a3d 100644
--- a/asv_bench/benchmarks/period.py
+++ b/asv_bench/benchmarks/period.py
@@ -1,59 +1,124 @@
-import pandas as pd
-from pandas import Series, Period, PeriodIndex, date_range
+from pandas import (
+    DataFrame, Period, PeriodIndex, Series, date_range, period_range)
+from pandas.tseries.frequencies import to_offset
 
 
-class Constructor(object):
-    goal_time = 0.2
+class PeriodProperties(object):
 
-    def setup(self):
+    params = (['M', 'min'],
+              ['year', 'month', 'day', 'hour', 'minute', 'second',
+               'is_leap_year', 'quarter', 'qyear', 'week', 'daysinmonth',
+               'dayofweek', 'dayofyear', 'start_time', 'end_time'])
+    param_names = ['freq', 'attr']
+
+    def setup(self, freq, attr):
+        self.per = Period('2012-06-01', freq=freq)
+
+    def time_property(self, freq, attr):
+        getattr(self.per, attr)
+
+
+class PeriodUnaryMethods(object):
+
+    params = ['M', 'min']
+    param_names = ['freq']
+
+    def setup(self, freq):
+        self.per = Period('2012-06-01', freq=freq)
+
+    def time_to_timestamp(self, freq):
+        self.per.to_timestamp()
+
+    def time_now(self, freq):
+        self.per.now(freq)
+
+    def time_asfreq(self, freq):
+        self.per.asfreq('A')
+
+
+class PeriodConstructor(object):
+    params = [['D'], [True, False]]
+    param_names = ['freq', 'is_offset']
+
+    def setup(self, freq, is_offset):
+        if is_offset:
+            self.freq = to_offset(freq)
+        else:
+            self.freq = freq
+
+    def time_period_constructor(self, freq, is_offset):
+        Period('2012-06-01', freq=freq)
+
+
+class PeriodIndexConstructor(object):
+
+    params = [['D'], [True, False]]
+    param_names = ['freq', 'is_offset']
+
+    def setup(self, freq, is_offset):
         self.rng = date_range('1985', periods=1000)
         self.rng2 = date_range('1985', periods=1000).to_pydatetime()
+        self.ints = list(range(2000, 3000))
+        self.daily_ints = date_range('1/1/2000', periods=1000,
+                                     freq=freq).strftime('%Y%m%d').map(int)
+        if is_offset:
+            self.freq = to_offset(freq)
+        else:
+            self.freq = freq
+
+    def time_from_date_range(self, freq, is_offset):
+        PeriodIndex(self.rng, freq=freq)
 
-    def time_from_date_range(self):
-        PeriodIndex(self.rng, freq='D')
+    def time_from_pydatetime(self, freq, is_offset):
+        PeriodIndex(self.rng2, freq=freq)
 
-    def time_from_pydatetime(self):
-        PeriodIndex(self.rng2, freq='D')
+    def time_from_ints(self, freq, is_offset):
+        PeriodIndex(self.ints, freq=freq)
 
+    def time_from_ints_daily(self, freq, is_offset):
+        PeriodIndex(self.daily_ints, freq=freq)
 
-class DataFrame(object):
-    goal_time = 0.2
+
+class DataFramePeriodColumn(object):
 
     def setup(self):
-        self.rng = pd.period_range(start='1/1/1990', freq='S', periods=20000)
-        self.df = pd.DataFrame(index=range(len(self.rng)))
+        self.rng = period_range(start='1/1/1990', freq='S', periods=20000)
+        self.df = DataFrame(index=range(len(self.rng)))
 
     def time_setitem_period_column(self):
         self.df['col'] = self.rng
 
+    def time_set_index(self):
+        # GH#21582 limited by comparisons of Period objects
+        self.df['col2'] = self.rng
+        self.df.set_index('col2', append=True)
+
 
 class Algorithms(object):
-    goal_time = 0.2
 
-    def setup(self):
+    params = ['index', 'series']
+    param_names = ['typ']
+
+    def setup(self, typ):
         data = [Period('2011-01', freq='M'), Period('2011-02', freq='M'),
                 Period('2011-03', freq='M'), Period('2011-04', freq='M')]
-        self.s = Series(data * 1000)
-        self.i = PeriodIndex(data, freq='M')
-
-    def time_drop_duplicates_pseries(self):
-        self.s.drop_duplicates()
 
-    def time_drop_duplicates_pindex(self):
-        self.i.drop_duplicates()
+        if typ == 'index':
+            self.vector = PeriodIndex(data * 1000, freq='M')
+        elif typ == 'series':
+            self.vector = Series(data * 1000)
 
-    def time_value_counts_pseries(self):
-        self.s.value_counts()
+    def time_drop_duplicates(self, typ):
+        self.vector.drop_duplicates()
 
-    def time_value_counts_pindex(self):
-        self.i.value_counts()
+    def time_value_counts(self, typ):
+        self.vector.value_counts()
 
 
-class period_standard_indexing(object):
-    goal_time = 0.2
+class Indexing(object):
 
     def setup(self):
-        self.index = PeriodIndex(start='1985', periods=1000, freq='D')
+        self.index = period_range(start='1985', periods=1000, freq='D')
         self.series = Series(range(1000), index=self.index)
         self.period = self.index[500]
 
@@ -70,7 +135,10 @@ def time_series_loc(self):
         self.series.loc[self.period]
 
     def time_align(self):
-        pd.DataFrame({'a': self.series, 'b': self.series[:500]})
+        DataFrame({'a': self.series, 'b': self.series[:500]})
 
     def time_intersection(self):
         self.index[:750].intersection(self.index[250:])
+
+    def time_unique(self):
+        self.index.unique()
diff --git a/asv_bench/benchmarks/plotting.py b/asv_bench/benchmarks/plotting.py
index 757c3e27dd333..8a67af0bdabd1 100644
--- a/asv_bench/benchmarks/plotting.py
+++ b/asv_bench/benchmarks/plotting.py
@@ -1,21 +1,69 @@
-from .pandas_vb_common import *
+import numpy as np
+from pandas import DataFrame, Series, DatetimeIndex, date_range
 try:
-    from pandas import date_range
+    from pandas.plotting import andrews_curves
 except ImportError:
-    def date_range(start=None, end=None, periods=None, freq=None):
-        return DatetimeIndex(start, end, periods=periods, offset=freq)
-from pandas.tools.plotting import andrews_curves
+    from pandas.tools.plotting import andrews_curves
+import matplotlib
+matplotlib.use('Agg')
+
+
+class SeriesPlotting(object):
+    params = [['line', 'bar', 'area', 'barh', 'hist', 'kde', 'pie']]
+    param_names = ['kind']
+
+    def setup(self, kind):
+        if kind in ['bar', 'barh', 'pie']:
+            n = 100
+        elif kind in ['kde']:
+            n = 10000
+        else:
+            n = 1000000
+
+        self.s = Series(np.random.randn(n))
+        if kind in ['area', 'pie']:
+            self.s = self.s.abs()
+
+    def time_series_plot(self, kind):
+        self.s.plot(kind=kind)
+
+
+class FramePlotting(object):
+    params = [['line', 'bar', 'area', 'barh', 'hist', 'kde', 'pie', 'scatter',
+               'hexbin']]
+    param_names = ['kind']
+
+    def setup(self, kind):
+        if kind in ['bar', 'barh', 'pie']:
+            n = 100
+        elif kind in ['kde', 'scatter', 'hexbin']:
+            n = 10000
+        else:
+            n = 1000000
+
+        self.x = Series(np.random.randn(n))
+        self.y = Series(np.random.randn(n))
+        if kind in ['area', 'pie']:
+            self.x = self.x.abs()
+            self.y = self.y.abs()
+        self.df = DataFrame({'x': self.x, 'y': self.y})
+
+    def time_frame_plot(self, kind):
+        self.df.plot(x='x', y='y', kind=kind)
 
 
 class TimeseriesPlotting(object):
-    goal_time = 0.2
 
     def setup(self):
-        import matplotlib
-        matplotlib.use('Agg')
-        self.N = 2000
-        self.M = 5
-        self.df = DataFrame(np.random.randn(self.N, self.M), index=date_range('1/1/1975', periods=self.N))
+        N = 2000
+        M = 5
+        idx = date_range('1/1/1975', periods=N)
+        self.df = DataFrame(np.random.randn(N, M), index=idx)
+
+        idx_irregular = DatetimeIndex(np.concatenate((idx.values[0:10],
+                                                      idx.values[12:])))
+        self.df2 = DataFrame(np.random.randn(len(idx_irregular), M),
+                             index=idx_irregular)
 
     def time_plot_regular(self):
         self.df.plot()
@@ -23,18 +71,23 @@ def time_plot_regular(self):
     def time_plot_regular_compat(self):
         self.df.plot(x_compat=True)
 
+    def time_plot_irregular(self):
+        self.df2.plot()
+
+    def time_plot_table(self):
+        self.df.plot(table=True)
+
 
 class Misc(object):
-    goal_time = 0.6
 
     def setup(self):
-        import matplotlib
-        matplotlib.use('Agg')
-        self.N = 500
-        self.M = 10
-        data_dict = {x: np.random.randn(self.N) for x in range(self.M)}
-        data_dict["Name"] = ["A"] * self.N
-        self.df = DataFrame(data_dict)
+        N = 500
+        M = 10
+        self.df = DataFrame(np.random.randn(N, M))
+        self.df['Name'] = ["A"] * N
 
     def time_plot_andrews_curves(self):
         andrews_curves(self.df, "Name")
+
+
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/reindex.py b/asv_bench/benchmarks/reindex.py
index 537d275e7c727..3080b34024a33 100644
--- a/asv_bench/benchmarks/reindex.py
+++ b/asv_bench/benchmarks/reindex.py
@@ -1,98 +1,79 @@
-from .pandas_vb_common import *
-from random import shuffle
+import numpy as np
+import pandas.util.testing as tm
+from pandas import (DataFrame, Series, MultiIndex, Index, date_range,
+                    period_range)
+from .pandas_vb_common import lib
 
 
-class Reindexing(object):
-    goal_time = 0.2
+class Reindex(object):
 
     def setup(self):
-        self.rng = DatetimeIndex(start='1/1/1970', periods=10000, freq='1min')
-        self.df = DataFrame(np.random.rand(10000, 10), index=self.rng,
+        rng = date_range(start='1/1/1970', periods=10000, freq='1min')
+        self.df = DataFrame(np.random.rand(10000, 10), index=rng,
                             columns=range(10))
         self.df['foo'] = 'bar'
-        self.rng2 = Index(self.rng[::2])
-
+        self.rng_subset = Index(rng[::2])
         self.df2 = DataFrame(index=range(10000),
                              data=np.random.rand(10000, 30), columns=range(30))
-
-        # multi-index
         N = 5000
         K = 200
         level1 = tm.makeStringIndex(N).values.repeat(K)
         level2 = np.tile(tm.makeStringIndex(K).values, N)
         index = MultiIndex.from_arrays([level1, level2])
-        self.s1 = Series(np.random.randn((N * K)), index=index)
-        self.s2 = self.s1[::2]
+        self.s = Series(np.random.randn(N * K), index=index)
+        self.s_subset = self.s[::2]
 
     def time_reindex_dates(self):
-        self.df.reindex(self.rng2)
+        self.df.reindex(self.rng_subset)
 
     def time_reindex_columns(self):
         self.df2.reindex(columns=self.df.columns[1:5])
 
     def time_reindex_multiindex(self):
-        self.s1.reindex(self.s2.index)
-
+        self.s.reindex(self.s_subset.index)
 
-#----------------------------------------------------------------------
-# Pad / backfill
 
+class ReindexMethod(object):
 
-class FillMethod(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.rng = date_range('1/1/2000', periods=100000, freq='1min')
-        self.ts = Series(np.random.randn(len(self.rng)), index=self.rng)
-        self.ts2 = self.ts[::2]
-        self.ts3 = self.ts2.reindex(self.ts.index)
-        self.ts4 = self.ts3.astype('float32')
+    params = [['pad', 'backfill'], [date_range, period_range]]
+    param_names = ['method', 'constructor']
 
-    def pad(self, source_series, target_index):
-        try:
-            source_series.reindex(target_index, method='pad')
-        except:
-            source_series.reindex(target_index, fillMethod='pad')
+    def setup(self, method, constructor):
+        N = 100000
+        self.idx = constructor('1/1/2000', periods=N, freq='1min')
+        self.ts = Series(np.random.randn(N), index=self.idx)[::2]
 
-    def backfill(self, source_series, target_index):
-        try:
-            source_series.reindex(target_index, method='backfill')
-        except:
-            source_series.reindex(target_index, fillMethod='backfill')
+    def time_reindex_method(self, method, constructor):
+        self.ts.reindex(self.idx, method=method)
 
-    def time_backfill_dates(self):
-        self.backfill(self.ts2, self.ts.index)
 
-    def time_pad_daterange(self):
-        self.pad(self.ts2, self.ts.index)
+class Fillna(object):
 
-    def time_backfill(self):
-        self.ts3.fillna(method='backfill')
+    params = ['pad', 'backfill']
+    param_names = ['method']
 
-    def time_backfill_float32(self):
-        self.ts4.fillna(method='backfill')
+    def setup(self, method):
+        N = 100000
+        self.idx = date_range('1/1/2000', periods=N, freq='1min')
+        ts = Series(np.random.randn(N), index=self.idx)[::2]
+        self.ts_reindexed = ts.reindex(self.idx)
+        self.ts_float32 = self.ts_reindexed.astype('float32')
 
-    def time_pad(self):
-        self.ts3.fillna(method='pad')
+    def time_reindexed(self, method):
+        self.ts_reindexed.fillna(method=method)
 
-    def time_pad_float32(self):
-        self.ts4.fillna(method='pad')
-
-
-#----------------------------------------------------------------------
-# align on level
+    def time_float_32(self, method):
+        self.ts_float32.fillna(method=method)
 
 
 class LevelAlign(object):
-    goal_time = 0.2
 
     def setup(self):
         self.index = MultiIndex(
             levels=[np.arange(10), np.arange(100), np.arange(100)],
-            labels=[np.arange(10).repeat(10000),
-                    np.tile(np.arange(100).repeat(100), 10),
-                    np.tile(np.tile(np.arange(100), 100), 10)])
-        random.shuffle(self.index.values)
+            codes=[np.arange(10).repeat(10000),
+                   np.tile(np.arange(100).repeat(100), 10),
+                   np.tile(np.tile(np.arange(100), 100), 10)])
         self.df = DataFrame(np.random.randn(len(self.index), 4),
                             index=self.index)
         self.df_level = DataFrame(np.random.randn(100, 4),
@@ -102,106 +83,82 @@ def time_align_level(self):
         self.df.align(self.df_level, level=1, copy=False)
 
     def time_reindex_level(self):
-        self.df_level.reindex(self.df.index, level=1)
-
-
-#----------------------------------------------------------------------
-# drop_duplicates
+        self.df_level.reindex(self.index, level=1)
 
 
-class Duplicates(object):
-    goal_time = 0.2
+class DropDuplicates(object):
 
-    def setup(self):
-        self.N = 10000
-        self.K = 10
-        self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K)
-        self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K)
-        self.df = DataFrame({'key1': self.key1, 'key2': self.key2,
-                             'value': np.random.randn((self.N * self.K)),})
-        self.col_array_list = list(self.df.values.T)
+    params = [True, False]
+    param_names = ['inplace']
 
-        self.df2 = self.df.copy()
-        self.df2.ix[:10000, :] = np.nan
+    def setup(self, inplace):
+        N = 10000
+        K = 10
+        key1 = tm.makeStringIndex(N).values.repeat(K)
+        key2 = tm.makeStringIndex(N).values.repeat(K)
+        self.df = DataFrame({'key1': key1, 'key2': key2,
+                             'value': np.random.randn(N * K)})
+        self.df_nan = self.df.copy()
+        self.df_nan.iloc[:10000, :] = np.nan
 
         self.s = Series(np.random.randint(0, 1000, size=10000))
-        self.s2 = Series(np.tile(tm.makeStringIndex(1000).values, 10))
-
-        np.random.seed(1234)
-        self.N = 1000000
-        self.K = 10000
-        self.key1 = np.random.randint(0, self.K, size=self.N)
-        self.df_int = DataFrame({'key1': self.key1})
-        self.df_bool = DataFrame({i: np.random.randint(0, 2, size=self.K,
-                                                       dtype=bool)
-                                  for i in range(10)})
-
-    def time_frame_drop_dups(self):
-        self.df.drop_duplicates(['key1', 'key2'])
+        self.s_str = Series(np.tile(tm.makeStringIndex(1000).values, 10))
 
-    def time_frame_drop_dups_inplace(self):
-        self.df.drop_duplicates(['key1', 'key2'], inplace=True)
+        N = 1000000
+        K = 10000
+        key1 = np.random.randint(0, K, size=N)
+        self.df_int = DataFrame({'key1': key1})
+        self.df_bool = DataFrame(np.random.randint(0, 2, size=(K, 10),
+                                                   dtype=bool))
 
-    def time_frame_drop_dups_na(self):
-        self.df2.drop_duplicates(['key1', 'key2'])
+    def time_frame_drop_dups(self, inplace):
+        self.df.drop_duplicates(['key1', 'key2'], inplace=inplace)
 
-    def time_frame_drop_dups_na_inplace(self):
-        self.df2.drop_duplicates(['key1', 'key2'], inplace=True)
+    def time_frame_drop_dups_na(self, inplace):
+        self.df_nan.drop_duplicates(['key1', 'key2'], inplace=inplace)
 
-    def time_series_drop_dups_int(self):
-        self.s.drop_duplicates()
+    def time_series_drop_dups_int(self, inplace):
+        self.s.drop_duplicates(inplace=inplace)
 
-    def time_series_drop_dups_string(self):
-        self.s2.drop_duplicates()
+    def time_series_drop_dups_string(self, inplace):
+        self.s_str.drop_duplicates(inplace=inplace)
 
-    def time_frame_drop_dups_int(self):
-        self.df_int.drop_duplicates()
+    def time_frame_drop_dups_int(self, inplace):
+        self.df_int.drop_duplicates(inplace=inplace)
 
-    def time_frame_drop_dups_bool(self):
-        self.df_bool.drop_duplicates()
-
-#----------------------------------------------------------------------
-# blog "pandas escaped the zoo"
+    def time_frame_drop_dups_bool(self, inplace):
+        self.df_bool.drop_duplicates(inplace=inplace)
 
 
 class Align(object):
-    goal_time = 0.2
-
+    # blog "pandas escaped the zoo"
     def setup(self):
         n = 50000
         indices = tm.makeStringIndex(n)
         subsample_size = 40000
-
-        def sample(values, k):
-            sampler = np.arange(len(values))
-            shuffle(sampler)
-            return values.take(sampler[:k])
-
-        self.x = Series(np.random.randn(50000), indices)
+        self.x = Series(np.random.randn(n), indices)
         self.y = Series(np.random.randn(subsample_size),
-                        index=sample(indices, subsample_size))
+                        index=np.random.choice(indices, subsample_size,
+                                               replace=False))
 
     def time_align_series_irregular_string(self):
-        (self.x + self.y)
+        self.x + self.y
 
 
 class LibFastZip(object):
-    goal_time = 0.2
 
     def setup(self):
-        self.N = 10000
-        self.K = 10
-        self.key1 = tm.makeStringIndex(self.N).values.repeat(self.K)
-        self.key2 = tm.makeStringIndex(self.N).values.repeat(self.K)
-        self.df = DataFrame({'key1': self.key1, 'key2': self.key2, 'value': np.random.randn((self.N * self.K)), })
-        self.col_array_list = list(self.df.values.T)
-
-        self.df2 = self.df.copy()
-        self.df2.ix[:10000, :] = np.nan
-        self.col_array_list2 = list(self.df2.values.T)
+        N = 10000
+        K = 10
+        key1 = tm.makeStringIndex(N).values.repeat(K)
+        key2 = tm.makeStringIndex(N).values.repeat(K)
+        col_array = np.vstack([key1, key2, np.random.randn(N * K)])
+        col_array2 = col_array.copy()
+        col_array2[:, :10000] = np.nan
+        self.col_array_list = list(col_array)
 
     def time_lib_fast_zip(self):
         lib.fast_zip(self.col_array_list)
 
-    def time_lib_fast_zip_fillna(self):
-        lib.fast_zip_fillna(self.col_array_list2)
+
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/replace.py b/asv_bench/benchmarks/replace.py
index 66b8af53801ac..d8efaf99e2c4d 100644
--- a/asv_bench/benchmarks/replace.py
+++ b/asv_bench/benchmarks/replace.py
@@ -1,72 +1,56 @@
-from .pandas_vb_common import *
-from pandas.compat import range
-from datetime import timedelta
+import numpy as np
+import pandas as pd
 
 
-class replace_fillna(object):
-    goal_time = 0.2
+class FillNa(object):
 
-    def setup(self):
-        self.N = 1000000
-        try:
-            self.rng = date_range('1/1/2000', periods=self.N, freq='min')
-        except NameError:
-            self.rng = DatetimeIndex('1/1/2000', periods=self.N, offset=datetools.Minute())
-            self.date_range = DateRange
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
+    params = [True, False]
+    param_names = ['inplace']
 
-    def time_replace_fillna(self):
-        self.ts.fillna(0.0, inplace=True)
+    def setup(self, inplace):
+        N = 10**6
+        rng = pd.date_range('1/1/2000', periods=N, freq='min')
+        data = np.random.randn(N)
+        data[::2] = np.nan
+        self.ts = pd.Series(data, index=rng)
 
+    def time_fillna(self, inplace):
+        self.ts.fillna(0.0, inplace=inplace)
 
-class replace_large_dict(object):
-    goal_time = 0.2
+    def time_replace(self, inplace):
+        self.ts.replace(np.nan, 0.0, inplace=inplace)
 
-    def setup(self):
-        self.n = (10 ** 6)
-        self.start_value = (10 ** 5)
-        self.to_rep = dict(((i, (self.start_value + i)) for i in range(self.n)))
-        self.s = Series(np.random.randint(self.n, size=(10 ** 3)))
 
-    def time_replace_large_dict(self):
-        self.s.replace(self.to_rep, inplace=True)
+class ReplaceDict(object):
 
+    params = [True, False]
+    param_names = ['inplace']
 
-class replace_convert(object):
-    goal_time = 0.5
+    def setup(self, inplace):
+        N = 10**5
+        start_value = 10**5
+        self.to_rep = dict(enumerate(np.arange(N) + start_value))
+        self.s = pd.Series(np.random.randint(N, size=10**3))
 
-    def setup(self):
-        self.n = (10 ** 3)
-        self.to_ts = dict(((i, pd.Timestamp(i)) for i in range(self.n)))
-        self.to_td = dict(((i, pd.Timedelta(i)) for i in range(self.n)))
-        self.s = Series(np.random.randint(self.n, size=(10 ** 3)))
-        self.df = DataFrame({'A': np.random.randint(self.n, size=(10 ** 3)),
-                             'B': np.random.randint(self.n, size=(10 ** 3))})
+    def time_replace_series(self, inplace):
+        self.s.replace(self.to_rep, inplace=inplace)
 
-    def time_replace_series_timestamp(self):
-        self.s.replace(self.to_ts)
 
-    def time_replace_series_timedelta(self):
-        self.s.replace(self.to_td)
+class Convert(object):
 
-    def time_replace_frame_timestamp(self):
-        self.df.replace(self.to_ts)
+    params = (['DataFrame', 'Series'], ['Timestamp', 'Timedelta'])
+    param_names = ['constructor', 'replace_data']
 
-    def time_replace_frame_timedelta(self):
-        self.df.replace(self.to_td)
+    def setup(self, constructor, replace_data):
+        N = 10**3
+        data = {'Series': pd.Series(np.random.randint(N, size=N)),
+                'DataFrame': pd.DataFrame({'A': np.random.randint(N, size=N),
+                                           'B': np.random.randint(N, size=N)})}
+        self.to_replace = {i: getattr(pd, replace_data) for i in range(N)}
+        self.data = data[constructor]
 
+    def time_replace(self, constructor, replace_data):
+        self.data.replace(self.to_replace)
 
-class replace_replacena(object):
-    goal_time = 0.2
 
-    def setup(self):
-        self.N = 1000000
-        try:
-            self.rng = date_range('1/1/2000', periods=self.N, freq='min')
-        except NameError:
-            self.rng = DatetimeIndex('1/1/2000', periods=self.N, offset=datetools.Minute())
-            self.date_range = DateRange
-        self.ts = Series(np.random.randn(self.N), index=self.rng)
-
-    def time_replace_replacena(self):
-        self.ts.replace(np.nan, 0.0, inplace=True)
+from .pandas_vb_common import setup  # noqa: F401
diff --git a/asv_bench/benchmarks/reshape.py b/asv_bench/benchmarks/reshape.py
index b9346c497b9ef..f6ee107ab618e 100644
--- a/asv_bench/benchmarks/reshape.py
+++ b/asv_bench/benchmarks/reshape.py
@@ -1,13 +1,14 @@
-from .pandas_vb_common import *
-from pandas.core.reshape import melt, wide_to_long
+import string
+from itertools import product
 
+import numpy as np
+from pandas import DataFrame, MultiIndex, date_range, melt, wide_to_long
+import pandas as pd
 
-class melt_dataframe(object):
-    goal_time = 0.2
+
+class Melt(object):
 
     def setup(self):
-        self.index = MultiIndex.from_arrays([np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)])
-        self.df = DataFrame(np.random.randn(10000, 4), index=self.index)
         self.df = DataFrame(np.random.randn(10000, 3), columns=['A', 'B', 'C'])
         self.df['id1'] = np.random.randint(0, 10, 10000)
         self.df['id2'] = np.random.randint(100, 1000, 10000)
@@ -16,104 +17,203 @@ def time_melt_dataframe(self):
         melt(self.df, id_vars=['id1', 'id2'])
 
 
-class reshape_pivot_time_series(object):
-    goal_time = 0.2
+class Pivot(object):
 
     def setup(self):
-        self.index = MultiIndex.from_arrays([np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)])
-        self.df = DataFrame(np.random.randn(10000, 4), index=self.index)
-        self.index = date_range('1/1/2000', periods=10000, freq='h')
-        self.df = DataFrame(randn(10000, 50), index=self.index, columns=range(50))
-        self.pdf = self.unpivot(self.df)
-        self.f = (lambda : self.pdf.pivot('date', 'variable', 'value'))
+        N = 10000
+        index = date_range('1/1/2000', periods=N, freq='h')
+        data = {'value': np.random.randn(N * 50),
+                'variable': np.arange(50).repeat(N),
+                'date': np.tile(index.values, 50)}
+        self.df = DataFrame(data)
 
     def time_reshape_pivot_time_series(self):
-        self.f()
-
-    def unpivot(self, frame):
-        (N, K) = frame.shape
-        self.data = {'value': frame.values.ravel('F'), 'variable': np.asarray(frame.columns).repeat(N), 'date': np.tile(np.asarray(frame.index), K), }
-        return DataFrame(self.data, columns=['date', 'variable', 'value'])
+        self.df.pivot('date', 'variable', 'value')
 
 
-class reshape_stack_simple(object):
-    goal_time = 0.2
+class SimpleReshape(object):
 
     def setup(self):
-        self.index = MultiIndex.from_arrays([np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)])
-        self.df = DataFrame(np.random.randn(10000, 4), index=self.index)
+        arrays = [np.arange(100).repeat(100),
+                  np.roll(np.tile(np.arange(100), 100), 25)]
+        index = MultiIndex.from_arrays(arrays)
+        self.df = DataFrame(np.random.randn(10000, 4), index=index)
         self.udf = self.df.unstack(1)
 
-    def time_reshape_stack_simple(self):
+    def time_stack(self):
         self.udf.stack()
 
-
-class reshape_unstack_simple(object):
-    goal_time = 0.2
-
-    def setup(self):
-        self.index = MultiIndex.from_arrays([np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)])
-        self.df = DataFrame(np.random.randn(10000, 4), index=self.index)
-
-    def time_reshape_unstack_simple(self):
+    def time_unstack(self):
         self.df.unstack(1)
 
 
-class reshape_unstack_large_single_dtype(object):
-    goal_time = 0.2
+class Unstack(object):
 
-    def setup(self):
+    params = ['int', 'category']
+
+    def setup(self, dtype):
         m = 100
         n = 1000
 
         levels = np.arange(m)
-        index = pd.MultiIndex.from_product([levels]*2)
+        index = MultiIndex.from_product([levels] * 2)
         columns = np.arange(n)
-        values = np.arange(m*m*n).reshape(m*m, n)
-        self.df = pd.DataFrame(values, index, columns)
+        if dtype == 'int':
+            values = np.arange(m * m * n).reshape(m * m, n)
+        else:
+            # the category branch is ~20x slower than int. So we
+            # cut down the size a bit. Now it's only ~3x slower.
+ n = 50 + columns = columns[:n] + indices = np.random.randint(0, 52, size=(m * m, n)) + values = np.take(list(string.ascii_letters), indices) + values = [pd.Categorical(v) for v in values.T] + + self.df = DataFrame(values, index, columns) self.df2 = self.df.iloc[:-1] - def time_unstack_full_product(self): + def time_full_product(self, dtype): self.df.unstack() - def time_unstack_with_mask(self): + def time_without_last_row(self, dtype): self.df2.unstack() -class unstack_sparse_keyspace(object): - goal_time = 0.2 +class SparseIndex(object): def setup(self): - self.index = MultiIndex.from_arrays([np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)]) - self.df = DataFrame(np.random.randn(10000, 4), index=self.index) - self.NUM_ROWS = 1000 - for iter in range(10): - self.df = DataFrame({'A': np.random.randint(50, size=self.NUM_ROWS), 'B': np.random.randint(50, size=self.NUM_ROWS), 'C': np.random.randint((-10), 10, size=self.NUM_ROWS), 'D': np.random.randint((-10), 10, size=self.NUM_ROWS), 'E': np.random.randint(10, size=self.NUM_ROWS), 'F': np.random.randn(self.NUM_ROWS), }) - self.idf = self.df.set_index(['A', 'B', 'C', 'D', 'E']) - if (len(self.idf.index.unique()) == self.NUM_ROWS): - break - - def time_unstack_sparse_keyspace(self): - self.idf.unstack() + NUM_ROWS = 1000 + self.df = DataFrame({'A': np.random.randint(50, size=NUM_ROWS), + 'B': np.random.randint(50, size=NUM_ROWS), + 'C': np.random.randint(-10, 10, size=NUM_ROWS), + 'D': np.random.randint(-10, 10, size=NUM_ROWS), + 'E': np.random.randint(10, size=NUM_ROWS), + 'F': np.random.randn(NUM_ROWS)}) + self.df = self.df.set_index(['A', 'B', 'C', 'D', 'E']) + + def time_unstack(self): + self.df.unstack() -class wide_to_long_big(object): - goal_time = 0.2 +class WideToLong(object): def setup(self): - vars = 'ABCD' nyrs = 20 nidvars = 20 N = 5000 - yrvars = [] - for var in vars: - for yr in range(1, nyrs + 1): - yrvars.append(var + str(yr)) - - self.df = pd.DataFrame(np.random.randn(N, nidvars 
+ len(yrvars)), - columns=list(range(nidvars)) + yrvars) - self.vars = vars + self.letters = list('ABCD') + yrvars = [l + str(num) + for l, num in product(self.letters, range(1, nyrs + 1))] + columns = [str(i) for i in range(nidvars)] + yrvars + self.df = DataFrame(np.random.randn(N, nidvars + len(yrvars)), + columns=columns) + self.df['id'] = self.df.index def time_wide_to_long_big(self): - self.df['id'] = self.df.index - wide_to_long(self.df, list(self.vars), i='id', j='year') + wide_to_long(self.df, self.letters, i='id', j='year') + + +class PivotTable(object): + + def setup(self): + N = 100000 + fac1 = np.array(['A', 'B', 'C'], dtype='O') + fac2 = np.array(['one', 'two'], dtype='O') + ind1 = np.random.randint(0, 3, size=N) + ind2 = np.random.randint(0, 2, size=N) + self.df = DataFrame({'key1': fac1.take(ind1), + 'key2': fac2.take(ind2), + 'key3': fac2.take(ind2), + 'value1': np.random.randn(N), + 'value2': np.random.randn(N), + 'value3': np.random.randn(N)}) + + def time_pivot_table(self): + self.df.pivot_table(index='key1', columns=['key2', 'key3']) + + def time_pivot_table_agg(self): + self.df.pivot_table(index='key1', columns=['key2', 'key3'], + aggfunc=['sum', 'mean']) + + def time_pivot_table_margins(self): + self.df.pivot_table(index='key1', columns=['key2', 'key3'], + margins=True) + + +class Crosstab(object): + + def setup(self): + N = 100000 + fac1 = np.array(['A', 'B', 'C'], dtype='O') + fac2 = np.array(['one', 'two'], dtype='O') + self.ind1 = np.random.randint(0, 3, size=N) + self.ind2 = np.random.randint(0, 2, size=N) + self.vec1 = fac1.take(self.ind1) + self.vec2 = fac2.take(self.ind2) + + def time_crosstab(self): + pd.crosstab(self.vec1, self.vec2) + + def time_crosstab_values(self): + pd.crosstab(self.vec1, self.vec2, values=self.ind1, aggfunc='sum') + + def time_crosstab_normalize(self): + pd.crosstab(self.vec1, self.vec2, normalize=True) + + def time_crosstab_normalize_margins(self): + pd.crosstab(self.vec1, self.vec2, normalize=True, 
margins=True) + + +class GetDummies(object): + def setup(self): + categories = list(string.ascii_letters[:12]) + s = pd.Series(np.random.choice(categories, size=1000000), + dtype=pd.api.types.CategoricalDtype(categories)) + self.s = s + + def time_get_dummies_1d(self): + pd.get_dummies(self.s, sparse=False) + + def time_get_dummies_1d_sparse(self): + pd.get_dummies(self.s, sparse=True) + + +class Cut(object): + params = [[4, 10, 1000]] + param_names = ['bins'] + + def setup(self, bins): + N = 10**5 + self.int_series = pd.Series(np.arange(N).repeat(5)) + self.float_series = pd.Series(np.random.randn(N).repeat(5)) + self.timedelta_series = pd.Series(np.random.randint(N, size=N), + dtype='timedelta64[ns]') + self.datetime_series = pd.Series(np.random.randint(N, size=N), + dtype='datetime64[ns]') + + def time_cut_int(self, bins): + pd.cut(self.int_series, bins) + + def time_cut_float(self, bins): + pd.cut(self.float_series, bins) + + def time_cut_timedelta(self, bins): + pd.cut(self.timedelta_series, bins) + + def time_cut_datetime(self, bins): + pd.cut(self.datetime_series, bins) + + def time_qcut_int(self, bins): + pd.qcut(self.int_series, bins) + + def time_qcut_float(self, bins): + pd.qcut(self.float_series, bins) + + def time_qcut_timedelta(self, bins): + pd.qcut(self.timedelta_series, bins) + + def time_qcut_datetime(self, bins): + pd.qcut(self.datetime_series, bins) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/rolling.py b/asv_bench/benchmarks/rolling.py new file mode 100644 index 0000000000000..659b6591fbd4b --- /dev/null +++ b/asv_bench/benchmarks/rolling.py @@ -0,0 +1,116 @@ +import pandas as pd +import numpy as np + + +class Methods(object): + + sample_time = 0.2 + params = (['DataFrame', 'Series'], + [10, 1000], + ['int', 'float'], + ['median', 'mean', 'max', 'min', 'std', 'count', 'skew', 'kurt', + 'sum']) + param_names = ['constructor', 'window', 'dtype', 'method'] + + def setup(self, constructor, window, dtype, 
method): + N = 10**5 + arr = (100 * np.random.random(N)).astype(dtype) + self.roll = getattr(pd, constructor)(arr).rolling(window) + + def time_rolling(self, constructor, window, dtype, method): + getattr(self.roll, method)() + + +class ExpandingMethods(object): + + sample_time = 0.2 + params = (['DataFrame', 'Series'], + ['int', 'float'], + ['median', 'mean', 'max', 'min', 'std', 'count', 'skew', 'kurt', + 'sum']) + param_names = ['constructor', 'dtype', 'method'] + + def setup(self, constructor, dtype, method): + N = 10**5 + arr = (100 * np.random.random(N)).astype(dtype) + self.expanding = getattr(pd, constructor)(arr).expanding() + + def time_expanding(self, constructor, dtype, method): + getattr(self.expanding, method)() + + +class EWMMethods(object): + + sample_time = 0.2 + params = (['DataFrame', 'Series'], + [10, 1000], + ['int', 'float'], + ['mean', 'std']) + param_names = ['constructor', 'window', 'dtype', 'method'] + + def setup(self, constructor, window, dtype, method): + N = 10**5 + arr = (100 * np.random.random(N)).astype(dtype) + self.ewm = getattr(pd, constructor)(arr).ewm(halflife=window) + + def time_ewm(self, constructor, window, dtype, method): + getattr(self.ewm, method)() + + +class VariableWindowMethods(Methods): + sample_time = 0.2 + params = (['DataFrame', 'Series'], + ['50s', '1h', '1d'], + ['int', 'float'], + ['median', 'mean', 'max', 'min', 'std', 'count', 'skew', 'kurt', + 'sum']) + param_names = ['constructor', 'window', 'dtype', 'method'] + + def setup(self, constructor, window, dtype, method): + N = 10**5 + arr = (100 * np.random.random(N)).astype(dtype) + index = pd.date_range('2017-01-01', periods=N, freq='5s') + self.roll = getattr(pd, constructor)(arr, index=index).rolling(window) + + +class Pairwise(object): + + sample_time = 0.2 + params = ([10, 1000, None], + ['corr', 'cov'], + [True, False]) + param_names = ['window', 'method', 'pairwise'] + + def setup(self, window, method, pairwise): + N = 10**4 + arr = 
np.random.random(N) + self.df = pd.DataFrame(arr) + + def time_pairwise(self, window, method, pairwise): + if window is None: + r = self.df.expanding() + else: + r = self.df.rolling(window=window) + getattr(r, method)(self.df, pairwise=pairwise) + + +class Quantile(object): + sample_time = 0.2 + params = (['DataFrame', 'Series'], + [10, 1000], + ['int', 'float'], + [0, 0.5, 1], + ['linear', 'nearest', 'lower', 'higher', 'midpoint']) + param_names = ['constructor', 'window', 'dtype', 'percentile'] + + def setup(self, constructor, window, dtype, percentile, interpolation): + N = 10 ** 5 + arr = np.random.random(N).astype(dtype) + self.roll = getattr(pd, constructor)(arr).rolling(window) + + def time_quantile(self, constructor, window, dtype, percentile, + interpolation): + self.roll.quantile(percentile, interpolation=interpolation) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/series_methods.py b/asv_bench/benchmarks/series_methods.py index c66654ee1e006..3303483c50e20 100644 --- a/asv_bench/benchmarks/series_methods.py +++ b/asv_bench/benchmarks/series_methods.py @@ -1,122 +1,204 @@ -from .pandas_vb_common import * +from datetime import datetime +import numpy as np +import pandas.util.testing as tm +from pandas import Series, date_range, NaT -class series_constructor_no_data_datetime_index(object): - goal_time = 0.2 - def setup(self): - self.dr = pd.date_range( - start=datetime(2015,10,26), - end=datetime(2016,1,1), - freq='50s' - ) # ~100k long +class SeriesConstructor(object): - def time_series_constructor_no_data_datetime_index(self): - Series(data=None, index=self.dr) + params = [None, 'dict'] + param_names = ['data'] + def setup(self, data): + self.idx = date_range(start=datetime(2015, 10, 26), + end=datetime(2016, 1, 1), + freq='50s') + dict_data = dict(zip(self.idx, range(len(self.idx)))) + self.data = None if data is None else dict_data -class series_constructor_dict_data_datetime_index(object): - goal_time = 0.2 + 
def time_constructor(self, data): + Series(data=self.data, index=self.idx) - def setup(self): - self.dr = pd.date_range( - start=datetime(2015, 10, 26), - end=datetime(2016, 1, 1), - freq='50s' - ) # ~100k long - self.data = {d: v for d, v in zip(self.dr, range(len(self.dr)))} - def time_series_constructor_no_data_datetime_index(self): - Series(data=self.data, index=self.dr) +class IsIn(object): + params = ['int64', 'uint64', 'object'] + param_names = ['dtype'] -class series_isin_int64(object): - goal_time = 0.2 + def setup(self, dtype): + self.s = Series(np.random.randint(1, 10, 100000)).astype(dtype) + self.values = [1, 2] + + def time_isin(self, dtype): + self.s.isin(self.values) + + +class IsInFloat64(object): def setup(self): - self.s3 = Series(np.random.randint(1, 10, 100000)).astype('int64') - self.s4 = Series(np.random.randint(1, 100, 10000000)).astype('int64') - self.values = [1, 2] + self.small = Series([1, 2], dtype=np.float64) + self.many_different_values = np.arange(10**6, dtype=np.float64) + self.few_different_values = np.zeros(10**7, dtype=np.float64) + self.only_nans_values = np.full(10**7, np.nan, dtype=np.float64) - def time_series_isin_int64(self): - self.s3.isin(self.values) + def time_isin_many_different(self): + # runtime is dominated by creation of the lookup-table + self.small.isin(self.many_different_values) - def time_series_isin_int64_large(self): - self.s4.isin(self.values) + def time_isin_few_different(self): + # runtime is dominated by creation of the lookup-table + self.small.isin(self.few_different_values) + def time_isin_nan_values(self): + # runtime is dominated by creation of the lookup-table + self.small.isin(self.only_nans_values) -class series_isin_object(object): - goal_time = 0.2 + +class IsInForObjects(object): def setup(self): - self.s3 = Series(np.random.randint(1, 10, 100000)).astype('int64') - self.values = [1, 2] - self.s4 = self.s3.astype('object') + self.s_nans = Series(np.full(10**4, np.nan)).astype(np.object) + 
self.vals_nans = np.full(10**4, np.nan).astype(np.object) + self.s_short = Series(np.arange(2)).astype(np.object) + self.s_long = Series(np.arange(10**5)).astype(np.object) + self.vals_short = np.arange(2).astype(np.object) + self.vals_long = np.arange(10**5).astype(np.object) + # because of nans floats are special: + self.s_long_floats = Series(np.arange(10**5, + dtype=np.float)).astype(np.object) + self.vals_long_floats = np.arange(10**5, + dtype=np.float).astype(np.object) - def time_series_isin_object(self): - self.s4.isin(self.values) + def time_isin_nans(self): + # if nan-objects are different objects, + # this has the potential to trigger O(n^2) running time + self.s_nans.isin(self.vals_nans) + def time_isin_short_series_long_values(self): + # running time dominated by the preprocessing + self.s_short.isin(self.vals_long) -class series_nlargest1(object): - goal_time = 0.2 + def time_isin_long_series_short_values(self): + # running time dominated by look-up + self.s_long.isin(self.vals_short) - def setup(self): - self.s1 = Series(np.random.randn(10000)) - self.s2 = Series(np.random.randint(1, 10, 10000)) - self.s3 = Series(np.random.randint(1, 10, 100000)).astype('int64') - self.values = [1, 2] - self.s4 = self.s3.astype('object') + def time_isin_long_series_long_values(self): + # no dominating part + self.s_long.isin(self.vals_long) - def time_series_nlargest1(self): - self.s1.nlargest(3, keep='last') - self.s1.nlargest(3, keep='first') + def time_isin_long_series_long_values_floats(self): + # no dominating part + self.s_long_floats.isin(self.vals_long_floats) -class series_nlargest2(object): - goal_time = 0.2 +class NSort(object): - def setup(self): - self.s1 = Series(np.random.randn(10000)) - self.s2 = Series(np.random.randint(1, 10, 10000)) - self.s3 = Series(np.random.randint(1, 10, 100000)).astype('int64') - self.values = [1, 2] - self.s4 = self.s3.astype('object') + params = ['first', 'last', 'all'] + param_names = ['keep'] - def 
time_series_nlargest2(self): - self.s2.nlargest(3, keep='last') - self.s2.nlargest(3, keep='first') + def setup(self, keep): + self.s = Series(np.random.randint(1, 10, 100000)) + def time_nlargest(self, keep): + self.s.nlargest(3, keep=keep) -class series_nsmallest2(object): - goal_time = 0.2 + def time_nsmallest(self, keep): + self.s.nsmallest(3, keep=keep) - def setup(self): - self.s1 = Series(np.random.randn(10000)) - self.s2 = Series(np.random.randint(1, 10, 10000)) - self.s3 = Series(np.random.randint(1, 10, 100000)).astype('int64') - self.values = [1, 2] - self.s4 = self.s3.astype('object') - def time_series_nsmallest2(self): - self.s2.nsmallest(3, keep='last') - self.s2.nsmallest(3, keep='first') +class Dropna(object): + + params = ['int', 'datetime'] + param_names = ['dtype'] + + def setup(self, dtype): + N = 10**6 + data = {'int': np.random.randint(1, 10, N), + 'datetime': date_range('2000-01-01', freq='S', periods=N)} + self.s = Series(data[dtype]) + if dtype == 'datetime': + self.s[np.random.randint(1, N, 100)] = NaT + + def time_dropna(self, dtype): + self.s.dropna() + +class SearchSorted(object): -class series_dropna_int64(object): goal_time = 0.2 + params = ['int8', 'int16', 'int32', 'int64', + 'uint8', 'uint16', 'uint32', 'uint64', + 'float16', 'float32', 'float64', + 'str'] + param_names = ['dtype'] + + def setup(self, dtype): + N = 10**5 + data = np.array([1] * N + [2] * N + [3] * N).astype(dtype) + self.s = Series(data) + + def time_searchsorted(self, dtype): + key = '2' if dtype == 'str' else 2 + self.s.searchsorted(key) + + +class Map(object): + + params = ['dict', 'Series'] + param_names = ['mapper'] + + def setup(self, mapper): + map_size = 1000 + map_data = Series(map_size - np.arange(map_size)) + self.map_data = map_data if mapper == 'Series' else map_data.to_dict() + self.s = Series(np.random.randint(0, map_size, 10000)) + + def time_map(self, mapper): + self.s.map(self.map_data) + + +class Clip(object): + params = [50, 1000, 10**5] + 
param_names = ['n'] + + def setup(self, n): + self.s = Series(np.random.randn(n)) + + def time_clip(self, n): + self.s.clip(0, 1) + + +class ValueCounts(object): + + params = ['int', 'uint', 'float', 'object'] + param_names = ['dtype'] + + def setup(self, dtype): + self.s = Series(np.random.randint(0, 1000, size=100000)).astype(dtype) + + def time_value_counts(self, dtype): + self.s.value_counts() + + +class Dir(object): def setup(self): - self.s = Series(np.random.randint(1, 10, 1000000)) + self.s = Series(index=tm.makeStringIndex(10000)) - def time_series_dropna_int64(self): - self.s.dropna() + def time_dir_strings(self): + dir(self.s) -class series_dropna_datetime(object): - goal_time = 0.2 +class SeriesGetattr(object): + # https://github.com/pandas-dev/pandas/issues/19764 def setup(self): - self.s = Series(pd.date_range('2000-01-01', freq='S', periods=1000000)) - self.s[np.random.randint(1, 1000000, 100)] = pd.NaT + self.s = Series(1, + index=date_range("2012-01-01", freq='s', + periods=int(1e6))) - def time_series_dropna_datetime(self): - self.s.dropna() + def time_series_datetimeindex_repr(self): + getattr(self.s, 'a', None) + + +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/sparse.py b/asv_bench/benchmarks/sparse.py index 717fe7218ceda..64f87c1670170 100644 --- a/asv_bench/benchmarks/sparse.py +++ b/asv_bench/benchmarks/sparse.py @@ -1,142 +1,152 @@ -from .pandas_vb_common import * -import pandas.sparse.series +import itertools + +import numpy as np import scipy.sparse -from pandas.core.sparse import SparseSeries, SparseDataFrame -from pandas.core.sparse import SparseDataFrame +from pandas import (SparseSeries, SparseDataFrame, SparseArray, Series, + date_range, MultiIndex) + +def make_array(size, dense_proportion, fill_value, dtype): + dense_size = int(size * dense_proportion) + arr = np.full(size, fill_value, dtype) + indexer = np.random.choice(np.arange(size), dense_size, replace=False) + arr[indexer] = 
np.random.choice(np.arange(100, dtype=dtype), dense_size) + return arr -class sparse_series_to_frame(object): - goal_time = 0.2 + +class SparseSeriesToFrame(object): def setup(self): - self.K = 50 - self.N = 50000 - self.rng = np.asarray(date_range('1/1/2000', periods=self.N, freq='T')) + K = 50 + N = 50001 + rng = date_range('1/1/2000', periods=N, freq='T') self.series = {} - for i in range(1, (self.K + 1)): - self.data = np.random.randn(self.N)[:(- i)] - self.this_rng = self.rng[:(- i)] - self.data[100:] = np.nan - self.series[i] = SparseSeries(self.data, index=self.this_rng) + for i in range(1, K): + data = np.random.randn(N)[:-i] + idx = rng[:-i] + data[100:] = np.nan + self.series[i] = SparseSeries(data, index=idx) - def time_sparse_series_to_frame(self): + def time_series_to_frame(self): SparseDataFrame(self.series) -class sparse_frame_constructor(object): - goal_time = 0.2 +class SparseArrayConstructor(object): - def time_sparse_frame_constructor(self): - SparseDataFrame(columns=np.arange(100), index=np.arange(1000)) + params = ([0.1, 0.01], [0, np.nan], + [np.int64, np.float64, np.object]) + param_names = ['dense_proportion', 'fill_value', 'dtype'] + def setup(self, dense_proportion, fill_value, dtype): + N = 10**6 + self.array = make_array(N, dense_proportion, fill_value, dtype) -class sparse_series_from_coo(object): - goal_time = 0.2 + def time_sparse_array(self, dense_proportion, fill_value, dtype): + SparseArray(self.array, fill_value=fill_value, dtype=dtype) - def setup(self): - self.A = scipy.sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(100, 100)) - def time_sparse_series_from_coo(self): - self.ss = pandas.sparse.series.SparseSeries.from_coo(self.A) +class SparseDataFrameConstructor(object): + def setup(self): + N = 1000 + self.arr = np.arange(N) + self.sparse = scipy.sparse.rand(N, N, 0.005) + self.dict = dict(zip(range(N), itertools.repeat([0]))) -class sparse_series_to_coo(object): - goal_time = 0.2 + def 
time_constructor(self): + SparseDataFrame(columns=self.arr, index=self.arr) - def setup(self): - self.s = pd.Series(([np.nan] * 10000)) - self.s[0] = 3.0 - self.s[100] = (-1.0) - self.s[999] = 12.1 - self.s.index = pd.MultiIndex.from_product((range(10), range(10), range(10), range(10))) - self.ss = self.s.to_sparse() + def time_from_scipy(self): + SparseDataFrame(self.sparse) - def time_sparse_series_to_coo(self): - self.ss.to_coo(row_levels=[0, 1], column_levels=[2, 3], sort_labels=True) + def time_from_dict(self): + SparseDataFrame(self.dict) -class sparse_arithmetic_int(object): - goal_time = 0.2 +class FromCoo(object): def setup(self): - np.random.seed(1) - self.a_10percent = self.make_sparse_array(length=1000000, dense_size=100000, fill_value=np.nan) - self.b_10percent = self.make_sparse_array(length=1000000, dense_size=100000, fill_value=np.nan) + self.matrix = scipy.sparse.coo_matrix(([3.0, 1.0, 2.0], + ([1, 0, 0], [0, 2, 3])), + shape=(100, 100)) + + def time_sparse_series_from_coo(self): + SparseSeries.from_coo(self.matrix) - self.a_10percent_zero = self.make_sparse_array(length=1000000, dense_size=100000, fill_value=0) - self.b_10percent_zero = self.make_sparse_array(length=1000000, dense_size=100000, fill_value=0) - self.a_1percent = self.make_sparse_array(length=1000000, dense_size=10000, fill_value=np.nan) - self.b_1percent = self.make_sparse_array(length=1000000, dense_size=10000, fill_value=np.nan) +class ToCoo(object): - def make_sparse_array(self, length, dense_size, fill_value): - arr = np.array([fill_value] * length, dtype=np.float64) - indexer = np.unique(np.random.randint(0, length, dense_size)) - arr[indexer] = np.random.randint(0, 100, len(indexer)) - return pd.SparseArray(arr, fill_value=fill_value) + def setup(self): + s = Series([np.nan] * 10000) + s[0] = 3.0 + s[100] = -1.0 + s[999] = 12.1 + s.index = MultiIndex.from_product([range(10)] * 4) + self.ss = s.to_sparse() - def time_sparse_make_union(self): - 
self.a_10percent.sp_index.make_union(self.b_10percent.sp_index) + def time_sparse_series_to_coo(self): + self.ss.to_coo(row_levels=[0, 1], + column_levels=[2, 3], + sort_labels=True) - def time_sparse_intersect(self): - self.a_10percent.sp_index.intersect(self.b_10percent.sp_index) - def time_sparse_addition_10percent(self): - self.a_10percent + self.b_10percent +class Arithmetic(object): - def time_sparse_addition_10percent_zero(self): - self.a_10percent_zero + self.b_10percent_zero + params = ([0.1, 0.01], [0, np.nan]) + param_names = ['dense_proportion', 'fill_value'] - def time_sparse_addition_1percent(self): - self.a_1percent + self.b_1percent + def setup(self, dense_proportion, fill_value): + N = 10**6 + arr1 = make_array(N, dense_proportion, fill_value, np.int64) + self.array1 = SparseArray(arr1, fill_value=fill_value) + arr2 = make_array(N, dense_proportion, fill_value, np.int64) + self.array2 = SparseArray(arr2, fill_value=fill_value) - def time_sparse_division_10percent(self): - self.a_10percent / self.b_10percent + def time_make_union(self, dense_proportion, fill_value): + self.array1.sp_index.make_union(self.array2.sp_index) - def time_sparse_division_10percent_zero(self): - self.a_10percent_zero / self.b_10percent_zero + def time_intersect(self, dense_proportion, fill_value): + self.array1.sp_index.intersect(self.array2.sp_index) - def time_sparse_division_1percent(self): - self.a_1percent / self.b_1percent + def time_add(self, dense_proportion, fill_value): + self.array1 + self.array2 + def time_divide(self, dense_proportion, fill_value): + self.array1 / self.array2 -class sparse_arithmetic_block(object): - goal_time = 0.2 +class ArithmeticBlock(object): - def setup(self): - np.random.seed(1) - self.a = self.make_sparse_array(length=1000000, num_blocks=1000, - block_size=10, fill_value=np.nan) - self.b = self.make_sparse_array(length=1000000, num_blocks=1000, - block_size=10, fill_value=np.nan) + params = [np.nan, 0] + param_names = ['fill_value'] - 
self.a_zero = self.make_sparse_array(length=1000000, num_blocks=1000, - block_size=10, fill_value=0) - self.b_zero = self.make_sparse_array(length=1000000, num_blocks=1000, - block_size=10, fill_value=np.nan) + def setup(self, fill_value): + N = 10**6 + self.arr1 = self.make_block_array(length=N, num_blocks=1000, + block_size=10, fill_value=fill_value) + self.arr2 = self.make_block_array(length=N, num_blocks=1000, + block_size=10, fill_value=fill_value) - def make_sparse_array(self, length, num_blocks, block_size, fill_value): - a = np.array([fill_value] * length) - for block in range(num_blocks): - i = np.random.randint(0, length) - a[i:i + block_size] = np.random.randint(0, 100, len(a[i:i + block_size])) - return pd.SparseArray(a, fill_value=fill_value) + def make_block_array(self, length, num_blocks, block_size, fill_value): + arr = np.full(length, fill_value) + indices = np.random.choice(np.arange(0, length, block_size), + num_blocks, + replace=False) + for ind in indices: + arr[ind:ind + block_size] = np.random.randint(0, 100, block_size) + return SparseArray(arr, fill_value=fill_value) - def time_sparse_make_union(self): - self.a.sp_index.make_union(self.b.sp_index) + def time_make_union(self, fill_value): + self.arr1.sp_index.make_union(self.arr2.sp_index) - def time_sparse_intersect(self): - self.a.sp_index.intersect(self.b.sp_index) + def time_intersect(self, fill_value): + self.arr1.sp_index.intersect(self.arr2.sp_index) - def time_sparse_addition(self): - self.a + self.b + def time_addition(self, fill_value): + self.arr1 + self.arr2 - def time_sparse_addition_zero(self): - self.a_zero + self.b_zero + def time_division(self, fill_value): + self.arr1 / self.arr2 - def time_sparse_division(self): - self.a / self.b - def time_sparse_division_zero(self): - self.a_zero / self.b_zero +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/stat_ops.py b/asv_bench/benchmarks/stat_ops.py index 12fbb2478c2a5..7fdc713f076ed 100644 --- 
a/asv_bench/benchmarks/stat_ops.py +++ b/asv_bench/benchmarks/stat_ops.py @@ -1,261 +1,144 @@ -from .pandas_vb_common import * +import numpy as np +import pandas as pd -class stat_ops_frame_mean_float_axis_0(object): - goal_time = 0.2 +ops = ['mean', 'sum', 'median', 'std', 'skew', 'kurt', 'mad', 'prod', 'sem', + 'var'] - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) - def time_stat_ops_frame_mean_float_axis_0(self): - self.df.mean() +class FrameOps(object): + params = [ops, ['float', 'int'], [0, 1], [True, False]] + param_names = ['op', 'dtype', 'axis', 'use_bottleneck'] -class stat_ops_frame_mean_float_axis_1(object): - goal_time = 0.2 + def setup(self, op, dtype, axis, use_bottleneck): + df = pd.DataFrame(np.random.randn(100000, 4)).astype(dtype) + try: + pd.options.compute.use_bottleneck = use_bottleneck + except TypeError: + from pandas.core import nanops + nanops._USE_BOTTLENECK = use_bottleneck + self.df_func = getattr(df, op) - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) + def time_op(self, op, dtype, axis, use_bottleneck): + self.df_func(axis=axis) - def time_stat_ops_frame_mean_float_axis_1(self): - self.df.mean(1) +class FrameMultiIndexOps(object): -class stat_ops_frame_mean_int_axis_0(object): - goal_time = 0.2 + params = ([0, 1, [0, 1]], ops) + param_names = ['level', 'op'] - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) + def setup(self, level, op): + levels = [np.arange(10), np.arange(100), np.arange(100)] + codes = [np.arange(10).repeat(10000), + np.tile(np.arange(100).repeat(100), 10), + np.tile(np.tile(np.arange(100), 100), 10)] + index = pd.MultiIndex(levels=levels, codes=codes) + df = pd.DataFrame(np.random.randn(len(index), 4), index=index) + self.df_func = getattr(df, op) - def 
time_stat_ops_frame_mean_int_axis_0(self): - self.dfi.mean() + def time_op(self, level, op): + self.df_func(level=level) -class stat_ops_frame_mean_int_axis_1(object): - goal_time = 0.2 +class SeriesOps(object): - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) + params = [ops, ['float', 'int'], [True, False]] + param_names = ['op', 'dtype', 'use_bottleneck'] - def time_stat_ops_frame_mean_int_axis_1(self): - self.dfi.mean(1) + def setup(self, op, dtype, use_bottleneck): + s = pd.Series(np.random.randn(100000)).astype(dtype) + try: + pd.options.compute.use_bottleneck = use_bottleneck + except TypeError: + from pandas.core import nanops + nanops._USE_BOTTLENECK = use_bottleneck + self.s_func = getattr(s, op) + def time_op(self, op, dtype, use_bottleneck): + self.s_func() -class stat_ops_frame_sum_float_axis_0(object): - goal_time = 0.2 - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) +class SeriesMultiIndexOps(object): - def time_stat_ops_frame_sum_float_axis_0(self): - self.df.sum() + params = ([0, 1, [0, 1]], ops) + param_names = ['level', 'op'] + def setup(self, level, op): + levels = [np.arange(10), np.arange(100), np.arange(100)] + codes = [np.arange(10).repeat(10000), + np.tile(np.arange(100).repeat(100), 10), + np.tile(np.tile(np.arange(100), 100), 10)] + index = pd.MultiIndex(levels=levels, codes=codes) + s = pd.Series(np.random.randn(len(index)), index=index) + self.s_func = getattr(s, op) -class stat_ops_frame_sum_float_axis_1(object): - goal_time = 0.2 + def time_op(self, level, op): + self.s_func(level=level) - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) - def time_stat_ops_frame_sum_float_axis_1(self): - self.df.sum(1) +class Rank(object): + params = [['DataFrame', 'Series'], [True, False]] + 
param_names = ['constructor', 'pct'] -class stat_ops_frame_sum_int_axis_0(object): - goal_time = 0.2 + def setup(self, constructor, pct): + values = np.random.randn(10**5) + self.data = getattr(pd, constructor)(values) - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) + def time_rank(self, constructor, pct): + self.data.rank(pct=pct) - def time_stat_ops_frame_sum_int_axis_0(self): - self.dfi.sum() + def time_average_old(self, constructor, pct): + self.data.rank(pct=pct) / len(self.data) -class stat_ops_frame_sum_int_axis_1(object): - goal_time = 0.2 +class Correlation(object): - def setup(self): - self.df = DataFrame(np.random.randn(100000, 4)) - self.dfi = DataFrame(np.random.randint(1000, size=self.df.shape)) + params = [['spearman', 'kendall', 'pearson'], [True, False]] + param_names = ['method', 'use_bottleneck'] - def time_stat_ops_frame_sum_int_axis_1(self): - self.dfi.sum(1) + def setup(self, method, use_bottleneck): + try: + pd.options.compute.use_bottleneck = use_bottleneck + except TypeError: + from pandas.core import nanops + nanops._USE_BOTTLENECK = use_bottleneck + self.df = pd.DataFrame(np.random.randn(1000, 30)) + self.df2 = pd.DataFrame(np.random.randn(1000, 30)) + self.s = pd.Series(np.random.randn(1000)) + self.s2 = pd.Series(np.random.randn(1000)) + def time_corr(self, method, use_bottleneck): + self.df.corr(method=method) -class stat_ops_level_frame_sum(object): - goal_time = 0.2 + def time_corr_series(self, method, use_bottleneck): + self.s.corr(self.s2, method=method) - def setup(self): - self.index = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - random.shuffle(self.index.values) - self.df = DataFrame(np.random.randn(len(self.index), 4), index=self.index) - self.df_level = DataFrame(np.random.randn(100, 4), 
index=self.index.levels[1]) + def time_corrwith_cols(self, method, use_bottleneck): + self.df.corrwith(self.df2, method=method) - def time_stat_ops_level_frame_sum(self): - self.df.sum(level=1) + def time_corrwith_rows(self, method, use_bottleneck): + self.df.corrwith(self.df2, axis=1, method=method) -class stat_ops_level_frame_sum_multiple(object): - goal_time = 0.2 +class Covariance(object): - def setup(self): - self.index = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - random.shuffle(self.index.values) - self.df = DataFrame(np.random.randn(len(self.index), 4), index=self.index) - self.df_level = DataFrame(np.random.randn(100, 4), index=self.index.levels[1]) + params = [[True, False]] + param_names = ['use_bottleneck'] - def time_stat_ops_level_frame_sum_multiple(self): - self.df.sum(level=[0, 1]) + def setup(self, use_bottleneck): + try: + pd.options.compute.use_bottleneck = use_bottleneck + except TypeError: + from pandas.core import nanops + nanops._USE_BOTTLENECK = use_bottleneck + self.s = pd.Series(np.random.randn(100000)) + self.s2 = pd.Series(np.random.randn(100000)) + def time_cov_series(self, use_bottleneck): + self.s.cov(self.s2) -class stat_ops_level_series_sum(object): - goal_time = 0.2 - def setup(self): - self.index = MultiIndex(levels=[np.arange(10), np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - random.shuffle(self.index.values) - self.df = DataFrame(np.random.randn(len(self.index), 4), index=self.index) - self.df_level = DataFrame(np.random.randn(100, 4), index=self.index.levels[1]) - - def time_stat_ops_level_series_sum(self): - self.df[1].sum(level=1) - - -class stat_ops_level_series_sum_multiple(object): - goal_time = 0.2 - - def setup(self): - self.index = MultiIndex(levels=[np.arange(10), 
np.arange(100), np.arange(100)], labels=[np.arange(10).repeat(10000), np.tile(np.arange(100).repeat(100), 10), np.tile(np.tile(np.arange(100), 100), 10)]) - random.shuffle(self.index.values) - self.df = DataFrame(np.random.randn(len(self.index), 4), index=self.index) - self.df_level = DataFrame(np.random.randn(100, 4), index=self.index.levels[1]) - - def time_stat_ops_level_series_sum_multiple(self): - self.df[1].sum(level=[0, 1]) - - -class stat_ops_series_std(object): - goal_time = 0.2 - - def setup(self): - self.s = Series(np.random.randn(100000), index=np.arange(100000)) - self.s[::2] = np.nan - - def time_stat_ops_series_std(self): - self.s.std() - - -class stats_corr_spearman(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(1000, 30)) - - def time_stats_corr_spearman(self): - self.df.corr(method='spearman') - - -class stats_rank2d_axis0_average(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(5000, 50)) - - def time_stats_rank2d_axis0_average(self): - self.df.rank() - - -class stats_rank2d_axis1_average(object): - goal_time = 0.2 - - def setup(self): - self.df = DataFrame(np.random.randn(5000, 50)) - - def time_stats_rank2d_axis1_average(self): - self.df.rank(1) - - -class stats_rank_average(object): - goal_time = 0.2 - - def setup(self): - self.values = np.concatenate([np.arange(100000), np.random.randn(100000), np.arange(100000)]) - self.s = Series(self.values) - - def time_stats_rank_average(self): - self.s.rank() - - -class stats_rank_average_int(object): - goal_time = 0.2 - - def setup(self): - self.values = np.random.randint(0, 100000, size=200000) - self.s = Series(self.values) - - def time_stats_rank_average_int(self): - self.s.rank() - - -class stats_rank_pct_average(object): - goal_time = 0.2 - - def setup(self): - self.values = np.concatenate([np.arange(100000), np.random.randn(100000), np.arange(100000)]) - self.s = Series(self.values) - - def 
time_stats_rank_pct_average(self): - self.s.rank(pct=True) - - -class stats_rank_pct_average_old(object): - goal_time = 0.2 - - def setup(self): - self.values = np.concatenate([np.arange(100000), np.random.randn(100000), np.arange(100000)]) - self.s = Series(self.values) - - def time_stats_rank_pct_average_old(self): - (self.s.rank() / len(self.s)) - - -class stats_rolling_mean(object): - goal_time = 0.2 - - def setup(self): - self.arr = np.random.randn(100000) - self.win = 100 - - def time_rolling_mean(self): - rolling_mean(self.arr, self.win) - - def time_rolling_median(self): - rolling_median(self.arr, self.win) - - def time_rolling_min(self): - rolling_min(self.arr, self.win) - - def time_rolling_max(self): - rolling_max(self.arr, self.win) - - def time_rolling_sum(self): - rolling_sum(self.arr, self.win) - - def time_rolling_std(self): - rolling_std(self.arr, self.win) - - def time_rolling_var(self): - rolling_var(self.arr, self.win) - - def time_rolling_skew(self): - rolling_skew(self.arr, self.win) - - def time_rolling_kurt(self): - rolling_kurt(self.arr, self.win) +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/strings.py b/asv_bench/benchmarks/strings.py index c1600d4e07f58..b5b2c955f0133 100644 --- a/asv_bench/benchmarks/strings.py +++ b/asv_bench/benchmarks/strings.py @@ -1,107 +1,188 @@ -from .pandas_vb_common import * -import string -import itertools as IT -import pandas.util.testing as testing +import warnings +import numpy as np +from pandas import Series, DataFrame +import pandas.util.testing as tm -class StringMethods(object): - goal_time = 0.2 - def make_series(self, letters, strlen, size): - return Series([str(x) for x in np.fromiter(IT.cycle(letters), count=(size * strlen), dtype='|S1').view('|S{}'.format(strlen))]) +class Methods(object): def setup(self): - self.many = self.make_series(('matchthis' + string.ascii_uppercase), strlen=19, size=10000) - self.few = self.make_series(('matchthis' + 
(string.ascii_uppercase * 42)), strlen=19, size=10000) - self.s = self.make_series(string.ascii_uppercase, strlen=10, size=10000).str.join('|') - - def time_cat(self): - self.many.str.cat(sep=',') + self.s = Series(tm.makeStringIndex(10**5)) def time_center(self): - self.many.str.center(100) - - def time_contains_few(self): - self.few.str.contains('matchthis') - - def time_contains_few_noregex(self): - self.few.str.contains('matchthis', regex=False) - - def time_contains_many(self): - self.many.str.contains('matchthis') - - def time_contains_many_noregex(self): - self.many.str.contains('matchthis', regex=False) + self.s.str.center(100) def time_count(self): - self.many.str.count('matchthis') + self.s.str.count('A') def time_endswith(self): - self.many.str.endswith('matchthis') + self.s.str.endswith('A') def time_extract(self): - self.many.str.extract('(\\w*)matchthis(\\w*)') + with warnings.catch_warnings(record=True): + self.s.str.extract('(\\w*)A(\\w*)') def time_findall(self): - self.many.str.findall('[A-Z]+') + self.s.str.findall('[A-Z]+') - def time_get(self): - self.many.str.get(0) + def time_find(self): + self.s.str.find('[A-Z]+') - def time_join_split(self): - self.many.str.join('--').str.split('--') + def time_rfind(self): + self.s.str.rfind('[A-Z]+') - def time_join_split_expand(self): - self.many.str.join('--').str.split('--', expand=True) + def time_get(self): + self.s.str.get(0) def time_len(self): - self.many.str.len() + self.s.str.len() + + def time_join(self): + self.s.str.join(' ') def time_match(self): - self.many.str.match('mat..this') + self.s.str.match('A') + + def time_normalize(self): + self.s.str.normalize('NFC') def time_pad(self): - self.many.str.pad(100, side='both') + self.s.str.pad(100, side='both') - def time_repeat(self): - self.many.str.repeat(list(IT.islice(IT.cycle(range(1, 4)), len(self.many)))) + def time_partition(self): + self.s.str.partition('A') + + def time_rpartition(self): + self.s.str.rpartition('A') def 
time_replace(self): - self.many.str.replace('(matchthis)', '\x01\x01') + self.s.str.replace('A', '\x01\x01') + + def time_translate(self): + self.s.str.translate({'A': '\x01\x01'}) def time_slice(self): - self.many.str.slice(5, 15, 2) + self.s.str.slice(5, 15, 2) def time_startswith(self): - self.many.str.startswith('matchthis') + self.s.str.startswith('A') def time_strip(self): - self.many.str.strip('matchthis') + self.s.str.strip('A') def time_rstrip(self): - self.many.str.rstrip('matchthis') + self.s.str.rstrip('A') def time_lstrip(self): - self.many.str.lstrip('matchthis') + self.s.str.lstrip('A') def time_title(self): - self.many.str.title() + self.s.str.title() def time_upper(self): - self.many.str.upper() + self.s.str.upper() def time_lower(self): - self.many.str.lower() + self.s.str.lower() + + def time_wrap(self): + self.s.str.wrap(10) + + def time_zfill(self): + self.s.str.zfill(10) + + +class Repeat(object): + + params = ['int', 'array'] + param_names = ['repeats'] + + def setup(self, repeats): + N = 10**5 + self.s = Series(tm.makeStringIndex(N)) + repeat = {'int': 1, 'array': np.random.randint(1, 3, N)} + self.values = repeat[repeats] + + def time_repeat(self, repeats): + self.s.str.repeat(self.values) + + +class Cat(object): + + params = ([0, 3], [None, ','], [None, '-'], [0.0, 0.001, 0.15]) + param_names = ['other_cols', 'sep', 'na_rep', 'na_frac'] + + def setup(self, other_cols, sep, na_rep, na_frac): + N = 10 ** 5 + mask_gen = lambda: np.random.choice([True, False], N, + p=[1 - na_frac, na_frac]) + self.s = Series(tm.makeStringIndex(N)).where(mask_gen()) + if other_cols == 0: + # str.cat self-concatenates only for others=None + self.others = None + else: + self.others = DataFrame({i: tm.makeStringIndex(N).where(mask_gen()) + for i in range(other_cols)}) + + def time_cat(self, other_cols, sep, na_rep, na_frac): + # before the concatenation (one caller + other_cols columns), the total + # expected fraction of rows containing any NaN is: + # 
reduce(lambda t, _: t + (1 - t) * na_frac, range(other_cols + 1), 0) + # for other_cols=3 and na_frac=0.15, this works out to ~48% + self.s.str.cat(others=self.others, sep=sep, na_rep=na_rep) + + +class Contains(object): + + params = [True, False] + param_names = ['regex'] + + def setup(self, regex): + self.s = Series(tm.makeStringIndex(10**5)) + + def time_contains(self, regex): + self.s.str.contains('A', regex=regex) + + +class Split(object): + + params = [True, False] + param_names = ['expand'] + + def setup(self, expand): + self.s = Series(tm.makeStringIndex(10**5)).str.join('--') + + def time_split(self, expand): + self.s.str.split('--', expand=expand) + + def time_rsplit(self, expand): + self.s.str.rsplit('--', expand=expand) + + +class Dummies(object): + + def setup(self): + self.s = Series(tm.makeStringIndex(10**5)).str.join('|') def time_get_dummies(self): self.s.str.get_dummies('|') -class StringEncode(object): - goal_time = 0.2 +class Encode(object): def setup(self): - self.ser = Series(testing.makeUnicodeIndex()) + self.ser = Series(tm.makeUnicodeIndex()) def time_encode_decode(self): self.ser.str.encode('utf-8').str.decode('utf-8') + + +class Slice(object): + + def setup(self): + self.s = Series(['abcdefg', np.nan] * 500000) + + def time_vector_slice(self): + # GH 2602 + self.s.str[:5] diff --git a/asv_bench/benchmarks/timedelta.py b/asv_bench/benchmarks/timedelta.py index c112d1ef72eb8..0cfbbd536bc8b 100644 --- a/asv_bench/benchmarks/timedelta.py +++ b/asv_bench/benchmarks/timedelta.py @@ -1,42 +1,153 @@ -from .pandas_vb_common import * -from pandas import to_timedelta, Timestamp +import datetime +import numpy as np -class ToTimedelta(object): - goal_time = 0.2 +from pandas import ( + DataFrame, Series, Timedelta, Timestamp, timedelta_range, to_timedelta) - def setup(self): - self.arr = np.random.randint(0, 1000, size=10000) - self.arr2 = ['{0} days'.format(i) for i in self.arr] - self.arr3 = np.random.randint(0, 60, size=10000) - self.arr3 = 
['00:00:{0:02d}'.format(i) for i in self.arr3]
+class TimedeltaConstructor(object):
+
+    def time_from_int(self):
+        Timedelta(123456789)
+
+    def time_from_unit(self):
+        Timedelta(1, unit='d')
+
+    def time_from_components(self):
+        Timedelta(days=1, hours=2, minutes=3, seconds=4, milliseconds=5,
+                  microseconds=6, nanoseconds=7)
+
+    def time_from_datetime_timedelta(self):
+        Timedelta(datetime.timedelta(days=1, seconds=1))
+
+    def time_from_np_timedelta(self):
+        Timedelta(np.timedelta64(1, 'ms'))
+
+    def time_from_string(self):
+        Timedelta('1 days')
+
+    def time_from_iso_format(self):
+        Timedelta('P4DT12H30M5S')
+
+    def time_from_missing(self):
+        Timedelta('nat')
+

-        self.arr4 = list(self.arr2)
-        self.arr4[-1] = 'apple'
+class ToTimedelta(object):
+
+    def setup(self):
+        self.ints = np.random.randint(0, 60, size=10000)
+        self.str_days = []
+        self.str_seconds = []
+        for i in self.ints:
+            self.str_days.append('{0} days'.format(i))
+            self.str_seconds.append('00:00:{0:02d}'.format(i))

     def time_convert_int(self):
-        to_timedelta(self.arr, unit='s')
+        to_timedelta(self.ints, unit='s')

-    def time_convert_string(self):
-        to_timedelta(self.arr2)
+    def time_convert_string_days(self):
+        to_timedelta(self.str_days)

     def time_convert_string_seconds(self):
-        to_timedelta(self.arr3)
+        to_timedelta(self.str_seconds)
+
+
+class ToTimedeltaErrors(object):
+
+    params = ['coerce', 'ignore']
+    param_names = ['errors']

-    def time_convert_coerce(self):
-        to_timedelta(self.arr4, errors='coerce')
+    def setup(self, errors):
+        ints = np.random.randint(0, 60, size=10000)
+        self.arr = ['{0} days'.format(i) for i in ints]
+        self.arr[-1] = 'apple'

-    def time_convert_ignore(self):
-        to_timedelta(self.arr4, errors='ignore')
+    def time_convert(self, errors):
+        to_timedelta(self.arr, errors=errors)


-class Ops(object):
-    goal_time = 0.2
+class TimedeltaOps(object):

     def setup(self):
         self.td = to_timedelta(np.arange(1000000))
         self.ts = Timestamp('2000')

-    def test_add_td_ts(self):
+    def time_add_td_ts(self):
         self.td + self.ts
+
+
+class TimedeltaProperties(object):
+
+    def setup_cache(self):
+        td = Timedelta(days=365, minutes=35, seconds=25, milliseconds=35)
+        return td
+
+    def time_timedelta_days(self, td):
+        td.days
+
+    def time_timedelta_seconds(self, td):
+        td.seconds
+
+    def time_timedelta_microseconds(self, td):
+        td.microseconds
+
+    def time_timedelta_nanoseconds(self, td):
+        td.nanoseconds
+
+
+class DatetimeAccessor(object):
+
+    def setup_cache(self):
+        N = 100000
+        series = Series(timedelta_range('1 days', periods=N, freq='h'))
+        return series
+
+    def time_dt_accessor(self, series):
+        series.dt
+
+    def time_timedelta_days(self, series):
+        series.dt.days
+
+    def time_timedelta_seconds(self, series):
+        series.dt.seconds
+
+    def time_timedelta_microseconds(self, series):
+        series.dt.microseconds
+
+    def time_timedelta_nanoseconds(self, series):
+        series.dt.nanoseconds
+
+
+class TimedeltaIndexing(object):
+
+    def setup(self):
+        self.index = timedelta_range(start='1985', periods=1000, freq='D')
+        self.index2 = timedelta_range(start='1986', periods=1000, freq='D')
+        self.series = Series(range(1000), index=self.index)
+        self.timedelta = self.index[500]
+
+    def time_get_loc(self):
+        self.index.get_loc(self.timedelta)
+
+    def time_shape(self):
+        self.index.shape
+
+    def time_shallow_copy(self):
+        self.index._shallow_copy()
+
+    def time_series_loc(self):
+        self.series.loc[self.timedelta]
+
+    def time_align(self):
+        DataFrame({'a': self.series, 'b': self.series[:500]})
+
+    def time_intersection(self):
+        self.index.intersection(self.index2)
+
+    def time_union(self):
+        self.index.union(self.index2)
+
+    def time_unique(self):
+        self.index.unique()
diff --git a/asv_bench/benchmarks/timeseries.py b/asv_bench/benchmarks/timeseries.py
index 6e9ef4b10273c..6efd720d1acdd 100644
--- a/asv_bench/benchmarks/timeseries.py
+++ b/asv_bench/benchmarks/timeseries.py
@@ -1,349 +1,313 @@
-from pandas.tseries.converter import DatetimeConverter
-from .pandas_vb_common import *
-import
pandas as pd from datetime import timedelta -import datetime as dt + +import dateutil +import numpy as np +from pandas import to_datetime, date_range, Series, DataFrame, period_range +from pandas.tseries.frequencies import infer_freq try: - import pandas.tseries.holiday + from pandas.plotting._converter import DatetimeConverter except ImportError: - pass -from pandas.tseries.frequencies import infer_freq -import numpy as np - -if hasattr(Series, 'convert'): - Series.resample = Series.convert + from pandas.tseries.converter import DatetimeConverter class DatetimeIndex(object): - goal_time = 0.2 - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - self.delta_offset = pd.offsets.Day() - self.fast_offset = pd.offsets.DateOffset(months=2, days=2) - self.slow_offset = pd.offsets.BusinessDay() + params = ['dst', 'repeated', 'tz_aware', 'tz_local', 'tz_naive'] + param_names = ['index_type'] - self.rng2 = date_range(start='1/1/2000 9:30', periods=10000, freq='S', tz='US/Eastern') + def setup(self, index_type): + N = 100000 + dtidxes = {'dst': date_range(start='10/29/2000 1:00:00', + end='10/29/2000 1:59:59', freq='S'), + 'repeated': date_range(start='2000', + periods=N / 10, + freq='s').repeat(10), + 'tz_aware': date_range(start='2000', + periods=N, + freq='s', + tz='US/Eastern'), + 'tz_local': date_range(start='2000', + periods=N, + freq='s', + tz=dateutil.tz.tzlocal()), + 'tz_naive': date_range(start='2000', + periods=N, + freq='s')} + self.index = dtidxes[index_type] - self.index_repeated = date_range(start='1/1/2000', periods=1000, freq='T').repeat(10) + def time_add_timedelta(self, index_type): + self.index + timedelta(minutes=2) - self.rng3 = date_range(start='1/1/2000', periods=1000, freq='H') - self.df = DataFrame(np.random.randn(len(self.rng3), 2), self.rng3) + def time_normalize(self, index_type): + self.index.normalize() - self.rng4 = date_range(start='1/1/2000', periods=1000, freq='H', tz='US/Eastern') - 
self.df2 = DataFrame(np.random.randn(len(self.rng4), 2), index=self.rng4) + def time_unique(self, index_type): + self.index.unique() - N = 100000 - self.dti = pd.date_range('2011-01-01', freq='H', periods=N).repeat(5) - self.dti_tz = pd.date_range('2011-01-01', freq='H', periods=N, - tz='Asia/Tokyo').repeat(5) + def time_to_time(self, index_type): + self.index.time + + def time_get(self, index_type): + self.index[0] - self.rng5 = date_range(start='1/1/2000', end='3/1/2000', tz='US/Eastern') + def time_timeseries_is_month_start(self, index_type): + self.index.is_month_start - self.dst_rng = date_range(start='10/29/2000 1:00:00', end='10/29/2000 1:59:59', freq='S') - self.index = date_range(start='10/29/2000', end='10/29/2000 00:59:59', freq='S') - self.index = self.index.append(self.dst_rng) - self.index = self.index.append(self.dst_rng) - self.index = self.index.append(date_range(start='10/29/2000 2:00:00', end='10/29/2000 3:00:00', freq='S')) + def time_to_date(self, index_type): + self.index.date - self.N = 10000 - self.rng6 = date_range(start='1/1/1', periods=self.N, freq='B') + def time_to_pydatetime(self, index_type): + self.index.to_pydatetime() - self.rng7 = date_range(start='1/1/1700', freq='D', periods=100000) - self.a = self.rng7[:50000].append(self.rng7[50002:]) - def time_add_timedelta(self): - (self.rng + timedelta(minutes=2)) +class TzLocalize(object): - def time_add_offset_delta(self): - (self.rng + self.delta_offset) + params = [None, 'US/Eastern', 'UTC', dateutil.tz.tzutc()] + param_names = 'tz' - def time_add_offset_fast(self): - (self.rng + self.fast_offset) + def setup(self, tz): + dst_rng = date_range(start='10/29/2000 1:00:00', + end='10/29/2000 1:59:59', freq='S') + self.index = date_range(start='10/29/2000', + end='10/29/2000 00:59:59', freq='S') + self.index = self.index.append(dst_rng) + self.index = self.index.append(dst_rng) + self.index = self.index.append(date_range(start='10/29/2000 2:00:00', + end='10/29/2000 3:00:00', + freq='S')) - 
def time_add_offset_slow(self): - (self.rng + self.slow_offset) + def time_infer_dst(self, tz): + self.index.tz_localize(tz, ambiguous='infer') - def time_normalize(self): - self.rng2.normalize() - def time_unique(self): - self.index_repeated.unique() +class ResetIndex(object): - def time_reset_index(self): + params = [None, 'US/Eastern'] + param_names = 'tz' + + def setup(self, tz): + idx = date_range(start='1/1/2000', periods=1000, freq='H', tz=tz) + self.df = DataFrame(np.random.randn(1000, 2), index=idx) + + def time_reest_datetimeindex(self, tz): self.df.reset_index() - def time_reset_index_tz(self): - self.df2.reset_index() - def time_dti_factorize(self): +class Factorize(object): + + params = [None, 'Asia/Tokyo'] + param_names = 'tz' + + def setup(self, tz): + N = 100000 + self.dti = date_range('2011-01-01', freq='H', periods=N, tz=tz) + self.dti = self.dti.repeat(5) + + def time_factorize(self, tz): self.dti.factorize() - def time_dti_tz_factorize(self): - self.dti_tz.factorize() - def time_timestamp_tzinfo_cons(self): - self.rng5[0] +class InferFreq(object): - def time_infer_dst(self): - self.index.tz_localize('US/Eastern', infer_dst=True) + params = [None, 'D', 'B'] + param_names = ['freq'] - def time_timeseries_is_month_start(self): - self.rng6.is_month_start + def setup(self, freq): + if freq is None: + self.idx = date_range(start='1/1/1700', freq='D', periods=10000) + self.idx.freq = None + else: + self.idx = date_range(start='1/1/1700', freq=freq, periods=10000) - def time_infer_freq(self): - infer_freq(self.a) + def time_infer_freq(self, freq): + infer_freq(self.idx) class TimeDatetimeConverter(object): - goal_time = 0.2 def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') + N = 100000 + self.rng = date_range(start='1/1/2000', periods=N, freq='T') def time_convert(self): DatetimeConverter.convert(self.rng, None, None) class Iteration(object): - goal_time = 0.2 - - def setup(self): - self.N = 1000000 - 
self.M = 10000 - self.idx1 = date_range(start='20140101', freq='T', periods=self.N) - self.idx2 = period_range(start='20140101', freq='T', periods=self.N) - - def iter_n(self, iterable, n=None): - self.i = 0 - for _ in iterable: - self.i += 1 - if ((n is not None) and (self.i > n)): - break - def time_iter_datetimeindex(self): - self.iter_n(self.idx1) + params = [date_range, period_range] + param_names = ['time_index'] - def time_iter_datetimeindex_preexit(self): - self.iter_n(self.idx1, self.M) + def setup(self, time_index): + N = 10**6 + self.idx = time_index(start='20140101', freq='T', periods=N) + self.exit = 10000 - def time_iter_periodindex(self): - self.iter_n(self.idx2) - - def time_iter_periodindex_preexit(self): - self.iter_n(self.idx2, self.M) + def time_iter(self, time_index): + for _ in self.idx: + pass + def time_iter_preexit(self, time_index): + for i, _ in enumerate(self.idx): + if i > self.exit: + break -#---------------------------------------------------------------------- -# Resampling class ResampleDataFrame(object): - goal_time = 0.2 - - def setup(self): - self.rng = date_range(start='20130101', periods=100000, freq='50L') - self.df = DataFrame(np.random.randn(100000, 2), index=self.rng) - - def time_max_numpy(self): - self.df.resample('1s', how=np.max) - def time_max_string(self): - self.df.resample('1s', how='max') + params = ['max', 'mean', 'min'] + param_names = ['method'] - def time_mean_numpy(self): - self.df.resample('1s', how=np.mean) + def setup(self, method): + rng = date_range(start='20130101', periods=100000, freq='50L') + df = DataFrame(np.random.randn(100000, 2), index=rng) + self.resample = getattr(df.resample('1s'), method) - def time_mean_string(self): - self.df.resample('1s', how='mean') - - def time_min_numpy(self): - self.df.resample('1s', how=np.min) - - def time_min_string(self): - self.df.resample('1s', how='min') + def time_method(self, method): + self.resample() class ResampleSeries(object): - goal_time = 0.2 - - def 
setup(self): - self.rng1 = period_range(start='1/1/2000', end='1/1/2001', freq='T') - self.ts1 = Series(np.random.randn(len(self.rng1)), index=self.rng1) - - self.rng2 = date_range(start='1/1/2000', end='1/1/2001', freq='T') - self.ts2 = Series(np.random.randn(len(self.rng2)), index=self.rng2) - self.rng3 = date_range(start='2000-01-01 00:00:00', end='2000-01-01 10:00:00', freq='555000U') - self.int_ts = Series(5, self.rng3, dtype='int64') - self.dt_ts = self.int_ts.astype('datetime64[ns]') + params = (['period', 'datetime'], ['5min', '1D'], ['mean', 'ohlc']) + param_names = ['index', 'freq', 'method'] - def time_period_downsample_mean(self): - self.ts1.resample('D', how='mean') + def setup(self, index, freq, method): + indexes = {'period': period_range(start='1/1/2000', + end='1/1/2001', + freq='T'), + 'datetime': date_range(start='1/1/2000', + end='1/1/2001', + freq='T')} + idx = indexes[index] + ts = Series(np.random.randn(len(idx)), index=idx) + self.resample = getattr(ts.resample(freq), method) - def time_timestamp_downsample_mean(self): - self.ts2.resample('D', how='mean') + def time_resample(self, index, freq, method): + self.resample() - def time_resample_datetime64(self): - # GH 7754 - self.dt_ts.resample('1S', how='last') - def time_1min_5min_mean(self): - self.ts2[:10000].resample('5min', how='mean') +class ResampleDatetetime64(object): + # GH 7754 + def setup(self): + rng3 = date_range(start='2000-01-01 00:00:00', + end='2000-01-01 10:00:00', freq='555000U') + self.dt_ts = Series(5, rng3, dtype='datetime64[ns]') - def time_1min_5min_ohlc(self): - self.ts2[:10000].resample('5min', how='ohlc') + def time_resample(self): + self.dt_ts.resample('1S').last() class AsOf(object): - goal_time = 0.2 - def setup(self): - self.N = 10000 - self.rng = date_range(start='1/1/1990', periods=self.N, freq='53s') - self.ts = Series(np.random.randn(self.N), index=self.rng) - self.dates = date_range(start='1/1/1990', periods=(self.N * 10), freq='5s') + params = ['DataFrame', 
'Series'] + param_names = ['constructor'] + + def setup(self, constructor): + N = 10000 + M = 10 + rng = date_range(start='1/1/1990', periods=N, freq='53s') + data = {'DataFrame': DataFrame(np.random.randn(N, M)), + 'Series': Series(np.random.randn(N))} + self.ts = data[constructor] + self.ts.index = rng self.ts2 = self.ts.copy() - self.ts2[250:5000] = np.nan + self.ts2.iloc[250:5000] = np.nan self.ts3 = self.ts.copy() - self.ts3[-5000:] = np.nan + self.ts3.iloc[-5000:] = np.nan + self.dates = date_range(start='1/1/1990', periods=N * 10, freq='5s') + self.date = self.dates[0] + self.date_last = self.dates[-1] + self.date_early = self.date - timedelta(10) # test speed of pre-computing NAs. - def time_asof(self): + def time_asof(self, constructor): self.ts.asof(self.dates) # should be roughly the same as above. - def time_asof_nan(self): + def time_asof_nan(self, constructor): self.ts2.asof(self.dates) # test speed of the code path for a scalar index # without *while* loop - def time_asof_single(self): - self.ts.asof(self.dates[0]) + def time_asof_single(self, constructor): + self.ts.asof(self.date) # test speed of the code path for a scalar index # before the start. should be the same as above. - def time_asof_single_early(self): - self.ts.asof(self.dates[0] - dt.timedelta(10)) + def time_asof_single_early(self, constructor): + self.ts.asof(self.date_early) # test the speed of the code path for a scalar index # with a long *while* loop. should still be much # faster than pre-computing all the NAs. 
- def time_asof_nan_single(self): - self.ts3.asof(self.dates[-1]) + def time_asof_nan_single(self, constructor): + self.ts3.asof(self.date_last) -class AsOfDataFrame(object): - goal_time = 0.2 +class SortIndex(object): - def setup(self): - self.N = 10000 - self.M = 100 - self.rng = date_range(start='1/1/1990', periods=self.N, freq='53s') - self.dates = date_range(start='1/1/1990', periods=(self.N * 10), freq='5s') - self.ts = DataFrame(np.random.randn(self.N, self.M), index=self.rng) - self.ts2 = self.ts.copy() - self.ts2.iloc[250:5000] = np.nan - self.ts3 = self.ts.copy() - self.ts3.iloc[-5000:] = np.nan + params = [True, False] + param_names = ['monotonic'] - # test speed of pre-computing NAs. - def time_asof(self): - self.ts.asof(self.dates) - - # should be roughly the same as above. - def time_asof_nan(self): - self.ts2.asof(self.dates) + def setup(self, monotonic): + N = 10**5 + idx = date_range(start='1/1/2000', periods=N, freq='s') + self.s = Series(np.random.randn(N), index=idx) + if not monotonic: + self.s = self.s.sample(frac=1) - # test speed of the code path for a scalar index - # with pre-computing all NAs. - def time_asof_single(self): - self.ts.asof(self.dates[0]) + def time_sort_index(self, monotonic): + self.s.sort_index() - # should be roughly the same as above. - def time_asof_nan_single(self): - self.ts3.asof(self.dates[-1]) + def time_get_slice(self, monotonic): + self.s[:10000] - # test speed of the code path for a scalar index - # before the start. should be without the cost of - # pre-computing all the NAs. 
- def time_asof_single_early(self): - self.ts.asof(self.dates[0] - dt.timedelta(10)) - -class TimeSeries(object): - goal_time = 0.2 +class IrregularOps(object): def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='s') - self.rng = self.rng.take(np.random.permutation(self.N)) - self.ts = Series(np.random.randn(self.N), index=self.rng) - - self.rng2 = date_range(start='1/1/2000', periods=self.N, freq='T') - self.ts2 = Series(np.random.randn(self.N), index=self.rng2) + N = 10**5 + idx = date_range(start='1/1/2000', periods=N, freq='s') + s = Series(np.random.randn(N), index=idx) + self.left = s.sample(frac=1) + self.right = s.sample(frac=1) - self.lindex = np.random.permutation(self.N)[:(self.N // 2)] - self.rindex = np.random.permutation(self.N)[:(self.N // 2)] - self.left = Series(self.ts2.values.take(self.lindex), index=self.ts2.index.take(self.lindex)) - self.right = Series(self.ts2.values.take(self.rindex), index=self.ts2.index.take(self.rindex)) + def time_add(self): + self.left + self.right - self.rng3 = date_range(start='1/1/2000', periods=1500000, freq='S') - self.ts3 = Series(1, index=self.rng3) - def time_sort_index(self): - self.ts.sort_index() +class Lookup(object): - def time_timeseries_slice_minutely(self): - self.ts2[:10000] - - def time_add_irregular(self): - (self.left + self.right) + def setup(self): + N = 1500000 + rng = date_range(start='1/1/2000', periods=N, freq='S') + self.ts = Series(1, index=rng) + self.lookup_val = rng[N // 2] - def time_large_lookup_value(self): - self.ts3[self.ts3.index[(len(self.ts3) // 2)]] - self.ts3.index._cleanup() + def time_lookup_and_cleanup(self): + self.ts[self.lookup_val] + self.ts.index._cleanup() -class SeriesArithmetic(object): - goal_time = 0.2 +class ToDatetimeYYYYMMDD(object): def setup(self): - self.N = 100000 - self.s = Series(date_range(start='20140101', freq='T', periods=self.N)) - self.delta_offset = pd.offsets.Day() - self.fast_offset = 
pd.offsets.DateOffset(months=2, days=2) - self.slow_offset = pd.offsets.BusinessDay() - - def time_add_offset_delta(self): - (self.s + self.delta_offset) + rng = date_range(start='1/1/2000', periods=10000, freq='D') + self.stringsD = Series(rng.strftime('%Y%m%d')) - def time_add_offset_fast(self): - (self.s + self.fast_offset) - - def time_add_offset_slow(self): - (self.s + self.slow_offset) + def time_format_YYYYMMDD(self): + to_datetime(self.stringsD, format='%Y%m%d') -class ToDatetime(object): - goal_time = 0.2 +class ToDatetimeISO8601(object): def setup(self): - self.rng = date_range(start='1/1/2000', periods=10000, freq='D') - self.stringsD = Series((((self.rng.year * 10000) + (self.rng.month * 100)) + self.rng.day), dtype=np.int64).apply(str) - - self.rng = date_range(start='1/1/2000', periods=20000, freq='H') - self.strings = [x.strftime('%Y-%m-%d %H:%M:%S') for x in self.rng] - self.strings_nosep = [x.strftime('%Y%m%d %H:%M:%S') for x in self.rng] + rng = date_range(start='1/1/2000', periods=20000, freq='H') + self.strings = rng.strftime('%Y-%m-%d %H:%M:%S').tolist() + self.strings_nosep = rng.strftime('%Y%m%d %H:%M:%S').tolist() self.strings_tz_space = [x.strftime('%Y-%m-%d %H:%M:%S') + ' -0800' - for x in self.rng] - - self.s = Series((['19MAY11', '19MAY11:00:00:00'] * 100000)) - self.s2 = self.s.str.replace(':\\S+$', '') - - def time_format_YYYYMMDD(self): - to_datetime(self.stringsD, format='%Y%m%d') + for x in rng] def time_iso8601(self): to_datetime(self.strings) @@ -360,138 +324,105 @@ def time_iso8601_format_no_sep(self): def time_iso8601_tz_spaceformat(self): to_datetime(self.strings_tz_space) - def time_format_exact(self): - to_datetime(self.s2, format='%d%b%y') - - def time_format_no_exact(self): - to_datetime(self.s, format='%d%b%y', exact=False) - -class Offsets(object): - goal_time = 0.2 +class ToDatetimeNONISO8601(object): def setup(self): - self.date = dt.datetime(2011, 1, 1) - self.dt64 = np.datetime64('2011-01-01 09:00Z') - self.hcal = 
pd.tseries.holiday.USFederalHolidayCalendar() - self.day = pd.offsets.Day() - self.year = pd.offsets.YearBegin() - self.cday = pd.offsets.CustomBusinessDay() - self.cmb = pd.offsets.CustomBusinessMonthBegin(calendar=self.hcal) - self.cme = pd.offsets.CustomBusinessMonthEnd(calendar=self.hcal) - self.cdayh = pd.offsets.CustomBusinessDay(calendar=self.hcal) - - def time_timeseries_day_apply(self): - self.day.apply(self.date) - - def time_timeseries_day_incr(self): - (self.date + self.day) - - def time_timeseries_year_apply(self): - self.year.apply(self.date) + N = 10000 + half = int(N / 2) + ts_string_1 = 'March 1, 2018 12:00:00+0400' + ts_string_2 = 'March 1, 2018 12:00:00+0500' + self.same_offset = [ts_string_1] * N + self.diff_offset = [ts_string_1] * half + [ts_string_2] * half - def time_timeseries_year_incr(self): - (self.date + self.year) + def time_same_offset(self): + to_datetime(self.same_offset) - # custom business offsets + def time_different_offset(self): + to_datetime(self.diff_offset) - def time_custom_bday_decr(self): - (self.date - self.cday) - def time_custom_bday_incr(self): - (self.date + self.cday) +class ToDatetimeFormatQuarters(object): - def time_custom_bday_apply(self): - self.cday.apply(self.date) - - def time_custom_bday_apply_dt64(self): - self.cday.apply(self.dt64) - - def time_custom_bday_cal_incr(self): - self.date + 1 * self.cdayh + def setup(self): + self.s = Series(['2Q2005', '2Q05', '2005Q1', '05Q1'] * 10000) - def time_custom_bday_cal_decr(self): - self.date - 1 * self.cdayh + def time_infer_quarter(self): + to_datetime(self.s) - def time_custom_bday_cal_incr_n(self): - self.date + 10 * self.cdayh - def time_custom_bday_cal_incr_neg_n(self): - self.date - 10 * self.cdayh +class ToDatetimeFormat(object): - # Increment custom business month + def setup(self): + self.s = Series(['19MAY11', '19MAY11:00:00:00'] * 100000) + self.s2 = self.s.str.replace(':\\S+$', '') - def time_custom_bmonthend_incr(self): - (self.date + self.cme) + def 
time_exact(self): + to_datetime(self.s2, format='%d%b%y') - def time_custom_bmonthend_incr_n(self): - (self.date + (10 * self.cme)) + def time_no_exact(self): + to_datetime(self.s, format='%d%b%y', exact=False) - def time_custom_bmonthend_decr_n(self): - (self.date - (10 * self.cme)) - def time_custom_bmonthbegin_decr_n(self): - (self.date - (10 * self.cmb)) +class ToDatetimeCache(object): - def time_custom_bmonthbegin_incr_n(self): - (self.date + (10 * self.cmb)) + params = [True, False] + param_names = ['cache'] + def setup(self, cache): + N = 10000 + self.unique_numeric_seconds = list(range(N)) + self.dup_numeric_seconds = [1000] * N + self.dup_string_dates = ['2000-02-11'] * N + self.dup_string_with_tz = ['2000-02-11 15:00:00-0800'] * N -class SemiMonthOffset(object): - goal_time = 0.2 + def time_unique_seconds_and_unit(self, cache): + to_datetime(self.unique_numeric_seconds, unit='s', cache=cache) - def setup(self): - self.N = 100000 - self.rng = date_range(start='1/1/2000', periods=self.N, freq='T') - # date is not on an offset which will be slowest case - self.date = dt.datetime(2011, 1, 2) - self.semi_month_end = pd.offsets.SemiMonthEnd() - self.semi_month_begin = pd.offsets.SemiMonthBegin() + def time_dup_seconds_and_unit(self, cache): + to_datetime(self.dup_numeric_seconds, unit='s', cache=cache) - def time_end_apply(self): - self.semi_month_end.apply(self.date) + def time_dup_string_dates(self, cache): + to_datetime(self.dup_string_dates, cache=cache) - def time_end_incr(self): - self.date + self.semi_month_end + def time_dup_string_dates_and_format(self, cache): + to_datetime(self.dup_string_dates, format='%Y-%m-%d', cache=cache) - def time_end_incr_n(self): - self.date + 10 * self.semi_month_end + def time_dup_string_tzoffset_dates(self, cache): + to_datetime(self.dup_string_with_tz, cache=cache) - def time_end_decr(self): - self.date - self.semi_month_end - def time_end_decr_n(self): - self.date - 10 * self.semi_month_end +class 
DatetimeAccessor(object): - def time_end_apply_index(self): - self.semi_month_end.apply_index(self.rng) + params = [None, 'US/Eastern', 'UTC', dateutil.tz.tzutc()] + param_names = ['tz'] - def time_end_incr_rng(self): - self.rng + self.semi_month_end + def setup(self, tz): + N = 100000 + self.series = Series( + date_range(start='1/1/2000', periods=N, freq='T', tz=tz) + ) - def time_end_decr_rng(self): - self.rng - self.semi_month_end + def time_dt_accessor(self, tz): + self.series.dt - def time_begin_apply(self): - self.semi_month_begin.apply(self.date) + def time_dt_accessor_normalize(self, tz): + self.series.dt.normalize() - def time_begin_incr(self): - self.date + self.semi_month_begin + def time_dt_accessor_month_name(self, tz): + self.series.dt.month_name() - def time_begin_incr_n(self): - self.date + 10 * self.semi_month_begin + def time_dt_accessor_day_name(self, tz): + self.series.dt.day_name() - def time_begin_decr(self): - self.date - self.semi_month_begin + def time_dt_accessor_time(self, tz): + self.series.dt.time - def time_begin_decr_n(self): - self.date - 10 * self.semi_month_begin + def time_dt_accessor_date(self, tz): + self.series.dt.date - def time_begin_apply_index(self): - self.semi_month_begin.apply_index(self.rng) + def time_dt_accessor_year(self, tz): + self.series.dt.year - def time_begin_incr_rng(self): - self.rng + self.semi_month_begin - def time_begin_decr_rng(self): - self.rng - self.semi_month_begin +from .pandas_vb_common import setup # noqa: F401 diff --git a/asv_bench/benchmarks/timestamp.py b/asv_bench/benchmarks/timestamp.py new file mode 100644 index 0000000000000..b45ae22650e17 --- /dev/null +++ b/asv_bench/benchmarks/timestamp.py @@ -0,0 +1,140 @@ +import datetime + +import dateutil +import pytz + +from pandas import Timestamp + + +class TimestampConstruction(object): + + def time_parse_iso8601_no_tz(self): + Timestamp('2017-08-25 08:16:14') + + def time_parse_iso8601_tz(self): + Timestamp('2017-08-25 08:16:14-0500') + + def 
time_parse_dateutil(self): + Timestamp('2017/08/25 08:16:14 AM') + + def time_parse_today(self): + Timestamp('today') + + def time_parse_now(self): + Timestamp('now') + + def time_fromordinal(self): + Timestamp.fromordinal(730120) + + def time_fromtimestamp(self): + Timestamp.fromtimestamp(1515448538) + + +class TimestampProperties(object): + _tzs = [None, pytz.timezone('Europe/Amsterdam'), pytz.UTC, + dateutil.tz.tzutc()] + _freqs = [None, 'B'] + params = [_tzs, _freqs] + param_names = ['tz', 'freq'] + + def setup(self, tz, freq): + self.ts = Timestamp('2017-08-25 08:16:14', tzinfo=tz, freq=freq) + + def time_tz(self, tz, freq): + self.ts.tz + + def time_dayofweek(self, tz, freq): + self.ts.dayofweek + + def time_weekday_name(self, tz, freq): + self.ts.day_name() + + def time_dayofyear(self, tz, freq): + self.ts.dayofyear + + def time_week(self, tz, freq): + self.ts.week + + def time_quarter(self, tz, freq): + self.ts.quarter + + def time_days_in_month(self, tz, freq): + self.ts.days_in_month + + def time_freqstr(self, tz, freq): + self.ts.freqstr + + def time_is_month_start(self, tz, freq): + self.ts.is_month_start + + def time_is_month_end(self, tz, freq): + self.ts.is_month_end + + def time_is_quarter_start(self, tz, freq): + self.ts.is_quarter_start + + def time_is_quarter_end(self, tz, freq): + self.ts.is_quarter_end + + def time_is_year_start(self, tz, freq): + self.ts.is_year_start + + def time_is_year_end(self, tz, freq): + self.ts.is_year_end + + def time_is_leap_year(self, tz, freq): + self.ts.is_leap_year + + def time_microsecond(self, tz, freq): + self.ts.microsecond + + def time_month_name(self, tz, freq): + self.ts.month_name() + + +class TimestampOps(object): + params = [None, 'US/Eastern', pytz.UTC, + dateutil.tz.tzutc()] + param_names = ['tz'] + + def setup(self, tz): + self.ts = Timestamp('2017-08-25 08:16:14', tz=tz) + + def time_replace_tz(self, tz): + self.ts.replace(tzinfo=pytz.timezone('US/Eastern')) + + def time_replace_None(self, tz): + 
self.ts.replace(tzinfo=None) + + def time_to_pydatetime(self, tz): + self.ts.to_pydatetime() + + def time_normalize(self, tz): + self.ts.normalize() + + def time_tz_convert(self, tz): + if self.ts.tz is not None: + self.ts.tz_convert(tz) + + def time_tz_localize(self, tz): + if self.ts.tz is None: + self.ts.tz_localize(tz) + + def time_to_julian_date(self, tz): + self.ts.to_julian_date() + + def time_floor(self, tz): + self.ts.floor('5T') + + def time_ceil(self, tz): + self.ts.ceil('5T') + + +class TimestampAcrossDst(object): + def setup(self): + dt = datetime.datetime(2016, 3, 27, 1) + self.tzinfo = pytz.timezone('CET').localize(dt, is_dst=False).tzinfo + self.ts2 = Timestamp(dt) + + def time_replace_across_dst(self): + self.ts2.replace(tzinfo=self.tzinfo) diff --git a/asv_bench/vbench_to_asv.py b/asv_bench/vbench_to_asv.py deleted file mode 100644 index c3041ec2b1ba1..0000000000000 --- a/asv_bench/vbench_to_asv.py +++ /dev/null @@ -1,163 +0,0 @@ -import ast -import vbench -import os -import sys -import astor -import glob - - -def vbench_to_asv_source(bench, kinds=None): - tab = ' ' * 4 - if kinds is None: - kinds = ['time'] - - output = 'class {}(object):\n'.format(bench.name) - output += tab + 'goal_time = 0.2\n\n' - - if bench.setup: - indented_setup = [tab * 2 + '{}\n'.format(x) for x in bench.setup.splitlines()] - output += tab + 'def setup(self):\n' + ''.join(indented_setup) + '\n' - - for kind in kinds: - output += tab + 'def {}_{}(self):\n'.format(kind, bench.name) - for line in bench.code.splitlines(): - output += tab * 2 + line + '\n' - output += '\n\n' - - if bench.cleanup: - output += tab + 'def teardown(self):\n' + tab * 2 + bench.cleanup - - output += '\n\n' - return output - - -class AssignToSelf(ast.NodeTransformer): - def __init__(self): - super(AssignToSelf, self).__init__() - self.transforms = {} - self.imports = [] - - self.in_class_define = False - self.in_setup = False - - def visit_ClassDef(self, node): - self.transforms = {} - 
self.in_class_define = True - - functions_to_promote = [] - setup_func = None - - for class_func in ast.iter_child_nodes(node): - if isinstance(class_func, ast.FunctionDef): - if class_func.name == 'setup': - setup_func = class_func - for anon_func in ast.iter_child_nodes(class_func): - if isinstance(anon_func, ast.FunctionDef): - functions_to_promote.append(anon_func) - - if setup_func: - for func in functions_to_promote: - setup_func.body.remove(func) - func.args.args.insert(0, ast.Name(id='self', ctx=ast.Load())) - node.body.append(func) - self.transforms[func.name] = 'self.' + func.name - - ast.fix_missing_locations(node) - - self.generic_visit(node) - - return node - - def visit_TryExcept(self, node): - if any([isinstance(x, (ast.Import, ast.ImportFrom)) for x in node.body]): - self.imports.append(node) - else: - self.generic_visit(node) - return node - - def visit_Assign(self, node): - for target in node.targets: - if isinstance(target, ast.Name) and not isinstance(target.ctx, ast.Param) and not self.in_class_define: - self.transforms[target.id] = 'self.' 
+ target.id - self.generic_visit(node) - - return node - - def visit_Name(self, node): - new_node = node - if node.id in self.transforms: - if not isinstance(node.ctx, ast.Param): - new_node = ast.Attribute(value=ast.Name(id='self', ctx=node.ctx), attr=node.id, ctx=node.ctx) - - self.generic_visit(node) - - return ast.copy_location(new_node, node) - - def visit_Import(self, node): - self.imports.append(node) - - def visit_ImportFrom(self, node): - self.imports.append(node) - - def visit_FunctionDef(self, node): - """Delete functions that are empty due to imports being moved""" - self.in_class_define = False - - self.generic_visit(node) - - if node.body: - return node - - -def translate_module(target_module): - g_vars = {} - l_vars = {} - exec('import ' + target_module) in g_vars - - print target_module - module = eval(target_module, g_vars) - - benchmarks = [] - for obj_str in dir(module): - obj = getattr(module, obj_str) - if isinstance(obj, vbench.benchmark.Benchmark): - benchmarks.append(obj) - - if not benchmarks: - return - - rewritten_output = '' - for bench in benchmarks: - rewritten_output += vbench_to_asv_source(bench) - - with open('rewrite.py', 'w') as f: - f.write(rewritten_output) - - ast_module = ast.parse(rewritten_output) - - transformer = AssignToSelf() - transformed_module = transformer.visit(ast_module) - - unique_imports = {astor.to_source(node): node for node in transformer.imports} - - transformed_module.body = unique_imports.values() + transformed_module.body - - transformed_source = astor.to_source(transformed_module) - - with open('benchmarks/{}.py'.format(target_module), 'w') as f: - f.write(transformed_source) - - -if __name__ == '__main__': - cwd = os.getcwd() - new_dir = os.path.join(os.path.dirname(__file__), '../vb_suite') - sys.path.insert(0, new_dir) - - for module in glob.glob(os.path.join(new_dir, '*.py')): - mod = os.path.basename(module) - if mod in ['make.py', 'measure_memory_consumption.py', 'perf_HEAD.py', 'run_suite.py', 
'test_perf.py', 'generate_rst_files.py', 'test.py', 'suite.py']: - continue - print - print mod - - translate_module(mod.replace('.py', '')) diff --git a/azure-pipelines.yml b/azure-pipelines.yml new file mode 100644 index 0000000000000..f0567d76659b6 --- /dev/null +++ b/azure-pipelines.yml @@ -0,0 +1,119 @@ +# Adapted from https://github.com/numba/numba/blob/master/azure-pipelines.yml +jobs: +# Mac and Linux use the same template +- template: ci/azure/posix.yml + parameters: + name: macOS + vmImage: xcode9-macos10.13 +- template: ci/azure/posix.yml + parameters: + name: Linux + vmImage: ubuntu-16.04 + +# Windows Python 2.7 needs VC 9.0 installed, handled in the template +- template: ci/azure/windows.yml + parameters: + name: Windows + vmImage: vs2017-win2016 + +- job: 'Checks_and_doc' + pool: + vmImage: ubuntu-16.04 + timeoutInMinutes: 90 + steps: + - script: | + # XXX next command should avoid redefining the path in every step, but + # made the process crash as it couldn't find deactivate + #echo '##vso[task.prependpath]$HOME/miniconda3/bin' + echo '##vso[task.setvariable variable=CONDA_ENV]pandas-dev' + echo '##vso[task.setvariable variable=ENV_FILE]environment.yml' + echo '##vso[task.setvariable variable=AZURE]true' + displayName: 'Setting environment variables' + + # Do not require a conda environment + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + ci/code_checks.sh patterns + displayName: 'Looking for unwanted patterns' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + sudo apt-get install -y libc6-dev-i386 + ci/incremental/install_miniconda.sh + ci/incremental/setup_conda_environment.sh + displayName: 'Set up environment' + condition: true + + # Do not require pandas + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh lint + displayName: 'Linting' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh 
dependencies + displayName: 'Dependencies consistency' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/incremental/build.sh + displayName: 'Build' + condition: true + + # Require pandas + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh code + displayName: 'Checks on imported code' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh doctests + displayName: 'Running doctests' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/code_checks.sh docstrings + displayName: 'Docstring validation' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + pytest --capture=no --strict scripts + displayName: 'Testing docstring validation script' + condition: true + + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + git remote add upstream https://github.com/pandas-dev/pandas.git + git fetch upstream + if git diff upstream/master --name-only | grep -q "^asv_bench/"; then + cd asv_bench + asv machine --yes + ASV_OUTPUT="$(asv dev)" + if [[ $(echo "$ASV_OUTPUT" | grep "failed") ]]; then + echo "##vso[task.logissue type=error]Benchmarks run with errors" + echo "$ASV_OUTPUT" + exit 1 + else + echo "Benchmarks run without errors" + fi + else + echo "Benchmarks did not run, no changes detected" + fi + displayName: 'Running benchmarks' + condition: true diff --git a/bench/alignment.py b/bench/alignment.py deleted file mode 100644 index bc3134f597ee0..0000000000000 --- a/bench/alignment.py +++ /dev/null @@ -1,22 +0,0 @@ -# Setup -from pandas.compat import range, lrange -import numpy as np -import pandas -import la -N = 1000 -K = 50 -arr1 = np.random.randn(N, K) -arr2 = np.random.randn(N, K) -idx1 = lrange(N) -idx2 = lrange(K) - -# pandas -dma1 = 
pandas.DataFrame(arr1, idx1, idx2) -dma2 = pandas.DataFrame(arr2, idx1[::-1], idx2[::-1]) - -# larry -lar1 = la.larry(arr1, [idx1, idx2]) -lar2 = la.larry(arr2, [idx1[::-1], idx2[::-1]]) - -for i in range(100): - result = lar1 + lar2 diff --git a/bench/bench_dense_to_sparse.py b/bench/bench_dense_to_sparse.py deleted file mode 100644 index e1dcd3456e88d..0000000000000 --- a/bench/bench_dense_to_sparse.py +++ /dev/null @@ -1,14 +0,0 @@ -from pandas import * - -K = 100 -N = 100000 -rng = DatetimeIndex('1/1/2000', periods=N, offset=datetools.Minute()) - -rng2 = np.asarray(rng).astype('M8[us]').astype('i8') - -series = {} -for i in range(1, K + 1): - data = np.random.randn(N)[:-i] - this_rng = rng2[:-i] - data[100:] = np.nan - series[i] = SparseSeries(data, index=this_rng) diff --git a/bench/bench_get_put_value.py b/bench/bench_get_put_value.py deleted file mode 100644 index 427e0b1b10a22..0000000000000 --- a/bench/bench_get_put_value.py +++ /dev/null @@ -1,56 +0,0 @@ -from pandas import * -from pandas.util.testing import rands -from pandas.compat import range - -N = 1000 -K = 50 - - -def _random_index(howmany): - return Index([rands(10) for _ in range(howmany)]) - -df = DataFrame(np.random.randn(N, K), index=_random_index(N), - columns=_random_index(K)) - - -def get1(): - for col in df.columns: - for row in df.index: - _ = df[col][row] - - -def get2(): - for col in df.columns: - for row in df.index: - _ = df.get_value(row, col) - - -def put1(): - for col in df.columns: - for row in df.index: - df[col][row] = 0 - - -def put2(): - for col in df.columns: - for row in df.index: - df.set_value(row, col, 0) - - -def resize1(): - buf = DataFrame() - for col in df.columns: - for row in df.index: - buf = buf.set_value(row, col, 5.) - return buf - - -def resize2(): - from collections import defaultdict - - buf = defaultdict(dict) - for col in df.columns: - for row in df.index: - buf[col][row] = 5. 
- - return DataFrame(buf) diff --git a/bench/bench_groupby.py b/bench/bench_groupby.py deleted file mode 100644 index d7a2853e1e7b2..0000000000000 --- a/bench/bench_groupby.py +++ /dev/null @@ -1,66 +0,0 @@ -from pandas import * -from pandas.util.testing import rands -from pandas.compat import range - -import string -import random - -k = 20000 -n = 10 - -foo = np.tile(np.array([rands(10) for _ in range(k)], dtype='O'), n) -foo2 = list(foo) -random.shuffle(foo) -random.shuffle(foo2) - -df = DataFrame({'A': foo, - 'B': foo2, - 'C': np.random.randn(n * k)}) - -import pandas._sandbox as sbx - - -def f(): - table = sbx.StringHashTable(len(df)) - ret = table.factorize(df['A']) - return ret - - -def g(): - table = sbx.PyObjectHashTable(len(df)) - ret = table.factorize(df['A']) - return ret - -ret = f() - -""" -import pandas._tseries as lib - -f = np.std - - -grouped = df.groupby(['A', 'B']) - -label_list = [ping.labels for ping in grouped.groupings] -shape = [len(ping.ids) for ping in grouped.groupings] - -from pandas.core.groupby import get_group_index - - -group_index = get_group_index(label_list, shape, - sort=True, xnull=True).astype('i4') - -ngroups = np.prod(shape) - -indexer = lib.groupsort_indexer(group_index, ngroups) - -values = df['C'].values.take(indexer) -group_index = group_index.take(indexer) - -f = lambda x: x.std(ddof=1) - -grouper = lib.Grouper(df['C'], np.ndarray.std, group_index, ngroups) -result = grouper.get_result() - -expected = grouped.std() -""" diff --git a/bench/bench_join_panel.py b/bench/bench_join_panel.py deleted file mode 100644 index f3c3f8ba15f70..0000000000000 --- a/bench/bench_join_panel.py +++ /dev/null @@ -1,85 +0,0 @@ -# reasonably efficient - - -def create_panels_append(cls, panels): - """ return an append list of panels """ - panels = [a for a in panels if a is not None] - # corner cases - if len(panels) == 0: - return None - elif len(panels) == 1: - return panels[0] - elif len(panels) == 2 and panels[0] == panels[1]: - return 
panels[0] - # import pdb; pdb.set_trace() - # create a joint index for the axis - - def joint_index_for_axis(panels, axis): - s = set() - for p in panels: - s.update(list(getattr(p, axis))) - return sorted(list(s)) - - def reindex_on_axis(panels, axis, axis_reindex): - new_axis = joint_index_for_axis(panels, axis) - new_panels = [p.reindex(**{axis_reindex: new_axis, - 'copy': False}) for p in panels] - return new_panels, new_axis - # create the joint major index, dont' reindex the sub-panels - we are - # appending - major = joint_index_for_axis(panels, 'major_axis') - # reindex on minor axis - panels, minor = reindex_on_axis(panels, 'minor_axis', 'minor') - # reindex on items - panels, items = reindex_on_axis(panels, 'items', 'items') - # concatenate values - try: - values = np.concatenate([p.values for p in panels], axis=1) - except Exception as detail: - raise Exception("cannot append values that dont' match dimensions! -> [%s] %s" - % (','.join(["%s" % p for p in panels]), str(detail))) - # pm('append - create_panel') - p = Panel(values, items=items, major_axis=major, - minor_axis=minor) - # pm('append - done') - return p - - -# does the job but inefficient (better to handle like you read a table in -# pytables...e.g create a LongPanel then convert to Wide) -def create_panels_join(cls, panels): - """ given an array of panels's, create a single panel """ - panels = [a for a in panels if a is not None] - # corner cases - if len(panels) == 0: - return None - elif len(panels) == 1: - return panels[0] - elif len(panels) == 2 and panels[0] == panels[1]: - return panels[0] - d = dict() - minor, major, items = set(), set(), set() - for panel in panels: - items.update(panel.items) - major.update(panel.major_axis) - minor.update(panel.minor_axis) - values = panel.values - for item, item_index in panel.items.indexMap.items(): - for minor_i, minor_index in panel.minor_axis.indexMap.items(): - for major_i, major_index in panel.major_axis.indexMap.items(): - try: - 
d[(minor_i, major_i, item)] = values[item_index, major_index, minor_index] - except: - pass - # stack the values - minor = sorted(list(minor)) - major = sorted(list(major)) - items = sorted(list(items)) - # create the 3d stack (items x columns x indicies) - data = np.dstack([np.asarray([np.asarray([d.get((minor_i, major_i, item), np.nan) - for item in items]) - for major_i in major]).transpose() - for minor_i in minor]) - # construct the panel - return Panel(data, items, major, minor) -add_class_method(Panel, create_panels_join, 'join_many') diff --git a/bench/bench_khash_dict.py b/bench/bench_khash_dict.py deleted file mode 100644 index 054fc36131b65..0000000000000 --- a/bench/bench_khash_dict.py +++ /dev/null @@ -1,89 +0,0 @@ -""" -Some comparisons of khash.h to Python dict -""" -from __future__ import print_function - -import numpy as np -import os - -from vbench.api import Benchmark -from pandas.util.testing import rands -from pandas.compat import range -import pandas._tseries as lib -import pandas._sandbox as sbx -import time - -import psutil - -pid = os.getpid() -proc = psutil.Process(pid) - - -def object_test_data(n): - pass - - -def string_test_data(n): - return np.array([rands(10) for _ in range(n)], dtype='O') - - -def int_test_data(n): - return np.arange(n, dtype='i8') - -N = 1000000 - -#---------------------------------------------------------------------- -# Benchmark 1: map_locations - - -def map_locations_python_object(): - arr = string_test_data(N) - return _timeit(lambda: lib.map_indices_object(arr)) - - -def map_locations_khash_object(): - arr = string_test_data(N) - - def f(): - table = sbx.PyObjectHashTable(len(arr)) - table.map_locations(arr) - return _timeit(f) - - -def _timeit(f, iterations=10): - start = time.time() - for _ in range(iterations): - foo = f() - elapsed = time.time() - start - return elapsed - -#---------------------------------------------------------------------- -# Benchmark 2: lookup_locations - - -def 
lookup_python(values): - table = lib.map_indices_object(values) - return _timeit(lambda: lib.merge_indexer_object(values, table)) - - -def lookup_khash(values): - table = sbx.PyObjectHashTable(len(values)) - table.map_locations(values) - locs = table.lookup_locations(values) - # elapsed = _timeit(lambda: table.lookup_locations2(values)) - return table - - -def leak(values): - for _ in range(100): - print(proc.get_memory_info()) - table = lookup_khash(values) - # table.destroy() - -arr = string_test_data(N) - -#---------------------------------------------------------------------- -# Benchmark 3: unique - -#---------------------------------------------------------------------- -# Benchmark 4: factorize diff --git a/bench/bench_merge.R b/bench/bench_merge.R deleted file mode 100644 index 3ed4618494857..0000000000000 --- a/bench/bench_merge.R +++ /dev/null @@ -1,161 +0,0 @@ -library(plyr) -library(data.table) -N <- 10000 -indices = rep(NA, N) -indices2 = rep(NA, N) -for (i in 1:N) { - indices[i] <- paste(sample(letters, 10), collapse="") - indices2[i] <- paste(sample(letters, 10), collapse="") -} -left <- data.frame(key=rep(indices[1:8000], 10), - key2=rep(indices2[1:8000], 10), - value=rnorm(80000)) -right <- data.frame(key=indices[2001:10000], - key2=indices2[2001:10000], - value2=rnorm(8000)) - -right2 <- data.frame(key=rep(right$key, 2), - key2=rep(right$key2, 2), - value2=rnorm(16000)) - -left.dt <- data.table(left, key=c("key", "key2")) -right.dt <- data.table(right, key=c("key", "key2")) -right2.dt <- data.table(right2, key=c("key", "key2")) - -# left.dt2 <- data.table(left) -# right.dt2 <- data.table(right) - -## left <- data.frame(key=rep(indices[1:1000], 10), -## key2=rep(indices2[1:1000], 10), -## value=rnorm(100000)) -## right <- data.frame(key=indices[1:1000], -## key2=indices2[1:1000], -## value2=rnorm(10000)) - -timeit <- function(func, niter=10) { - timing = rep(NA, niter) - for (i in 1:niter) { - gc() - timing[i] <- system.time(func())[3] - } - 
mean(timing) -} - -left.join <- function(sort=FALSE) { - result <- base::merge(left, right, all.x=TRUE, sort=sort) -} - -right.join <- function(sort=FALSE) { - result <- base::merge(left, right, all.y=TRUE, sort=sort) -} - -outer.join <- function(sort=FALSE) { - result <- base::merge(left, right, all=TRUE, sort=sort) -} - -inner.join <- function(sort=FALSE) { - result <- base::merge(left, right, all=FALSE, sort=sort) -} - -left.join.dt <- function(sort=FALSE) { - result <- right.dt[left.dt] -} - -right.join.dt <- function(sort=FALSE) { - result <- left.dt[right.dt] -} - -outer.join.dt <- function(sort=FALSE) { - result <- merge(left.dt, right.dt, all=TRUE, sort=sort) -} - -inner.join.dt <- function(sort=FALSE) { - result <- merge(left.dt, right.dt, all=FALSE, sort=sort) -} - -plyr.join <- function(type) { - result <- plyr::join(left, right, by=c("key", "key2"), - type=type, match="first") -} - -sort.options <- c(FALSE, TRUE) - -# many-to-one - -results <- matrix(nrow=4, ncol=3) -colnames(results) <- c("base::merge", "plyr", "data.table") -rownames(results) <- c("inner", "outer", "left", "right") - -base.functions <- c(inner.join, outer.join, left.join, right.join) -plyr.functions <- c(function() plyr.join("inner"), - function() plyr.join("full"), - function() plyr.join("left"), - function() plyr.join("right")) -dt.functions <- c(inner.join.dt, outer.join.dt, left.join.dt, right.join.dt) -for (i in 1:4) { - base.func <- base.functions[[i]] - plyr.func <- plyr.functions[[i]] - dt.func <- dt.functions[[i]] - results[i, 1] <- timeit(base.func) - results[i, 2] <- timeit(plyr.func) - results[i, 3] <- timeit(dt.func) -} - - -# many-to-many - -left.join <- function(sort=FALSE) { - result <- base::merge(left, right2, all.x=TRUE, sort=sort) -} - -right.join <- function(sort=FALSE) { - result <- base::merge(left, right2, all.y=TRUE, sort=sort) -} - -outer.join <- function(sort=FALSE) { - result <- base::merge(left, right2, all=TRUE, sort=sort) -} - -inner.join <- 
function(sort=FALSE) { - result <- base::merge(left, right2, all=FALSE, sort=sort) -} - -left.join.dt <- function(sort=FALSE) { - result <- right2.dt[left.dt] -} - -right.join.dt <- function(sort=FALSE) { - result <- left.dt[right2.dt] -} - -outer.join.dt <- function(sort=FALSE) { - result <- merge(left.dt, right2.dt, all=TRUE, sort=sort) -} - -inner.join.dt <- function(sort=FALSE) { - result <- merge(left.dt, right2.dt, all=FALSE, sort=sort) -} - -sort.options <- c(FALSE, TRUE) - -# many-to-one - -results <- matrix(nrow=4, ncol=3) -colnames(results) <- c("base::merge", "plyr", "data.table") -rownames(results) <- c("inner", "outer", "left", "right") - -base.functions <- c(inner.join, outer.join, left.join, right.join) -plyr.functions <- c(function() plyr.join("inner"), - function() plyr.join("full"), - function() plyr.join("left"), - function() plyr.join("right")) -dt.functions <- c(inner.join.dt, outer.join.dt, left.join.dt, right.join.dt) -for (i in 1:4) { - base.func <- base.functions[[i]] - plyr.func <- plyr.functions[[i]] - dt.func <- dt.functions[[i]] - results[i, 1] <- timeit(base.func) - results[i, 2] <- timeit(plyr.func) - results[i, 3] <- timeit(dt.func) -} - diff --git a/bench/bench_merge.py b/bench/bench_merge.py deleted file mode 100644 index 330dba7b9af69..0000000000000 --- a/bench/bench_merge.py +++ /dev/null @@ -1,105 +0,0 @@ -import random -import gc -import time -from pandas import * -from pandas.compat import range, lrange, StringIO -from pandas.util.testing import rands - -N = 10000 -ngroups = 10 - - -def get_test_data(ngroups=100, n=N): - unique_groups = lrange(ngroups) - arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object) - - if len(arr) < n: - arr = np.asarray(list(arr) + unique_groups[:n - len(arr)], - dtype=object) - - random.shuffle(arr) - return arr - -# aggregate multiple columns -# df = DataFrame({'key1' : get_test_data(ngroups=ngroups), -# 'key2' : get_test_data(ngroups=ngroups), -# 'data1' : np.random.randn(N), -# 
'data2' : np.random.randn(N)}) - -# df2 = DataFrame({'key1' : get_test_data(ngroups=ngroups, n=N//10), -# 'key2' : get_test_data(ngroups=ngroups//2, n=N//10), -# 'value' : np.random.randn(N // 10)}) -# result = merge.merge(df, df2, on='key2') - -N = 10000 - -indices = np.array([rands(10) for _ in range(N)], dtype='O') -indices2 = np.array([rands(10) for _ in range(N)], dtype='O') -key = np.tile(indices[:8000], 10) -key2 = np.tile(indices2[:8000], 10) - -left = DataFrame({'key': key, 'key2': key2, - 'value': np.random.randn(80000)}) -right = DataFrame({'key': indices[2000:], 'key2': indices2[2000:], - 'value2': np.random.randn(8000)}) - -right2 = right.append(right, ignore_index=True) - - -join_methods = ['inner', 'outer', 'left', 'right'] -results = DataFrame(index=join_methods, columns=[False, True]) -niter = 10 -for sort in [False, True]: - for join_method in join_methods: - f = lambda: merge(left, right, how=join_method, sort=sort) - gc.disable() - start = time.time() - for _ in range(niter): - f() - elapsed = (time.time() - start) / niter - gc.enable() - results[sort][join_method] = elapsed -# results.columns = ['pandas'] -results.columns = ['dont_sort', 'sort'] - - -# R results -# many to one -r_results = read_table(StringIO(""" base::merge plyr data.table -inner 0.2475 0.1183 0.1100 -outer 0.4213 0.1916 0.2090 -left 0.2998 0.1188 0.0572 -right 0.3102 0.0536 0.0376 -"""), sep='\s+') - -presults = results[['dont_sort']].rename(columns={'dont_sort': 'pandas'}) -all_results = presults.join(r_results) - -all_results = all_results.div(all_results['pandas'], axis=0) - -all_results = all_results.ix[:, ['pandas', 'data.table', 'plyr', - 'base::merge']] - -sort_results = DataFrame.from_items([('pandas', results['sort']), - ('R', r_results['base::merge'])]) -sort_results['Ratio'] = sort_results['R'] / sort_results['pandas'] - - -nosort_results = DataFrame.from_items([('pandas', results['dont_sort']), - ('R', r_results['base::merge'])]) -nosort_results['Ratio'] = 
nosort_results['R'] / nosort_results['pandas'] - -# many to many - -# many to one -r_results = read_table(StringIO("""base::merge plyr data.table -inner 0.4610 0.1276 0.1269 -outer 0.9195 0.1881 0.2725 -left 0.6559 0.1257 0.0678 -right 0.6425 0.0522 0.0428 -"""), sep='\s+') - -all_results = presults.join(r_results) -all_results = all_results.div(all_results['pandas'], axis=0) -all_results = all_results.ix[:, ['pandas', 'data.table', 'plyr', - 'base::merge']] diff --git a/bench/bench_merge_sqlite.py b/bench/bench_merge_sqlite.py deleted file mode 100644 index 3ad4b810119c3..0000000000000 --- a/bench/bench_merge_sqlite.py +++ /dev/null @@ -1,87 +0,0 @@ -import numpy as np -from collections import defaultdict -import gc -import time -from pandas import DataFrame -from pandas.util.testing import rands -from pandas.compat import range, zip -import random - -N = 10000 - -indices = np.array([rands(10) for _ in range(N)], dtype='O') -indices2 = np.array([rands(10) for _ in range(N)], dtype='O') -key = np.tile(indices[:8000], 10) -key2 = np.tile(indices2[:8000], 10) - -left = DataFrame({'key': key, 'key2': key2, - 'value': np.random.randn(80000)}) -right = DataFrame({'key': indices[2000:], 'key2': indices2[2000:], - 'value2': np.random.randn(8000)}) - -# right2 = right.append(right, ignore_index=True) -# right = right2 - -# random.shuffle(key2) -# indices2 = indices.copy() -# random.shuffle(indices2) - -# Prepare Database -import sqlite3 -create_sql_indexes = True - -conn = sqlite3.connect(':memory:') -conn.execute( - 'create table left( key varchar(10), key2 varchar(10), value int);') -conn.execute( - 'create table right( key varchar(10), key2 varchar(10), value2 int);') -conn.executemany('insert into left values (?, ?, ?)', - zip(key, key2, left['value'])) -conn.executemany('insert into right values (?, ?, ?)', - zip(right['key'], right['key2'], right['value2'])) - -# Create Indices -if create_sql_indexes: - conn.execute('create index left_ix on left(key, key2)') - 
conn.execute('create index right_ix on right(key, key2)') - - -join_methods = ['inner', 'left outer', 'left'] # others not supported -sql_results = DataFrame(index=join_methods, columns=[False]) -niter = 5 -for sort in [False]: - for join_method in join_methods: - sql = """CREATE TABLE test as select * - from left - %s join right - on left.key=right.key - and left.key2 = right.key2;""" % join_method - sql = """select * - from left - %s join right - on left.key=right.key - and left.key2 = right.key2;""" % join_method - - if sort: - sql = '%s order by key, key2' % sql - f = lambda: list(conn.execute(sql)) # list fetches results - g = lambda: conn.execute(sql) # list fetches results - gc.disable() - start = time.time() - # for _ in range(niter): - g() - elapsed = (time.time() - start) / niter - gc.enable() - - cur = conn.execute("DROP TABLE test") - conn.commit() - - sql_results[sort][join_method] = elapsed - sql_results.columns = ['sqlite3'] # ['dont_sort', 'sort'] - sql_results.index = ['inner', 'outer', 'left'] - - sql = """select * - from left - inner join right - on left.key=right.key - and left.key2 = right.key2;""" diff --git a/bench/bench_pivot.R b/bench/bench_pivot.R deleted file mode 100644 index 06dc6a105bc43..0000000000000 --- a/bench/bench_pivot.R +++ /dev/null @@ -1,27 +0,0 @@ -library(reshape2) - - -n <- 100000 -a.size <- 5 -b.size <- 5 - -data <- data.frame(a=sample(letters[1:a.size], n, replace=T), - b=sample(letters[1:b.size], n, replace=T), - c=rnorm(n), - d=rnorm(n)) - -timings <- numeric() - -# acast(melt(data, id=c("a", "b")), a ~ b, mean) -# acast(melt(data, id=c("a", "b")), a + b ~ variable, mean) - -for (i in 1:10) { - gc() - tim <- system.time(acast(melt(data, id=c("a", "b")), a ~ b, mean, - subset=.(variable=="c"))) - timings[i] = tim[3] -} - -mean(timings) - -acast(melt(data, id=c("a", "b")), a ~ b, mean, subset=.(variable="c")) diff --git a/bench/bench_pivot.py b/bench/bench_pivot.py deleted file mode 100644 index 
007bd0aaebc2f..0000000000000 --- a/bench/bench_pivot.py +++ /dev/null @@ -1,16 +0,0 @@ -from pandas import * -import string - - -n = 100000 -asize = 5 -bsize = 5 - -letters = np.asarray(list(string.letters), dtype=object) - -data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)], - bar=letters[:bsize][np.random.randint(0, bsize, n)], - baz=np.random.randn(n), - qux=np.random.randn(n))) - -table = pivot_table(data, xby=['foo', 'bar']) diff --git a/bench/bench_take_indexing.py b/bench/bench_take_indexing.py deleted file mode 100644 index 5fb584bcfe45f..0000000000000 --- a/bench/bench_take_indexing.py +++ /dev/null @@ -1,55 +0,0 @@ -from __future__ import print_function -import numpy as np - -from pandas import * -import pandas._tseries as lib - -from pandas import DataFrame -import timeit -from pandas.compat import zip - -setup = """ -from pandas import Series -import pandas._tseries as lib -import random -import numpy as np - -import random -n = %d -k = %d -arr = np.random.randn(n, k) -indexer = np.arange(n, dtype=np.int32) -indexer = indexer[::-1] -""" - -sizes = [100, 1000, 10000, 100000] -iters = [1000, 1000, 100, 1] - -fancy_2d = [] -take_2d = [] -cython_2d = [] - -n = 1000 - - -def _timeit(stmt, size, k=5, iters=1000): - timer = timeit.Timer(stmt=stmt, setup=setup % (sz, k)) - return timer.timeit(n) / n - -for sz, its in zip(sizes, iters): - print(sz) - fancy_2d.append(_timeit('arr[indexer]', sz, iters=its)) - take_2d.append(_timeit('arr.take(indexer, axis=0)', sz, iters=its)) - cython_2d.append(_timeit('lib.take_axis0(arr, indexer)', sz, iters=its)) - -df = DataFrame({'fancy': fancy_2d, - 'take': take_2d, - 'cython': cython_2d}) - -print(df) - -from pandas.rpy.common import r -r('mat <- matrix(rnorm(50000), nrow=10000, ncol=5)') -r('set.seed(12345') -r('indexer <- sample(1:10000)') -r('mat[indexer,]') diff --git a/bench/bench_unique.py b/bench/bench_unique.py deleted file mode 100644 index 87bd2f2df586c..0000000000000 --- 
a/bench/bench_unique.py +++ /dev/null @@ -1,278 +0,0 @@ -from __future__ import print_function -from pandas import * -from pandas.util.testing import rands -from pandas.compat import range, zip -import pandas._tseries as lib -import numpy as np -import matplotlib.pyplot as plt - -N = 50000 -K = 10000 - -groups = np.array([rands(10) for _ in range(K)], dtype='O') -groups2 = np.array([rands(10) for _ in range(K)], dtype='O') - -labels = np.tile(groups, N // K) -labels2 = np.tile(groups2, N // K) -data = np.random.randn(N) - - -def timeit(f, niter): - import gc - import time - gc.disable() - start = time.time() - for _ in range(niter): - f() - elapsed = (time.time() - start) / niter - gc.enable() - return elapsed - - -def algo1(): - unique_labels = np.unique(labels) - result = np.empty(len(unique_labels)) - for i, label in enumerate(unique_labels): - result[i] = data[labels == label].sum() - - -def algo2(): - unique_labels = np.unique(labels) - indices = lib.groupby_indices(labels) - result = np.empty(len(unique_labels)) - - for i, label in enumerate(unique_labels): - result[i] = data.take(indices[label]).sum() - - -def algo3_nosort(): - rizer = lib.DictFactorizer() - labs, counts = rizer.factorize(labels, sort=False) - k = len(rizer.uniques) - out = np.empty(k) - lib.group_add(out, counts, data, labs) - - -def algo3_sort(): - rizer = lib.DictFactorizer() - labs, counts = rizer.factorize(labels, sort=True) - k = len(rizer.uniques) - out = np.empty(k) - lib.group_add(out, counts, data, labs) - -import numpy as np -import random - - -# dict to hold results -counts = {} - -# a hack to generate random key, value pairs. 
-# 5k keys, 100k values -x = np.tile(np.arange(5000, dtype='O'), 20) -random.shuffle(x) -xarr = x -x = [int(y) for y in x] -data = np.random.uniform(0, 1, 100000) - - -def f(): - # groupby sum - for k, v in zip(x, data): - try: - counts[k] += v - except KeyError: - counts[k] = v - - -def f2(): - rizer = lib.DictFactorizer() - labs, counts = rizer.factorize(xarr, sort=False) - k = len(rizer.uniques) - out = np.empty(k) - lib.group_add(out, counts, data, labs) - - -def algo4(): - rizer = lib.DictFactorizer() - labs1, _ = rizer.factorize(labels, sort=False) - k1 = len(rizer.uniques) - - rizer = lib.DictFactorizer() - labs2, _ = rizer.factorize(labels2, sort=False) - k2 = len(rizer.uniques) - - group_id = labs1 * k2 + labs2 - max_group = k1 * k2 - - if max_group > 1e6: - rizer = lib.Int64Factorizer(len(group_id)) - group_id, _ = rizer.factorize(group_id.astype('i8'), sort=True) - max_group = len(rizer.uniques) - - out = np.empty(max_group) - counts = np.zeros(max_group, dtype='i4') - lib.group_add(out, counts, data, group_id) - -# cumtime percall filename:lineno(function) -# 0.592 0.592 :1() - # 0.584 0.006 groupby_ex.py:37(algo3_nosort) - # 0.535 0.005 {method 'factorize' of DictFactorizer' objects} - # 0.047 0.000 {pandas._tseries.group_add} - # 0.002 0.000 numeric.py:65(zeros_like) - # 0.001 0.000 {method 'fill' of 'numpy.ndarray' objects} - # 0.000 0.000 {numpy.core.multiarray.empty_like} - # 0.000 0.000 {numpy.core.multiarray.empty} - -# UNIQUE timings - -# N = 10000000 -# K = 500000 - -# groups = np.array([rands(10) for _ in range(K)], dtype='O') - -# labels = np.tile(groups, N // K) -data = np.random.randn(N) - -data = np.random.randn(N) - -Ks = [100, 1000, 5000, 10000, 25000, 50000, 100000] - -# Ks = [500000, 1000000, 2500000, 5000000, 10000000] - -import psutil -import os -import gc - -pid = os.getpid() -proc = psutil.Process(pid) - - -def dict_unique(values, expected_K, sort=False, memory=False): - if memory: - gc.collect() - before_mem = 
proc.get_memory_info().rss - - rizer = lib.DictFactorizer() - result = rizer.unique_int64(values) - - if memory: - result = proc.get_memory_info().rss - before_mem - return result - - if sort: - result.sort() - assert(len(result) == expected_K) - return result - - -def khash_unique(values, expected_K, size_hint=False, sort=False, - memory=False): - if memory: - gc.collect() - before_mem = proc.get_memory_info().rss - - if size_hint: - rizer = lib.Factorizer(len(values)) - else: - rizer = lib.Factorizer(100) - - result = [] - result = rizer.unique(values) - - if memory: - result = proc.get_memory_info().rss - before_mem - return result - - if sort: - result.sort() - assert(len(result) == expected_K) - - -def khash_unique_str(values, expected_K, size_hint=False, sort=False, - memory=False): - if memory: - gc.collect() - before_mem = proc.get_memory_info().rss - - if size_hint: - rizer = lib.StringHashTable(len(values)) - else: - rizer = lib.StringHashTable(100) - - result = [] - result = rizer.unique(values) - - if memory: - result = proc.get_memory_info().rss - before_mem - return result - - if sort: - result.sort() - assert(len(result) == expected_K) - - -def khash_unique_int64(values, expected_K, size_hint=False, sort=False): - if size_hint: - rizer = lib.Int64HashTable(len(values)) - else: - rizer = lib.Int64HashTable(100) - - result = [] - result = rizer.unique(values) - - if sort: - result.sort() - assert(len(result) == expected_K) - - -def hash_bench(): - numpy = [] - dict_based = [] - dict_based_sort = [] - khash_hint = [] - khash_nohint = [] - for K in Ks: - print(K) - # groups = np.array([rands(10) for _ in range(K)]) - # labels = np.tile(groups, N // K).astype('O') - - groups = np.random.randint(0, long(100000000000), size=K) - labels = np.tile(groups, N // K) - dict_based.append(timeit(lambda: dict_unique(labels, K), 20)) - khash_nohint.append(timeit(lambda: khash_unique_int64(labels, K), 20)) - khash_hint.append(timeit(lambda: khash_unique_int64(labels, 
K, - size_hint=True), 20)) - - # memory, hard to get - # dict_based.append(np.mean([dict_unique(labels, K, memory=True) - # for _ in range(10)])) - # khash_nohint.append(np.mean([khash_unique(labels, K, memory=True) - # for _ in range(10)])) - # khash_hint.append(np.mean([khash_unique(labels, K, size_hint=True, memory=True) - # for _ in range(10)])) - - # dict_based_sort.append(timeit(lambda: dict_unique(labels, K, - # sort=True), 10)) - # numpy.append(timeit(lambda: np.unique(labels), 10)) - - # unique_timings = DataFrame({'numpy.unique' : numpy, - # 'dict, no sort' : dict_based, - # 'dict, sort' : dict_based_sort}, - # columns=['dict, no sort', - # 'dict, sort', 'numpy.unique'], - # index=Ks) - - unique_timings = DataFrame({'dict': dict_based, - 'khash, preallocate': khash_hint, - 'khash': khash_nohint}, - columns=['khash, preallocate', 'khash', 'dict'], - index=Ks) - - unique_timings.plot(kind='bar', legend=False) - plt.legend(loc='best') - plt.title('Unique on 100,000 values, int64') - plt.xlabel('Number of unique labels') - plt.ylabel('Mean execution time') - - plt.show() diff --git a/bench/bench_with_subset.R b/bench/bench_with_subset.R deleted file mode 100644 index 69d0f7a9eec63..0000000000000 --- a/bench/bench_with_subset.R +++ /dev/null @@ -1,53 +0,0 @@ -library(microbenchmark) -library(data.table) - - -data.frame.subset.bench <- function (n=1e7, times=30) { - df <- data.frame(a=rnorm(n), b=rnorm(n), c=rnorm(n)) - print(microbenchmark(subset(df, a <= b & b <= (c ^ 2 + b ^ 2 - a) & b > c), - times=times)) -} - - -# data.table allows something very similar to query with an expression -# but we have chained comparisons AND we're faster BOO YAH! 
-data.table.subset.expression.bench <- function (n=1e7, times=30) { - dt <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n)) - print(microbenchmark(dt[, a <= b & b <= (c ^ 2 + b ^ 2 - a) & b > c], - times=times)) -} - - -# compare against subset with data.table for good measure -data.table.subset.bench <- function (n=1e7, times=30) { - dt <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n)) - print(microbenchmark(subset(dt, a <= b & b <= (c ^ 2 + b ^ 2 - a) & b > c), - times=times)) -} - - -data.frame.with.bench <- function (n=1e7, times=30) { - df <- data.frame(a=rnorm(n), b=rnorm(n), c=rnorm(n)) - - print(microbenchmark(with(df, a + b * (c ^ 2 + b ^ 2 - a) / (a * c) ^ 3), - times=times)) -} - - -data.table.with.bench <- function (n=1e7, times=30) { - dt <- data.table(a=rnorm(n), b=rnorm(n), c=rnorm(n)) - print(microbenchmark(with(dt, a + b * (c ^ 2 + b ^ 2 - a) / (a * c) ^ 3), - times=times)) -} - - -bench <- function () { - data.frame.subset.bench() - data.table.subset.expression.bench() - data.table.subset.bench() - data.frame.with.bench() - data.table.with.bench() -} - - -bench() diff --git a/bench/bench_with_subset.py b/bench/bench_with_subset.py deleted file mode 100644 index 017401df3f7f3..0000000000000 --- a/bench/bench_with_subset.py +++ /dev/null @@ -1,116 +0,0 @@ -#!/usr/bin/env python - -""" -Microbenchmarks for comparison with R's "with" and "subset" functions -""" - -from __future__ import print_function -import numpy as np -from numpy import array -from timeit import repeat as timeit -from pandas.compat import range, zip -from pandas import DataFrame - - -setup_common = """from pandas import DataFrame -from numpy.random import randn -df = DataFrame(randn(%d, 3), columns=list('abc')) -%s""" - - -setup_with = "s = 'a + b * (c ** 2 + b ** 2 - a) / (a * c) ** 3'" - - -def bench_with(n, times=10, repeat=3, engine='numexpr'): - return np.array(timeit('df.eval(s, engine=%r)' % engine, - setup=setup_common % (n, setup_with), - repeat=repeat, number=times)) / 
times - - -setup_subset = "s = 'a <= b <= c ** 2 + b ** 2 - a and b > c'" - - -def bench_subset(n, times=10, repeat=3, engine='numexpr'): - return np.array(timeit('df.query(s, engine=%r)' % engine, - setup=setup_common % (n, setup_subset), - repeat=repeat, number=times)) / times - - -def bench(mn=1, mx=7, num=100, engines=('python', 'numexpr'), verbose=False): - r = np.logspace(mn, mx, num=num).round().astype(int) - - ev = DataFrame(np.empty((num, len(engines))), columns=engines) - qu = ev.copy(deep=True) - - ev['size'] = qu['size'] = r - - for engine in engines: - for i, n in enumerate(r): - if verbose: - print('engine: %r, i == %d' % (engine, i)) - ev.loc[i, engine] = bench_with(n, times=1, repeat=1, engine=engine) - qu.loc[i, engine] = bench_subset(n, times=1, repeat=1, - engine=engine) - - return ev, qu - - -def plot_perf(df, engines, title, filename=None): - from matplotlib.pyplot import figure, rc - - try: - from mpltools import style - except ImportError: - pass - else: - style.use('ggplot') - - rc('text', usetex=True) - - fig = figure(figsize=(4, 3), dpi=100) - ax = fig.add_subplot(111) - - for engine in engines: - ax.plot(df.size, df[engine], label=engine, lw=2) - - ax.set_xlabel('Number of Rows') - ax.set_ylabel('Time (s)') - ax.set_title(title) - ax.legend(loc='best') - ax.tick_params(top=False, right=False) - - fig.tight_layout() - - if filename is not None: - fig.savefig(filename) - - -if __name__ == '__main__': - import os - import pandas as pd - - pandas_dir = os.path.dirname(os.path.abspath(os.path.dirname(__file__))) - static_path = os.path.join(pandas_dir, 'doc', 'source', '_static') - - join = lambda p: os.path.join(static_path, p) - - fn = join('eval-query-perf-data.h5') - - engines = 'python', 'numexpr' - - if not os.path.exists(fn): - ev, qu = bench(verbose=True) - ev.to_hdf(fn, 'eval') - qu.to_hdf(fn, 'query') - else: - ev = pd.read_hdf(fn, 'eval') - qu = pd.read_hdf(fn, 'query') - - plot_perf(ev, engines, 'DataFrame.eval()', 
filename=join('eval-perf.png')) - plot_perf(qu, engines, 'DataFrame.query()', - filename=join('query-perf.png')) - - plot_perf(ev[ev.size <= 50000], engines, 'DataFrame.eval()', - filename=join('eval-perf-small.png')) - plot_perf(qu[qu.size <= 500000], engines, 'DataFrame.query()', - filename=join('query-perf-small.png')) diff --git a/bench/better_unique.py b/bench/better_unique.py deleted file mode 100644 index e03a4f433ce66..0000000000000 --- a/bench/better_unique.py +++ /dev/null @@ -1,80 +0,0 @@ -from __future__ import print_function -from pandas import DataFrame -from pandas.compat import range, zip -import timeit - -setup = """ -from pandas import Series -import pandas._tseries as _tseries -from pandas.compat import range -import random -import numpy as np - -def better_unique(values): - uniques = _tseries.fast_unique(values) - id_map = _tseries.map_indices_buf(uniques) - labels = _tseries.get_unique_labels(values, id_map) - return uniques, labels - -tot = 100000 - -def get_test_data(ngroups=100, n=tot): - unique_groups = range(ngroups) - random.shuffle(unique_groups) - arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object) - - if len(arr) < n: - arr = np.asarray(list(arr) + unique_groups[:n - len(arr)], - dtype=object) - - return arr - -arr = get_test_data(ngroups=%d) -""" - -group_sizes = [10, 100, 1000, 10000, - 20000, 30000, 40000, - 50000, 60000, 70000, - 80000, 90000, 100000] - -numbers = [100, 100, 50] + [10] * 10 - -numpy = [] -wes = [] - -for sz, n in zip(group_sizes, numbers): - # wes_timer = timeit.Timer(stmt='better_unique(arr)', - # setup=setup % sz) - wes_timer = timeit.Timer(stmt='_tseries.fast_unique(arr)', - setup=setup % sz) - - numpy_timer = timeit.Timer(stmt='np.unique(arr)', - setup=setup % sz) - - print(n) - numpy_result = numpy_timer.timeit(number=n) / n - wes_result = wes_timer.timeit(number=n) / n - - print('Groups: %d, NumPy: %s, Wes: %s' % (sz, numpy_result, wes_result)) - - wes.append(wes_result) - 
numpy.append(numpy_result) - -result = DataFrame({'wes': wes, 'numpy': numpy}, index=group_sizes) - - -def make_plot(numpy, wes): - pass - -# def get_test_data(ngroups=100, n=100000): -# unique_groups = range(ngroups) -# random.shuffle(unique_groups) -# arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object) - -# if len(arr) < n: -# arr = np.asarray(list(arr) + unique_groups[:n - len(arr)], -# dtype=object) - -# return arr - -# arr = get_test_data(ngroups=1000) diff --git a/bench/duplicated.R b/bench/duplicated.R deleted file mode 100644 index eb2376df2932a..0000000000000 --- a/bench/duplicated.R +++ /dev/null @@ -1,22 +0,0 @@ -N <- 100000 - -k1 = rep(NA, N) -k2 = rep(NA, N) -for (i in 1:N){ - k1[i] <- paste(sample(letters, 1), collapse="") - k2[i] <- paste(sample(letters, 1), collapse="") -} -df <- data.frame(a=k1, b=k2, c=rep(1:100, N / 100)) -df2 <- data.frame(a=k1, b=k2) - -timings <- numeric() -timings2 <- numeric() -for (i in 1:50) { - gc() - timings[i] = system.time(deduped <- df[!duplicated(df),])[3] - gc() - timings2[i] = system.time(deduped <- df[!duplicated(df[,c("a", "b")]),])[3] -} - -mean(timings) -mean(timings2) diff --git a/bench/io_roundtrip.py b/bench/io_roundtrip.py deleted file mode 100644 index d87da0ec6321a..0000000000000 --- a/bench/io_roundtrip.py +++ /dev/null @@ -1,116 +0,0 @@ -from __future__ import print_function -import time -import os -import numpy as np - -import la -import pandas -from pandas.compat import range -from pandas import datetools, DatetimeIndex - - -def timeit(f, iterations): - start = time.clock() - - for i in range(iterations): - f() - - return time.clock() - start - - -def rountrip_archive(N, K=50, iterations=10): - # Create data - arr = np.random.randn(N, K) - # lar = la.larry(arr) - dma = pandas.DataFrame(arr, - DatetimeIndex('1/1/2000', periods=N, - offset=datetools.Minute())) - dma[201] = 'bar' - - # filenames - filename_numpy = '/Users/wesm/tmp/numpy.npz' - filename_larry = 
'/Users/wesm/tmp/archive.hdf5' - filename_pandas = '/Users/wesm/tmp/pandas_tmp' - - # Delete old files - try: - os.unlink(filename_numpy) - except: - pass - try: - os.unlink(filename_larry) - except: - pass - - try: - os.unlink(filename_pandas) - except: - pass - - # Time a round trip save and load - # numpy_f = lambda: numpy_roundtrip(filename_numpy, arr, arr) - # numpy_time = timeit(numpy_f, iterations) / iterations - - # larry_f = lambda: larry_roundtrip(filename_larry, lar, lar) - # larry_time = timeit(larry_f, iterations) / iterations - - pandas_f = lambda: pandas_roundtrip(filename_pandas, dma, dma) - pandas_time = timeit(pandas_f, iterations) / iterations - print('pandas (HDF5) %7.4f seconds' % pandas_time) - - pickle_f = lambda: pandas_roundtrip(filename_pandas, dma, dma) - pickle_time = timeit(pickle_f, iterations) / iterations - print('pandas (pickle) %7.4f seconds' % pickle_time) - - # print('Numpy (npz) %7.4f seconds' % numpy_time) - # print('larry (HDF5) %7.4f seconds' % larry_time) - - # Delete old files - try: - os.unlink(filename_numpy) - except: - pass - try: - os.unlink(filename_larry) - except: - pass - - try: - os.unlink(filename_pandas) - except: - pass - - -def numpy_roundtrip(filename, arr1, arr2): - np.savez(filename, arr1=arr1, arr2=arr2) - npz = np.load(filename) - arr1 = npz['arr1'] - arr2 = npz['arr2'] - - -def larry_roundtrip(filename, lar1, lar2): - io = la.IO(filename) - io['lar1'] = lar1 - io['lar2'] = lar2 - lar1 = io['lar1'] - lar2 = io['lar2'] - - -def pandas_roundtrip(filename, dma1, dma2): - # What's the best way to code this? 
- from pandas.io.pytables import HDFStore - store = HDFStore(filename) - store['dma1'] = dma1 - store['dma2'] = dma2 - dma1 = store['dma1'] - dma2 = store['dma2'] - - -def pandas_roundtrip_pickle(filename, dma1, dma2): - dma1.save(filename) - dma1 = pandas.DataFrame.load(filename) - dma2.save(filename) - dma2 = pandas.DataFrame.load(filename) - -if __name__ == '__main__': - rountrip_archive(10000, K=200) diff --git a/bench/serialize.py b/bench/serialize.py deleted file mode 100644 index b0edd6a5752d2..0000000000000 --- a/bench/serialize.py +++ /dev/null @@ -1,89 +0,0 @@ -from __future__ import print_function -from pandas.compat import range, lrange -import time -import os -import numpy as np - -import la -import pandas - - -def timeit(f, iterations): - start = time.clock() - - for i in range(iterations): - f() - - return time.clock() - start - - -def roundtrip_archive(N, iterations=10): - - # Create data - arr = np.random.randn(N, N) - lar = la.larry(arr) - dma = pandas.DataFrame(arr, lrange(N), lrange(N)) - - # filenames - filename_numpy = '/Users/wesm/tmp/numpy.npz' - filename_larry = '/Users/wesm/tmp/archive.hdf5' - filename_pandas = '/Users/wesm/tmp/pandas_tmp' - - # Delete old files - try: - os.unlink(filename_numpy) - except: - pass - try: - os.unlink(filename_larry) - except: - pass - try: - os.unlink(filename_pandas) - except: - pass - - # Time a round trip save and load - numpy_f = lambda: numpy_roundtrip(filename_numpy, arr, arr) - numpy_time = timeit(numpy_f, iterations) / iterations - - larry_f = lambda: larry_roundtrip(filename_larry, lar, lar) - larry_time = timeit(larry_f, iterations) / iterations - - pandas_f = lambda: pandas_roundtrip(filename_pandas, dma, dma) - pandas_time = timeit(pandas_f, iterations) / iterations - - print('Numpy (npz) %7.4f seconds' % numpy_time) - print('larry (HDF5) %7.4f seconds' % larry_time) - print('pandas (HDF5) %7.4f seconds' % pandas_time) - - -def numpy_roundtrip(filename, arr1, arr2): - np.savez(filename, 
arr1=arr1, arr2=arr2) - npz = np.load(filename) - arr1 = npz['arr1'] - arr2 = npz['arr2'] - - -def larry_roundtrip(filename, lar1, lar2): - io = la.IO(filename) - io['lar1'] = lar1 - io['lar2'] = lar2 - lar1 = io['lar1'] - lar2 = io['lar2'] - - -def pandas_roundtrip(filename, dma1, dma2): - from pandas.io.pytables import HDFStore - store = HDFStore(filename) - store['dma1'] = dma1 - store['dma2'] = dma2 - dma1 = store['dma1'] - dma2 = store['dma2'] - - -def pandas_roundtrip_pickle(filename, dma1, dma2): - dma1.save(filename) - dma1 = pandas.DataFrame.load(filename) - dma2.save(filename) - dma2 = pandas.DataFrame.load(filename) diff --git a/bench/test.py b/bench/test.py deleted file mode 100644 index 2339deab313a1..0000000000000 --- a/bench/test.py +++ /dev/null @@ -1,70 +0,0 @@ -import numpy as np -import itertools -import collections -import scipy.ndimage as ndi -from pandas.compat import zip, range - -N = 10000 - -lat = np.random.randint(0, 360, N) -lon = np.random.randint(0, 360, N) -data = np.random.randn(N) - - -def groupby1(lat, lon, data): - indexer = np.lexsort((lon, lat)) - lat = lat.take(indexer) - lon = lon.take(indexer) - sorted_data = data.take(indexer) - - keys = 1000. 
* lat + lon - unique_keys = np.unique(keys) - bounds = keys.searchsorted(unique_keys) - - result = group_agg(sorted_data, bounds, lambda x: x.mean()) - - decoder = keys.searchsorted(unique_keys) - - return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result)) - - -def group_mean(lat, lon, data): - indexer = np.lexsort((lon, lat)) - lat = lat.take(indexer) - lon = lon.take(indexer) - sorted_data = data.take(indexer) - - keys = 1000 * lat + lon - unique_keys = np.unique(keys) - - result = ndi.mean(sorted_data, labels=keys, index=unique_keys) - decoder = keys.searchsorted(unique_keys) - - return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result)) - - -def group_mean_naive(lat, lon, data): - grouped = collections.defaultdict(list) - for lt, ln, da in zip(lat, lon, data): - grouped[(lt, ln)].append(da) - - averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items()) - - return averaged - - -def group_agg(values, bounds, f): - N = len(values) - result = np.empty(len(bounds), dtype=float) - for i, left_bound in enumerate(bounds): - if i == len(bounds) - 1: - right_bound = N - else: - right_bound = bounds[i + 1] - - result[i] = f(values[left_bound: right_bound]) - - return result - -# for i in range(10): -# groupby1(lat, lon, data) diff --git a/bench/zoo_bench.R b/bench/zoo_bench.R deleted file mode 100644 index 294d55f51a9ab..0000000000000 --- a/bench/zoo_bench.R +++ /dev/null @@ -1,71 +0,0 @@ -library(zoo) -library(xts) -library(fts) -library(tseries) -library(its) -library(xtable) - -## indices = rep(NA, 100000) -## for (i in 1:100000) -## indices[i] <- paste(sample(letters, 10), collapse="") - - - -## x <- zoo(rnorm(100000), indices) -## y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)]) - -## indices <- as.POSIXct(1:100000) - -indices <- as.POSIXct(Sys.Date()) + seq(1, 100000000, 100) - -sz <- 500000 - -## x <- xts(rnorm(sz), sample(indices, sz)) -## y <- xts(rnorm(sz), sample(indices, sz)) - -zoo.bench <- function(){ - x <- 
zoo(rnorm(sz), sample(indices, sz)) - y <- zoo(rnorm(sz), sample(indices, sz)) - timeit(function() {x + y}) -} - -xts.bench <- function(){ - x <- xts(rnorm(sz), sample(indices, sz)) - y <- xts(rnorm(sz), sample(indices, sz)) - timeit(function() {x + y}) -} - -fts.bench <- function(){ - x <- fts(rnorm(sz), sort(sample(indices, sz))) - y <- fts(rnorm(sz), sort(sample(indices, sz)) - timeit(function() {x + y}) -} - -its.bench <- function(){ - x <- its(rnorm(sz), sort(sample(indices, sz))) - y <- its(rnorm(sz), sort(sample(indices, sz))) - timeit(function() {x + y}) -} - -irts.bench <- function(){ - x <- irts(sort(sample(indices, sz)), rnorm(sz)) - y <- irts(sort(sample(indices, sz)), rnorm(sz)) - timeit(function() {x + y}) -} - -timeit <- function(f){ - timings <- numeric() - for (i in 1:10) { - gc() - timings[i] = system.time(f())[3] - } - mean(timings) -} - -bench <- function(){ - results <- c(xts.bench(), fts.bench(), its.bench(), zoo.bench()) - names <- c("xts", "fts", "its", "zoo") - data.frame(results, names) -} - -result <- bench() diff --git a/bench/zoo_bench.py b/bench/zoo_bench.py deleted file mode 100644 index 74cb1952a5a2a..0000000000000 --- a/bench/zoo_bench.py +++ /dev/null @@ -1,36 +0,0 @@ -from pandas import * -from pandas.util.testing import rands - -n = 1000000 -# indices = Index([rands(10) for _ in xrange(n)]) - - -def sample(values, k): - sampler = np.random.permutation(len(values)) - return values.take(sampler[:k]) -sz = 500000 -rng = np.arange(0, 10000000000000, 10000000) -stamps = np.datetime64(datetime.now()).view('i8') + rng -idx1 = np.sort(sample(stamps, sz)) -idx2 = np.sort(sample(stamps, sz)) -ts1 = Series(np.random.randn(sz), idx1) -ts2 = Series(np.random.randn(sz), idx2) - - -# subsample_size = 90000 - -# x = Series(np.random.randn(100000), indices) -# y = Series(np.random.randn(subsample_size), -# index=sample(indices, subsample_size)) - - -# lx = larry(np.random.randn(100000), [list(indices)]) -# ly = 
larry(np.random.randn(subsample_size), [list(y.index)])
-
-# Benchmark 1: Two 1-million length time series (int64-based index) with
-# randomly chosen timestamps
-
-# Benchmark 2: Join two 5-variate time series DataFrames (outer and inner join)
-
-# df1 = DataFrame(np.random.randn(1000000, 5), idx1, columns=range(5))
-# df2 = DataFrame(np.random.randn(1000000, 5), idx2, columns=range(5, 10))
diff --git a/ci/README.txt b/ci/README.txt
deleted file mode 100644
index bb71dc25d6093..0000000000000
--- a/ci/README.txt
+++ /dev/null
@@ -1,17 +0,0 @@
-Travis is a ci service that's well-integrated with GitHub.
-The following types of breakage should be detected
-by Travis builds:
-
-1) Failing tests on any supported version of Python.
-2) Pandas should install and the tests should run if no optional deps are installed.
-That also means tests which rely on optional deps need to raise SkipTest()
-if the dep is missing.
-3) unicode related fails when running under exotic locales.
-
-We tried running the vbench suite for a while, but with varying load
-on Travis machines, that wasn't useful.
-
-Travis currently (4/2013) has a 5-job concurrency limit. Exceeding it
-basically doubles the total runtime for a commit through travis, and
-since dep+pandas installation is already quite long, this should become
-a hard limit on concurrent travis runs.
diff --git a/ci/azure/posix.yml b/ci/azure/posix.yml new file mode 100644 index 0000000000000..b9e0cd0b9258c --- /dev/null +++ b/ci/azure/posix.yml @@ -0,0 +1,100 @@ +parameters: + name: '' + vmImage: '' + +jobs: +- job: ${{ parameters.name }} + pool: + vmImage: ${{ parameters.vmImage }} + strategy: + matrix: + ${{ if eq(parameters.name, 'macOS') }}: + py35_np_120: + ENV_FILE: ci/deps/azure-macos-35.yaml + CONDA_PY: "35" + PATTERN: "not slow and not network" + + ${{ if eq(parameters.name, 'Linux') }}: + py27_np_120: + ENV_FILE: ci/deps/azure-27-compat.yaml + CONDA_PY: "27" + PATTERN: "not slow and not network" + + py27_locale_slow_old_np: + ENV_FILE: ci/deps/azure-27-locale.yaml + CONDA_PY: "27" + PATTERN: "slow" + LOCALE_OVERRIDE: "zh_CN.UTF-8" + EXTRA_APT: "language-pack-zh-hans" + + py36_locale_slow: + ENV_FILE: ci/deps/azure-36-locale_slow.yaml + CONDA_PY: "36" + PATTERN: "not slow and not network" + LOCALE_OVERRIDE: "it_IT.UTF-8" + + py37_locale: + ENV_FILE: ci/deps/azure-37-locale.yaml + CONDA_PY: "37" + PATTERN: "not slow and not network" + LOCALE_OVERRIDE: "zh_CN.UTF-8" + + py37_np_dev: + ENV_FILE: ci/deps/azure-37-numpydev.yaml + CONDA_PY: "37" + PATTERN: "not slow and not network" + TEST_ARGS: "-W error" + PANDAS_TESTING_MODE: "deprecate" + EXTRA_APT: "xsel" + + steps: + - script: | + if [ "$(uname)" == "Linux" ]; then sudo apt-get install -y libc6-dev-i386 $EXTRA_APT; fi + echo "Installing Miniconda" + ci/incremental/install_miniconda.sh + export PATH=$HOME/miniconda3/bin:$PATH + echo "Setting up Conda environment" + ci/incremental/setup_conda_environment.sh + displayName: 'Before Install' + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/incremental/build.sh + displayName: 'Build' + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev + ci/run_tests.sh + displayName: 'Test' + - script: | + export PATH=$HOME/miniconda3/bin:$PATH + source activate pandas-dev && pushd /tmp && python -c 
"import pandas; pandas.show_versions();" && popd + - task: PublishTestResults@2 + inputs: + testResultsFiles: 'test-data-*.xml' + testRunTitle: ${{ format('{0}-$(CONDA_PY)', parameters.name) }} + - powershell: | + $junitXml = "test-data-single.xml" + $(Get-Content $junitXml | Out-String) -match 'failures="(.*?)"' + if ($matches[1] -eq 0) + { + Write-Host "No test failures in test-data-single" + } + else + { + # note that this will produce $LASTEXITCODE=1 + Write-Error "$($matches[1]) tests failed" + } + + $junitXmlMulti = "test-data-multiple.xml" + $(Get-Content $junitXmlMulti | Out-String) -match 'failures="(.*?)"' + if ($matches[1] -eq 0) + { + Write-Host "No test failures in test-data-multi" + } + else + { + # note that this will produce $LASTEXITCODE=1 + Write-Error "$($matches[1]) tests failed" + } + displayName: Check for test failures diff --git a/ci/azure/windows.yml b/ci/azure/windows.yml new file mode 100644 index 0000000000000..cece002024936 --- /dev/null +++ b/ci/azure/windows.yml @@ -0,0 +1,59 @@ +parameters: + name: '' + vmImage: '' + +jobs: +- job: ${{ parameters.name }} + pool: + vmImage: ${{ parameters.vmImage }} + strategy: + matrix: + py36_np14: + ENV_FILE: ci/deps/azure-windows-36.yaml + CONDA_PY: "36" + + py27_np121: + ENV_FILE: ci/deps/azure-windows-27.yaml + CONDA_PY: "27" + + steps: + - task: CondaEnvironment@1 + inputs: + updateConda: no + packageSpecs: '' + + - powershell: | + $wc = New-Object net.webclient + $wc.Downloadfile("https://download.microsoft.com/download/7/9/6/796EF2E4-801B-4FC4-AB28-B59FBF6D907B/VCForPython27.msi", "VCForPython27.msi") + Start-Process "VCForPython27.msi" /qn -Wait + displayName: 'Install VC 9.0 only for Python 2.7' + condition: eq(variables.CONDA_PY, '27') + + - script: | + ci\\incremental\\setup_conda_environment.cmd + displayName: 'Before Install' + - script: | + call activate pandas-dev + ci\\incremental\\build.cmd + displayName: 'Build' + - script: | + call activate pandas-dev + pytest -m "not slow and not 
network" --junitxml=test-data.xml pandas -n 2 -r sxX --strict --durations=10 %* + displayName: 'Test' + - task: PublishTestResults@2 + inputs: + testResultsFiles: 'test-data.xml' + testRunTitle: 'Windows-$(CONDA_PY)' + - powershell: | + $junitXml = "test-data.xml" + $(Get-Content $junitXml | Out-String) -match 'failures="(.*?)"' + if ($matches[1] -eq 0) + { + Write-Host "No test failures in test-data" + } + else + { + # note that this will produce $LASTEXITCODE=1 + Write-Error "$($matches[1]) tests failed" + } + displayName: Check for test failures diff --git a/ci/before_install_travis.sh b/ci/before_install_travis.sh deleted file mode 100755 index f90427f97d3b7..0000000000000 --- a/ci/before_install_travis.sh +++ /dev/null @@ -1,15 +0,0 @@ -#!/bin/bash - -# If envars.sh determined we're running in an authorized fork -# and the user opted in to the network cache,and that cached versions -# are available on the cache server, download and deploy the cached -# files to the local filesystem - -echo "inside $0" - -# overview -if [ "${TRAVIS_OS_NAME}" == "linux" ]; then - sh -e /etc/init.d/xvfb start -fi - -true # never fail because bad things happened here diff --git a/ci/before_script_travis.sh b/ci/before_script_travis.sh new file mode 100755 index 0000000000000..0b3939b1906a2 --- /dev/null +++ b/ci/before_script_travis.sh @@ -0,0 +1,11 @@ +#!/bin/bash + +echo "inside $0" + +if [ "${TRAVIS_OS_NAME}" == "linux" ]; then + sh -e /etc/init.d/xvfb start + sleep 3 +fi + +# Never fail because bad things happened here. +true diff --git a/ci/build_docs.sh b/ci/build_docs.sh index 1356d097025c9..bf22f0764144c 100755 --- a/ci/build_docs.sh +++ b/ci/build_docs.sh @@ -1,31 +1,19 @@ #!/bin/bash +set -e + if [ "${TRAVIS_OS_NAME}" != "linux" ]; then echo "not doing build_docs on non-linux" exit 0 fi -cd "$TRAVIS_BUILD_DIR" +cd "$TRAVIS_BUILD_DIR"/doc echo "inside $0" -git show --pretty="format:" --name-only HEAD~5.. --first-parent | grep -P "rst|txt|doc" - -if [ "$?" 
!= "0" ]; then - echo "Skipping doc build, none were modified" - # nope, skip docs build - exit 0 -fi - - if [ "$DOC" ]; then echo "Will build docs" - source activate pandas - - mv "$TRAVIS_BUILD_DIR"/doc /tmp - cd /tmp/doc - echo ############################### echo # Log file for the doc build # echo ############################### @@ -37,24 +25,32 @@ if [ "$DOC" ]; then echo # Create and send docs # echo ######################## - cd /tmp/doc/build/html - git config --global user.email "pandas-docs-bot@localhost.foo" - git config --global user.name "pandas-docs-bot" - git config --global credential.helper cache - - # create the repo - git init - touch README - git add README - git commit -m "Initial commit" --allow-empty - git branch gh-pages - git checkout gh-pages - touch .nojekyll - git add --all . - git commit -m "Version" --allow-empty - git remote remove origin - git remote add origin "https://${PANDAS_GH_TOKEN}@github.com/pandas-docs/pandas-docs-travis.git" - git push origin gh-pages -f + echo "Only uploading docs when TRAVIS_PULL_REQUEST is 'false'" + echo "TRAVIS_PULL_REQUEST: ${TRAVIS_PULL_REQUEST}" + + if [ "${TRAVIS_PULL_REQUEST}" == "false" ]; then + cd build/html + git config --global user.email "pandas-docs-bot@localhost.foo" + git config --global user.name "pandas-docs-bot" + + # create the repo + git init + + touch README + git add README + git commit -m "Initial commit" --allow-empty + git branch gh-pages + git checkout gh-pages + touch .nojekyll + git add --all . + git commit -m "Version" --allow-empty + + git remote add origin "https://${PANDAS_GH_TOKEN}@github.com/pandas-dev/pandas-docs-travis.git" + git fetch origin + git remote -v + + git push origin gh-pages -f + fi fi exit 0 diff --git a/ci/code_checks.sh b/ci/code_checks.sh new file mode 100755 index 0000000000000..c4840f1e836c4 --- /dev/null +++ b/ci/code_checks.sh @@ -0,0 +1,259 @@ +#!/bin/bash +# +# Run checks related to code quality. 
+# +# This script is intended for both the CI and to check locally that code standards are +# respected. We are currently linting (PEP-8 and similar), looking for patterns of +# common mistakes (sphinx directives with missing blank lines, old style classes, +# unwanted imports...), we run doctests here (currently some files only), and we +# validate formatting errors in docstrings. +# +# Usage: +# $ ./ci/code_checks.sh # run all checks +# $ ./ci/code_checks.sh lint # run linting only +# $ ./ci/code_checks.sh patterns # check for patterns that should not exist +# $ ./ci/code_checks.sh code # checks on imported code +# $ ./ci/code_checks.sh doctests # run doctests +# $ ./ci/code_checks.sh docstrings # validate docstring errors +# $ ./ci/code_checks.sh dependencies # check that dependencies are consistent + +[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "code" || "$1" == "doctests" || "$1" == "docstrings" || "$1" == "dependencies" ]] || \ + { echo "Unknown command $1. Usage: $0 [lint|patterns|code|doctests|docstrings|dependencies]"; exit 9999; } + +BASE_DIR="$(dirname $0)/.." +RET=0 +CHECK=$1 + +function invgrep { + # grep with inverse exit status and formatting for azure-pipelines + # + # This function works exactly as grep, but with opposite exit status: + # - 0 (success) when no patterns are found + # - 1 (fail) when the patterns are found + # + # This is useful for the CI, as we want to fail if one of the patterns + # that we want to avoid is found by grep. + if [[ "$AZURE" == "true" ]]; then + set -o pipefail + grep -n "$@" | awk -F ":" '{print "##vso[task.logissue type=error;sourcepath=" $1 ";linenumber=" $2 ";] Found unwanted pattern: " $3}' + else + grep "$@" + fi + return $((! 
$?)) +} + +if [[ "$AZURE" == "true" ]]; then + FLAKE8_FORMAT="##vso[task.logissue type=error;sourcepath=%(path)s;linenumber=%(row)s;columnnumber=%(col)s;code=%(code)s;]%(text)s" +else + FLAKE8_FORMAT="default" +fi + +### LINTING ### +if [[ -z "$CHECK" || "$CHECK" == "lint" ]]; then + + # `setup.cfg` contains the list of error codes that are being ignored in flake8 + + echo "flake8 --version" + flake8 --version + + # pandas/_libs/src is C code, so no need to search there. + MSG='Linting .py code' ; echo $MSG + flake8 --format="$FLAKE8_FORMAT" . + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Linting .pyx code' ; echo $MSG + flake8 --format="$FLAKE8_FORMAT" pandas --filename=*.pyx --select=E501,E302,E203,E111,E114,E221,E303,E128,E231,E126,E265,E305,E301,E127,E261,E271,E129,W291,E222,E241,E123,F403,C400,C401,C402,C403,C404,C405,C406,C407,C408,C409,C410,C411 + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Linting .pxd and .pxi.in' ; echo $MSG + flake8 --format="$FLAKE8_FORMAT" pandas/_libs --filename=*.pxi.in,*.pxd --select=E501,E302,E203,E111,E114,E221,E303,E231,E126,F403 + RET=$(($RET + $?)) ; echo $MSG "DONE" + + echo "flake8-rst --version" + flake8-rst --version + + MSG='Linting code-blocks in .rst documentation' ; echo $MSG + flake8-rst doc/source --filename=*.rst --format="$FLAKE8_FORMAT" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check that cython casting is of the form `<type>obj` as opposed to `<type> obj`; + # it doesn't make a difference, but we want to be internally consistent. 
+ # Note: this grep pattern is (intended to be) equivalent to the python + # regex r'(?<![ ->])> ' + MSG='Linting .pyx code for spacing conventions in casting' ; echo $MSG + invgrep -r -E --include '*.pyx' --include '*.pxi.in' '[a-zA-Z0-9*]> ' pandas/_libs + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # readability/casting: Warnings about C casting instead of C++ casting + # runtime/int: Warnings about using C number types instead of C++ ones + # build/include_subdir: Warnings about prefacing included header files with directory + + # We don't lint all C files because we don't want to lint any that are built + # from Cython files nor do we want to lint C files that we didn't modify for + # this particular codebase (e.g. src/headers, src/klib, src/msgpack). However, + # we can lint all header files since they aren't "generated" like C files are. + MSG='Linting .c and .h' ; echo $MSG + cpplint --quiet --extensions=c,h --headers=h --recursive --filter=-readability/casting,-runtime/int,-build/include_subdir pandas/_libs/src/*.h pandas/_libs/src/parser pandas/_libs/ujson pandas/_libs/tslibs/src/datetime pandas/io/msgpack pandas/_libs/*.cpp pandas/util + RET=$(($RET + $?)) ; echo $MSG "DONE" + + echo "isort --version-number" + isort --version-number + + # Imports - Check formatting using isort see setup.cfg for settings + MSG='Check import format using isort ' ; echo $MSG + isort --recursive --check-only pandas asv_bench + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### PATTERNS ### +if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then + + # Check for imports from pandas.core.common instead of `import pandas.core.common as com` + MSG='Check for non-standard imports' ; echo $MSG + invgrep -R --include="*.py*" -E "from pandas.core.common import " pandas + # invgrep -R --include="*.py*" -E "from numpy import nan " pandas # GH#24822 not yet implemented since the offending imports have not all been removed + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for pytest warns' ; 
echo $MSG + invgrep -r -E --include '*.py' 'pytest\.warns' pandas/tests/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check for the following code in testing: `np.testing` and `np.array_equal` + MSG='Check for invalid testing' ; echo $MSG + invgrep -r -E --include '*.py' --exclude testing.py '(numpy|np)(\.testing|\.array_equal)' pandas/tests/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check for the following code in the extension array base tests: `tm.assert_frame_equal` and `tm.assert_series_equal` + MSG='Check for invalid EA testing' ; echo $MSG + invgrep -r -E --include '*.py' --exclude base.py 'tm.assert_(series|frame)_equal' pandas/tests/extension/base + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for deprecated messages without sphinx directive' ; echo $MSG + invgrep -R --include="*.py" --include="*.pyx" -E "(DEPRECATED|DEPRECATE|Deprecated)(:|,|\.)" pandas + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for old-style classes' ; echo $MSG + invgrep -R --include="*.py" -E "class\s\S*[^)]:" pandas scripts + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for backticks incorrectly rendering because of missing spaces' ; echo $MSG + invgrep -R --include="*.rst" -E "[a-zA-Z0-9]\`\`?[a-zA-Z0-9]" doc/source/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for incorrect sphinx directives' ; echo $MSG + invgrep -R --include="*.py" --include="*.pyx" --include="*.rst" -E "\.\. 
(autosummary|contents|currentmodule|deprecated|function|image|important|include|ipython|literalinclude|math|module|note|raw|seealso|toctree|versionadded|versionchanged|warning):[^:]" ./pandas ./doc/source + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check that the deprecated `assert_raises_regex` is not used (`pytest.raises(match=pattern)` should be used instead)' ; echo $MSG + invgrep -R --exclude=*.pyc --exclude=testing.py --exclude=test_util.py assert_raises_regex pandas + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check for the following code in testing: `unittest.mock`, `mock.Mock()` or `mock.patch` + MSG='Check that unittest.mock is not used (pytest builtin monkeypatch fixture should be used instead)' ; echo $MSG + invgrep -r -E --include '*.py' '(unittest(\.| import )mock|mock\.Mock\(\)|mock\.patch)' pandas/tests/ + RET=$(($RET + $?)) ; echo $MSG "DONE" + + # Check that we use pytest.raises only as a context manager + # + # For any flake8-compliant code, the only way this regex gets + # matched is if there is no "with" statement preceding "pytest.raises" + MSG='Check for pytest.raises as context manager (a line starting with `pytest.raises` is invalid, needs a `with` to precede it)' ; echo $MSG + MSG='TODO: This check is currently skipped because so many files fail this. Please enable when all are corrected (xref gh-24332)' ; echo $MSG + # invgrep -R --include '*.py' -E '[[:space:]] pytest.raises' pandas/tests + # RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for wrong space after code-block directive and before colon (".. code-block ::" instead of ".. code-block::")' ; echo $MSG + invgrep -R --include="*.rst" ".. code-block ::" doc/source + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check for wrong space after ipython directive and before colon (".. ipython ::" instead of ".. ipython::")' ; echo $MSG + invgrep -R --include="*.rst" ".. 
ipython ::" doc/source + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Check that no file in the repo contains trailing whitespaces' ; echo $MSG + set -o pipefail + if [[ "$AZURE" == "true" ]]; then + # we exclude all c/cpp files as the c/cpp files of pandas code base are tested when Linting .c and .h files + ! grep -n '--exclude=*.'{svg,c,cpp,html} -RI "\s$" * | awk -F ":" '{print "##vso[task.logissue type=error;sourcepath=" $1 ";linenumber=" $2 ";] Trailing whitespaces found: " $3}' + else + ! grep -n '--exclude=*.'{svg,c,cpp,html} -RI "\s$" * | awk -F ":" '{print $1 ":" $2 ":Trailing whitespaces found: " $3}' + fi + RET=$(($RET + $?)) ; echo $MSG "DONE" +fi + +### CODE ### +if [[ -z "$CHECK" || "$CHECK" == "code" ]]; then + + MSG='Check import. No warnings, and blacklist some optional dependencies' ; echo $MSG + python -W error -c " +import sys +import pandas + +blacklist = {'bs4', 'gcsfs', 'html5lib', 'ipython', 'jinja2', 'hypothesis', + 'lxml', 'numexpr', 'openpyxl', 'py', 'pytest', 's3fs', 'scipy', + 'tables', 'xlrd', 'xlsxwriter', 'xlwt'} +mods = blacklist & set(m.split('.')[0] for m in sys.modules) +if mods: + sys.stderr.write('err: pandas should not import: {}\n'.format(', '.join(mods))) + sys.exit(len(mods)) + " + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### DOCTESTS ### +if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then + + MSG='Doctests frame.py' ; echo $MSG + pytest -q --doctest-modules pandas/core/frame.py \ + -k" -itertuples -join -reindex -reindex_axis -round" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Doctests series.py' ; echo $MSG + pytest -q --doctest-modules pandas/core/series.py \ + -k"-nonzero -reindex -searchsorted -to_dict" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Doctests generic.py' ; echo $MSG + pytest -q --doctest-modules pandas/core/generic.py \ + -k"-_set_axis_name -_xs -describe -droplevel -groupby -interpolate -pct_change -pipe -reindex -reindex_axis -to_json -transpose -values -xs -to_clipboard" + RET=$(($RET + 
$?)) ; echo $MSG "DONE" + + MSG='Doctests top-level reshaping functions' ; echo $MSG + pytest -q --doctest-modules \ + pandas/core/reshape/concat.py \ + pandas/core/reshape/pivot.py \ + pandas/core/reshape/reshape.py \ + pandas/core/reshape/tile.py \ + -k"-crosstab -pivot_table -cut" + RET=$(($RET + $?)) ; echo $MSG "DONE" + + MSG='Doctests interval classes' ; echo $MSG + pytest --doctest-modules -v \ + pandas/core/indexes/interval.py \ + pandas/core/arrays/interval.py \ + -k"-from_arrays -from_breaks -from_intervals -from_tuples -get_loc -set_closed -to_tuples -interval_range" + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### DOCSTRINGS ### +if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then + + MSG='Validate docstrings (GL06, GL07, GL09, SS04, SS05, PR03, PR04, PR05, PR10, EX04, RT04, RT05, SA05)' ; echo $MSG + $BASE_DIR/scripts/validate_docstrings.py --format=azure --errors=GL06,GL07,GL09,SS04,SS05,PR03,PR04,PR05,PR10,EX04,RT04,RT05,SA05 + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +### DEPENDENCIES ### +if [[ -z "$CHECK" || "$CHECK" == "dependencies" ]]; then + + MSG='Check that requirements-dev.txt has been generated from environment.yml' ; echo $MSG + $BASE_DIR/scripts/generate_pip_deps_from_conda.py --compare --azure + RET=$(($RET + $?)) ; echo $MSG "DONE" + +fi + +exit $RET diff --git a/ci/deps/azure-27-compat.yaml b/ci/deps/azure-27-compat.yaml new file mode 100644 index 0000000000000..a7784f17d1956 --- /dev/null +++ b/ci/deps/azure-27-compat.yaml @@ -0,0 +1,28 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - bottleneck=1.2.0 + - cython=0.28.2 + - jinja2=2.8 + - numexpr=2.6.1 + - numpy=1.12.0 + - openpyxl=2.5.5 + - pytables=3.4.2 + - python-dateutil=2.5.0 + - python=2.7* + - pytz=2013b + - scipy=0.18.1 + - xlrd=1.0.0 + - xlsxwriter=0.5.2 + - xlwt=0.7.5 + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - isort + - pip: + - html5lib==1.0b2 + - beautifulsoup4==4.2.1 + - hypothesis>=3.58.0 diff --git 
a/ci/deps/azure-27-locale.yaml b/ci/deps/azure-27-locale.yaml new file mode 100644 index 0000000000000..8636a63d02fed --- /dev/null +++ b/ci/deps/azure-27-locale.yaml @@ -0,0 +1,30 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - bottleneck=1.2.0 + - cython=0.28.2 + - lxml + - matplotlib=2.0.0 + - numpy=1.12.0 + - openpyxl=2.4.0 + - python-dateutil + - python-blosc + - python=2.7 + - pytz + - pytz=2013b + - scipy + - sqlalchemy=0.8.1 + - xlrd=1.0.0 + - xlsxwriter=0.5.2 + - xlwt=0.7.5 + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - hypothesis>=3.58.0 + - isort + - pip: + - html5lib==1.0b2 + - beautifulsoup4==4.2.1 diff --git a/ci/deps/azure-36-locale_slow.yaml b/ci/deps/azure-36-locale_slow.yaml new file mode 100644 index 0000000000000..3f788e5ddcf39 --- /dev/null +++ b/ci/deps/azure-36-locale_slow.yaml @@ -0,0 +1,35 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - cython>=0.28.2 + - gcsfs + - html5lib + - ipython + - jinja2 + - lxml + - matplotlib + - nomkl + - numexpr + - numpy + - openpyxl + - pytables + - python-dateutil + - python=3.6* + - pytz + - s3fs + - scipy + - xarray + - xlrd + - xlsxwriter + - xlwt + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - moto + - isort + - pip: + - hypothesis>=3.58.0 diff --git a/ci/deps/azure-37-locale.yaml b/ci/deps/azure-37-locale.yaml new file mode 100644 index 0000000000000..9d598cddce91a --- /dev/null +++ b/ci/deps/azure-37-locale.yaml @@ -0,0 +1,34 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - cython>=0.28.2 + - html5lib + - ipython + - jinja2 + - lxml + - matplotlib + - nomkl + - numexpr + - numpy + - openpyxl + - pytables + - python-dateutil + - python=3.7* + - pytz + - s3fs + - scipy + - xarray + - xlrd + - xlsxwriter + - xlwt + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - isort + - pip: + - hypothesis>=3.58.0 + - moto # 
latest moto in conda-forge fails with 3.7, move to conda dependencies when this is fixed diff --git a/ci/deps/azure-37-numpydev.yaml b/ci/deps/azure-37-numpydev.yaml new file mode 100644 index 0000000000000..e58c1f599279c --- /dev/null +++ b/ci/deps/azure-37-numpydev.yaml @@ -0,0 +1,19 @@ +name: pandas-dev +channels: + - defaults +dependencies: + - python=3.7* + - pytz + - Cython>=0.28.2 + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - hypothesis>=3.58.0 + - isort + - pip: + - "git+git://github.com/dateutil/dateutil.git" + - "-f https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com" + - "--pre" + - "numpy" + - "scipy" diff --git a/ci/deps/azure-macos-35.yaml b/ci/deps/azure-macos-35.yaml new file mode 100644 index 0000000000000..2326e8092cc85 --- /dev/null +++ b/ci/deps/azure-macos-35.yaml @@ -0,0 +1,31 @@ +name: pandas-dev +channels: + - defaults +dependencies: + - beautifulsoup4 + - bottleneck + - cython>=0.28.2 + - html5lib + - jinja2 + - lxml + - matplotlib=2.2.0 + - nomkl + - numexpr + - numpy=1.12.0 + - openpyxl=2.5.5 + - pyarrow + - pytables + - python=3.5* + - pytz + - xarray + - xlrd + - xlsxwriter + - xlwt + - isort + - pip: + - python-dateutil==2.5.3 + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - hypothesis>=3.58.0 diff --git a/ci/deps/azure-windows-27.yaml b/ci/deps/azure-windows-27.yaml new file mode 100644 index 0000000000000..f40efdfca3cbd --- /dev/null +++ b/ci/deps/azure-windows-27.yaml @@ -0,0 +1,33 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - bottleneck + - dateutil + - gcsfs + - html5lib + - jinja2=2.8 + - lxml + - matplotlib=2.0.1 + - numexpr + - numpy=1.12* + - openpyxl + - pytables + - python=2.7.* + - pytz + - s3fs + - scipy + - sqlalchemy + - xlrd + - xlsxwriter + - xlwt + # universal + - cython>=0.28.2 + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - moto + - hypothesis>=3.58.0 + - isort diff --git 
a/ci/deps/azure-windows-36.yaml b/ci/deps/azure-windows-36.yaml new file mode 100644 index 0000000000000..8517d340f2ba8 --- /dev/null +++ b/ci/deps/azure-windows-36.yaml @@ -0,0 +1,30 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - blosc + - bottleneck + - boost-cpp<1.67 + - fastparquet>=0.2.1 + - matplotlib + - numexpr + - numpy=1.14* + - openpyxl + - parquet-cpp + - pyarrow + - pytables + - python-dateutil + - python=3.6.6 + - pytz + - scipy + - xlrd + - xlsxwriter + - xlwt + # universal + - cython>=0.28.2 + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - hypothesis>=3.58.0 + - isort diff --git a/ci/deps/travis-27.yaml b/ci/deps/travis-27.yaml new file mode 100644 index 0000000000000..a910af36a6b10 --- /dev/null +++ b/ci/deps/travis-27.yaml @@ -0,0 +1,51 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - bottleneck + - cython=0.28.2 + - fastparquet>=0.2.1 + - gcsfs + - html5lib + - ipython + - jemalloc=4.5.0.post + - jinja2=2.8 + - lxml + - matplotlib=2.2.2 + - mock + - nomkl + - numexpr + - numpy=1.13* + - openpyxl=2.4.0 + - patsy + - psycopg2 + - py + - pyarrow=0.9.0 + - PyCrypto + - pymysql=0.6.3 + - pytables + - blosc=1.14.3 + - python-blosc + - python-dateutil=2.5.0 + - python=2.7* + - pytz=2013b + - s3fs + - scipy + - sqlalchemy=0.9.6 + - xarray=0.9.6 + - xlrd=1.0.0 + - xlsxwriter=0.5.2 + - xlwt=0.7.5 + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - moto==1.3.4 + - hypothesis>=3.58.0 + - isort + - pip: + - backports.lzma + - pandas-gbq + - pathlib diff --git a/ci/deps/travis-36-doc.yaml b/ci/deps/travis-36-doc.yaml new file mode 100644 index 0000000000000..6f33bc58a8b21 --- /dev/null +++ b/ci/deps/travis-36-doc.yaml @@ -0,0 +1,46 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - bottleneck + - cython>=0.28.2 + - fastparquet>=0.2.1 + - gitpython + - html5lib + - hypothesis>=3.58.0 + - ipykernel + - ipython 
+ - ipywidgets + - lxml + - matplotlib + - nbconvert + - nbformat + - nbsphinx + - notebook + - numexpr + - numpy=1.13* + - numpydoc + - openpyxl + - pandoc + - pyarrow + - pyqt + - pytables + - python-dateutil + - python-snappy + - python=3.6* + - pytz + - scipy + - seaborn + - sphinx + - sqlalchemy + - statsmodels + - xarray + - xlrd + - xlsxwriter + - xlwt + # universal + - pytest>=4.0.2 + - pytest-xdist + - isort diff --git a/ci/deps/travis-36-locale.yaml b/ci/deps/travis-36-locale.yaml new file mode 100644 index 0000000000000..34b289e6c0c2f --- /dev/null +++ b/ci/deps/travis-36-locale.yaml @@ -0,0 +1,37 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - cython>=0.28.2 + - html5lib + - ipython + - jinja2 + - lxml + - matplotlib + - nomkl + - numexpr + - numpy + - openpyxl + - psycopg2 + - pymysql + - pytables + - python-dateutil + - python=3.6* + - pytz + - s3fs + - scipy + - sqlalchemy + - xarray + - xlrd + - xlsxwriter + - xlwt + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - moto + - isort + - pip: + - hypothesis>=3.58.0 diff --git a/ci/deps/travis-36-slow.yaml b/ci/deps/travis-36-slow.yaml new file mode 100644 index 0000000000000..46875d59411d9 --- /dev/null +++ b/ci/deps/travis-36-slow.yaml @@ -0,0 +1,33 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - beautifulsoup4 + - cython>=0.28.2 + - html5lib + - lxml + - matplotlib + - numexpr + - numpy + - openpyxl + - patsy + - psycopg2 + - pymysql + - pytables + - python-dateutil + - python=3.6* + - pytz + - s3fs + - scipy + - sqlalchemy + - xlrd + - xlsxwriter + - xlwt + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - moto + - hypothesis>=3.58.0 + - isort diff --git a/ci/deps/travis-36.yaml b/ci/deps/travis-36.yaml new file mode 100644 index 0000000000000..06fc0d76a3d16 --- /dev/null +++ b/ci/deps/travis-36.yaml @@ -0,0 +1,47 @@ +name: pandas-dev +channels: + - defaults + - conda-forge 
+dependencies: + - beautifulsoup4 + - botocore>=1.11 + - cython>=0.28.2 + - dask + - fastparquet>=0.2.1 + - gcsfs + - geopandas + - html5lib + - matplotlib + - nomkl + - numexpr + - numpy + - openpyxl + - psycopg2 + - pyarrow=0.9.0 + - pymysql + - pytables + - python-snappy + - python=3.6.6 + - pytz + - s3fs + - scikit-learn + - scipy + - sqlalchemy + - statsmodels + - xarray + - xlrd + - xlsxwriter + - xlwt + # universal + - pytest>=4.0.2 + - pytest-xdist + - pytest-cov + - pytest-mock + - hypothesis>=3.58.0 + - isort + - pip: + - brotlipy + - coverage + - moto + - pandas-datareader + - python-dateutil diff --git a/ci/deps/travis-37.yaml b/ci/deps/travis-37.yaml new file mode 100644 index 0000000000000..f71d29fe13378 --- /dev/null +++ b/ci/deps/travis-37.yaml @@ -0,0 +1,22 @@ +name: pandas-dev +channels: + - defaults + - conda-forge + - c3i_test +dependencies: + - python=3.7 + - botocore>=1.11 + - cython>=0.28.2 + - numpy + - python-dateutil + - nomkl + - pyarrow + - pytz + - pytest>=4.0.2 + - pytest-xdist + - pytest-mock + - hypothesis>=3.58.0 + - s3fs + - isort + - pip: + - moto diff --git a/ci/incremental/build.cmd b/ci/incremental/build.cmd new file mode 100644 index 0000000000000..2cce38c03f406 --- /dev/null +++ b/ci/incremental/build.cmd @@ -0,0 +1,9 @@ +@rem https://github.com/numba/numba/blob/master/buildscripts/incremental/build.cmd + +@rem Build numba extensions without silencing compile errors +python setup.py build_ext -q --inplace + +@rem Install pandas locally +python -m pip install -e . + +if %errorlevel% neq 0 exit /b %errorlevel% diff --git a/ci/incremental/build.sh b/ci/incremental/build.sh new file mode 100755 index 0000000000000..05648037935a3 --- /dev/null +++ b/ci/incremental/build.sh @@ -0,0 +1,16 @@ +#!/bin/bash + +# Make sure any error below is reported as such +set -v -e + +echo "[building extensions]" +python setup.py build_ext -q --inplace +python -m pip install -e . 
+ +echo +echo "[show environment]" +conda list + +echo +echo "[done]" +exit 0 diff --git a/ci/incremental/install_miniconda.sh b/ci/incremental/install_miniconda.sh new file mode 100755 index 0000000000000..a47dfdb324b34 --- /dev/null +++ b/ci/incremental/install_miniconda.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +set -v -e + +# Install Miniconda +unamestr=`uname` +if [[ "$unamestr" == 'Linux' ]]; then + if [[ "$BITS32" == "yes" ]]; then + wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86.sh -O miniconda.sh + else + wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh + fi +elif [[ "$unamestr" == 'Darwin' ]]; then + wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh +else + echo Error +fi +chmod +x miniconda.sh +./miniconda.sh -b diff --git a/ci/incremental/setup_conda_environment.cmd b/ci/incremental/setup_conda_environment.cmd new file mode 100644 index 0000000000000..c104d78591384 --- /dev/null +++ b/ci/incremental/setup_conda_environment.cmd @@ -0,0 +1,21 @@ +@rem https://github.com/numba/numba/blob/master/buildscripts/incremental/setup_conda_environment.cmd +@rem The cmd /C hack circumvents a regression where conda installs a conda.bat +@rem script in non-root environments. 
+set CONDA_INSTALL=cmd /C conda install -q -y +set PIP_INSTALL=pip install -q + +@echo on + +@rem Deactivate any environment +call deactivate +@rem Display root environment (for debugging) +conda list +@rem Clean up any left-over from a previous build +conda remove --all -q -y -n pandas-dev +@rem Scipy, CFFI, jinja2 and IPython are optional dependencies, but exercised in the test suite +conda env create --file=ci\deps\azure-windows-%CONDA_PY%.yaml + +call activate pandas-dev +conda list + +if %errorlevel% neq 0 exit /b %errorlevel% diff --git a/ci/incremental/setup_conda_environment.sh b/ci/incremental/setup_conda_environment.sh new file mode 100755 index 0000000000000..f174c17a614d8 --- /dev/null +++ b/ci/incremental/setup_conda_environment.sh @@ -0,0 +1,52 @@ +#!/bin/bash + +set -v -e + +CONDA_INSTALL="conda install -q -y" +PIP_INSTALL="pip install -q" + + +# Deactivate any environment +source deactivate +# Display root environment (for debugging) +conda list +# Clean up any left-over from a previous build +# (note workaround for https://github.com/conda/conda/issues/2679: +# `conda env remove` issue) +conda remove --all -q -y -n pandas-dev + +echo +echo "[create env]" +time conda env create -q --file="${ENV_FILE}" || exit 1 + +set +v +source activate pandas-dev +set -v + +# remove any installed pandas package +# w/o removing anything else +echo +echo "[removing installed pandas]" +conda remove pandas -y --force || true +pip uninstall -y pandas || true + +echo +echo "[no installed pandas]" +conda list pandas + +if [ -n "$LOCALE_OVERRIDE" ]; then + sudo locale-gen "$LOCALE_OVERRIDE" +fi + +# # Install the compiler toolchain +# if [[ $(uname) == Linux ]]; then +# if [[ "$CONDA_SUBDIR" == "linux-32" || "$BITS32" == "yes" ]] ; then +# $CONDA_INSTALL gcc_linux-32 gxx_linux-32 +# else +# $CONDA_INSTALL gcc_linux-64 gxx_linux-64 +# fi +# elif [[ $(uname) == Darwin ]]; then +# $CONDA_INSTALL clang_osx-64 clangxx_osx-64 +# # Install llvm-openmp and intel-openmp on OSX too 
+# $CONDA_INSTALL llvm-openmp intel-openmp +# fi diff --git a/ci/install.ps1 b/ci/install.ps1 deleted file mode 100644 index 64ec7f81884cd..0000000000000 --- a/ci/install.ps1 +++ /dev/null @@ -1,92 +0,0 @@ -# Sample script to install Miniconda under Windows -# Authors: Olivier Grisel, Jonathan Helmus and Kyle Kastner, Robert McGibbon -# License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/ - -$MINICONDA_URL = "http://repo.continuum.io/miniconda/" - - -function DownloadMiniconda ($python_version, $platform_suffix) { - $webclient = New-Object System.Net.WebClient - $filename = "Miniconda3-latest-Windows-" + $platform_suffix + ".exe" - $url = $MINICONDA_URL + $filename - - $basedir = $pwd.Path + "\" - $filepath = $basedir + $filename - if (Test-Path $filename) { - Write-Host "Reusing" $filepath - return $filepath - } - - # Download and retry up to 3 times in case of network transient errors. - Write-Host "Downloading" $filename "from" $url - $retry_attempts = 2 - for($i=0; $i -lt $retry_attempts; $i++){ - try { - $webclient.DownloadFile($url, $filepath) - break - } - Catch [Exception]{ - Start-Sleep 1 - } - } - if (Test-Path $filepath) { - Write-Host "File saved at" $filepath - } else { - # Retry once to get the error message if any at the last try - $webclient.DownloadFile($url, $filepath) - } - return $filepath -} - - -function InstallMiniconda ($python_version, $architecture, $python_home) { - Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home - if (Test-Path $python_home) { - Write-Host $python_home "already exists, skipping." 
- return $false - } - if ($architecture -match "32") { - $platform_suffix = "x86" - } else { - $platform_suffix = "x86_64" - } - - $filepath = DownloadMiniconda $python_version $platform_suffix - Write-Host "Installing" $filepath "to" $python_home - $install_log = $python_home + ".log" - $args = "/S /D=$python_home" - Write-Host $filepath $args - Start-Process -FilePath $filepath -ArgumentList $args -Wait -Passthru - if (Test-Path $python_home) { - Write-Host "Python $python_version ($architecture) installation complete" - } else { - Write-Host "Failed to install Python in $python_home" - Get-Content -Path $install_log - Exit 1 - } -} - - -function InstallCondaPackages ($python_home, $spec) { - $conda_path = $python_home + "\Scripts\conda.exe" - $args = "install --yes " + $spec - Write-Host ("conda " + $args) - Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait -Passthru -} - -function UpdateConda ($python_home) { - $conda_path = $python_home + "\Scripts\conda.exe" - Write-Host "Updating conda..." 
- $args = "update --yes conda" - Write-Host $conda_path $args - Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait -Passthru -} - - -function main () { - InstallMiniconda "3.5" $env:PYTHON_ARCH $env:CONDA_ROOT - UpdateConda $env:CONDA_ROOT - InstallCondaPackages $env:CONDA_ROOT "conda-build jinja2 anaconda-client" -} - -main diff --git a/ci/install_circle.sh b/ci/install_circle.sh deleted file mode 100755 index 00e14b10ebbd6..0000000000000 --- a/ci/install_circle.sh +++ /dev/null @@ -1,85 +0,0 @@ -#!/usr/bin/env bash - -home_dir=$(pwd) -echo "[home_dir: $home_dir]" - -echo "[ls -ltr]" -ls -ltr - -echo "[Using clean Miniconda install]" -rm -rf "$MINICONDA_DIR" - -# install miniconda -wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -q -O miniconda.sh || exit 1 -bash miniconda.sh -b -p "$MINICONDA_DIR" || exit 1 - -export PATH="$MINICONDA_DIR/bin:$PATH" - -echo "[update conda]" -conda config --set ssl_verify false || exit 1 -conda config --set always_yes true --set changeps1 false || exit 1 -conda update -q conda - -# add the pandas channel to take priority -# to add extra packages -echo "[add channels]" -conda config --add channels pandas || exit 1 -conda config --remove channels defaults || exit 1 -conda config --add channels defaults || exit 1 - -# Useful for debugging any issues with conda -conda info -a || exit 1 - -# support env variables passed -export ENVS_FILE=".envs" - -# make sure that the .envs file exists. 
it is ok if it is empty -touch $ENVS_FILE - -# assume all command line arguments are environmental variables -for var in "$@" -do - echo "export $var" >> $ENVS_FILE -done - -echo "[environmental variable file]" -cat $ENVS_FILE -source $ENVS_FILE - -export REQ_BUILD=ci/requirements-${JOB}.build -export REQ_RUN=ci/requirements-${JOB}.run -export REQ_PIP=ci/requirements-${JOB}.pip - -# edit the locale override if needed -if [ -n "$LOCALE_OVERRIDE" ]; then - echo "[Adding locale to the first line of pandas/__init__.py]" - rm -f pandas/__init__.pyc - sedc="3iimport locale\nlocale.setlocale(locale.LC_ALL, '$LOCALE_OVERRIDE')\n" - sed -i "$sedc" pandas/__init__.py - echo "[head -4 pandas/__init__.py]" - head -4 pandas/__init__.py - echo -fi - -# create envbuild deps -echo "[create env: ${REQ_BUILD}]" -time conda create -n pandas -q --file=${REQ_BUILD} || exit 1 -time conda install -n pandas pytest || exit 1 - -source activate pandas - -# build but don't install -echo "[build em]" -time python setup.py build_ext --inplace || exit 1 - -# we may have run installations -echo "[conda installs: ${REQ_RUN}]" -if [ -e ${REQ_RUN} ]; then - time conda install -q --file=${REQ_RUN} || exit 1 -fi - -# we may have additional pip installs -echo "[pip installs: ${REQ_PIP}]" -if [ -e ${REQ_PIP} ]; then - pip install -r $REQ_PIP -fi diff --git a/ci/install_db_circle.sh b/ci/install_db_circle.sh deleted file mode 100755 index a00f74f009f54..0000000000000 --- a/ci/install_db_circle.sh +++ /dev/null @@ -1,8 +0,0 @@ -#!/bin/bash - -echo "installing dbs" -mysql -e 'create database pandas_nosetest;' -psql -c 'create database pandas_nosetest;' -U postgres - -echo "done" -exit 0 diff --git a/ci/install_travis.sh b/ci/install_travis.sh index f71df979c9df0..d1a940f119228 100755 --- a/ci/install_travis.sh +++ b/ci/install_travis.sh @@ -34,9 +34,9 @@ fi # install miniconda if [ "${TRAVIS_OS_NAME}" == "osx" ]; then - time wget http://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O 
miniconda.sh || exit 1 + time wget http://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -q -O miniconda.sh || exit 1 else - time wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh || exit 1 + time wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -q -O miniconda.sh || exit 1 fi time bash miniconda.sh -b -p "$MINICONDA_DIR" || exit 1 @@ -47,22 +47,9 @@ which conda echo echo "[update conda]" conda config --set ssl_verify false || exit 1 -conda config --set always_yes true --set changeps1 false || exit 1 +conda config --set quiet true --set always_yes true --set changeps1 false || exit 1 conda update -q conda -echo -echo "[add channels]" -# add the pandas channel to take priority -# to add extra packages -conda config --add channels pandas || exit 1 -conda config --remove channels defaults || exit 1 -conda config --add channels defaults || exit 1 - -if [ "$CONDA_FORGE" ]; then - # add conda-forge channel as priority - conda config --add channels conda-forge || exit 1 -fi - # Useful for debugging any issues with conda conda info -a || exit 1 @@ -93,89 +80,29 @@ echo echo "[create env]" # create our environment -REQ="ci/requirements-${JOB}.build" -time conda create -n pandas --file=${REQ} || exit 1 +time conda env create -q --file="${ENV_FILE}" || exit 1 -source activate pandas +source activate pandas-dev -# may have addtl installation instructions for this build +# remove any installed pandas package +# w/o removing anything else echo -echo "[build addtl installs]" -REQ="ci/requirements-${JOB}.build.sh" -if [ -e ${REQ} ]; then - time bash $REQ || exit 1 -fi - -time conda install -n pandas pytest -time pip install pytest-xdist - -if [ "$LINT" ]; then - conda install flake8 - pip install cpplint -fi - -if [ "$COVERAGE" ]; then - pip install coverage pytest-cov -fi - -echo -if [ "$BUILD_TEST" ]; then - - # build & install testing - echo ["Starting installation test."] - python setup.py clean - 
python setup.py build_ext --inplace - python setup.py sdist --formats=gztar - conda uninstall cython - pip install dist/*tar.gz || exit 1 - -else - - # build but don't install - echo "[build em]" - time python setup.py build_ext --inplace || exit 1 - -fi +echo "[removing installed pandas]" +conda remove pandas -y --force +pip uninstall -y pandas -# we may have run installations echo -echo "[conda installs]" -REQ="ci/requirements-${JOB}.run" -if [ -e ${REQ} ]; then - time conda install -n pandas --file=${REQ} || exit 1 -fi +echo "[no installed pandas]" +conda list pandas +pip list --format columns |grep pandas -# we may have additional pip installs -echo -echo "[pip installs]" -REQ="ci/requirements-${JOB}.pip" -if [ -e ${REQ} ]; then - pip install -r $REQ -fi +# build and install +echo "[running setup.py develop]" +python setup.py develop || exit 1 -# may have addtl installation instructions for this build echo -echo "[addtl installs]" -REQ="ci/requirements-${JOB}.sh" -if [ -e ${REQ} ]; then - time bash $REQ || exit 1 -fi - -# finish install if we are not doing a build-testk -if [ -z "$BUILD_TEST" ]; then - - # remove any installed pandas package - # w/o removing anything else - echo - echo "[removing installed pandas]" - conda remove pandas --force - - # install our pandas - echo - echo "[running setup.py develop]" - python setup.py develop || exit 1 - -fi +echo "[show environment]" +conda list echo echo "[done]" diff --git a/ci/lint.sh b/ci/lint.sh deleted file mode 100755 index ed3af2568811c..0000000000000 --- a/ci/lint.sh +++ /dev/null @@ -1,68 +0,0 @@ -#!/bin/bash - -echo "inside $0" - -source activate pandas - -RET=0 - -if [ "$LINT" ]; then - - # pandas/_libs/src is C code, so no need to search there. - echo "Linting *.py" - flake8 pandas --filename=*.py --exclude pandas/_libs/src - if [ $? 
-ne "0" ]; then - RET=1 - fi - echo "Linting *.py DONE" - - echo "Linting *.pyx" - flake8 pandas --filename=*.pyx --select=E501,E302,E203,E111,E114,E221,E303,E128,E231,E126 - if [ $? -ne "0" ]; then - RET=1 - fi - echo "Linting *.pyx DONE" - - echo "Linting *.pxi.in" - for path in 'src' - do - echo "linting -> pandas/$path" - flake8 pandas/$path --filename=*.pxi.in --select=E501,E302,E203,E111,E114,E221,E303,E231,E126 - if [ $? -ne "0" ]; then - RET=1 - fi - - done - echo "Linting *.pxi.in DONE" - - # readability/casting: Warnings about C casting instead of C++ casting - # runtime/int: Warnings about using C number types instead of C++ ones - # build/include_subdir: Warnings about prefacing included header files with directory - - # We don't lint all C files because we don't want to lint any that are built - # from Cython files nor do we want to lint C files that we didn't modify for - # this particular codebase (e.g. src/headers, src/klib, src/msgpack). However, - # we can lint all header files since they aren't "generated" like C files are. - echo "Linting *.c and *.h" - for path in '*.h' 'period_helper.c' 'datetime' 'parser' 'ujson' - do - echo "linting -> pandas/_libs/src/$path" - cpplint --quiet --extensions=c,h --headers=h --filter=-readability/casting,-runtime/int,-build/include_subdir --recursive pandas/_libs/src/$path - if [ $? -ne "0" ]; then - RET=1 - fi - done - echo "Linting *.c and *.h DONE" - - echo "Check for invalid testing" - grep -r -E --include '*.py' --exclude testing.py '(numpy|np)\.testing' pandas - if [ $? 
= "0" ]; then - RET=1 - fi - echo "Check for invalid testing DONE" - -else - echo "NOT Linting" -fi - -exit $RET diff --git a/ci/print_skipped.py b/ci/print_skipped.py index dd2180f6eeb19..67bc7b556cd43 100755 --- a/ci/print_skipped.py +++ b/ci/print_skipped.py @@ -10,7 +10,7 @@ def parse_results(filename): root = tree.getroot() skipped = [] - current_class = old_class = '' + current_class = '' i = 1 assert i - 1 == len(skipped) for el in root.findall('testcase'): @@ -24,7 +24,9 @@ def parse_results(filename): out = '' if old_class != current_class: ndigits = int(math.log(i, 10) + 1) - out += ('-' * (len(name + msg) + 4 + ndigits) + '\n') # 4 for : + space + # + space + + # 4 for : + space + # + space + out += ('-' * (len(name + msg) + 4 + ndigits) + '\n') out += '#{i} {name}: {msg}'.format(i=i, name=name, msg=msg) skipped.append(out) i += 1 diff --git a/ci/print_versions.py b/ci/print_versions.py deleted file mode 100755 index 8be795174d76d..0000000000000 --- a/ci/print_versions.py +++ /dev/null @@ -1,28 +0,0 @@ -#!/usr/bin/env python - - -def show_versions(as_json=False): - import imp - import os - fn = __file__ - this_dir = os.path.dirname(fn) - pandas_dir = os.path.abspath(os.path.join(this_dir, "..")) - sv_path = os.path.join(pandas_dir, 'pandas', 'util') - mod = imp.load_module( - 'pvmod', *imp.find_module('print_versions', [sv_path])) - return mod.show_versions(as_json) - - -if __name__ == '__main__': - # optparse is 2.6-safe - from optparse import OptionParser - parser = OptionParser() - parser.add_option("-j", "--json", metavar="FILE", nargs=1, - help="Save output as JSON into file, pass in '-' to output to stdout") - - (options, args) = parser.parse_args() - - if options.json == "-": - options.json = True - - show_versions(as_json=options.json) diff --git a/ci/requirements-2.7.build b/ci/requirements-2.7.build deleted file mode 100644 index 415df13179fcf..0000000000000 --- a/ci/requirements-2.7.build +++ /dev/null @@ -1,6 +0,0 @@ -python=2.7* 
-python-dateutil=2.4.1 -pytz=2013b -nomkl -numpy -cython=0.23 diff --git a/ci/requirements-2.7.pip b/ci/requirements-2.7.pip deleted file mode 100644 index eb796368e7820..0000000000000 --- a/ci/requirements-2.7.pip +++ /dev/null @@ -1,8 +0,0 @@ -blosc -pandas-gbq -pathlib -backports.lzma -py -PyCrypto -mock -ipython diff --git a/ci/requirements-2.7.run b/ci/requirements-2.7.run deleted file mode 100644 index 62e31e4ae24e3..0000000000000 --- a/ci/requirements-2.7.run +++ /dev/null @@ -1,22 +0,0 @@ -python-dateutil=2.4.1 -pytz=2013b -numpy -xlwt=0.7.5 -numexpr -pytables -matplotlib -openpyxl=1.6.2 -xlrd=0.9.2 -sqlalchemy=0.9.6 -lxml=3.2.1 -scipy -xlsxwriter=0.4.6 -s3fs -bottleneck -psycopg2=2.5.2 -patsy -pymysql=0.6.3 -html5lib=1.0b2 -beautiful-soup=4.2.1 -jinja2=2.8 -xarray=0.8.0 diff --git a/ci/requirements-2.7.sh b/ci/requirements-2.7.sh deleted file mode 100644 index 64d470e5c6e0e..0000000000000 --- a/ci/requirements-2.7.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash - -source activate pandas - -echo "install 27" - -conda install -n pandas -c conda-forge feather-format diff --git a/ci/requirements-2.7_BUILD_TEST.build b/ci/requirements-2.7_BUILD_TEST.build deleted file mode 100644 index aadec00cb7ebf..0000000000000 --- a/ci/requirements-2.7_BUILD_TEST.build +++ /dev/null @@ -1,6 +0,0 @@ -python=2.7* -dateutil -pytz -nomkl -numpy -cython diff --git a/ci/requirements-2.7_COMPAT.build b/ci/requirements-2.7_COMPAT.build deleted file mode 100644 index 0e1ccf9eac9bf..0000000000000 --- a/ci/requirements-2.7_COMPAT.build +++ /dev/null @@ -1,5 +0,0 @@ -python=2.7* -numpy=1.7.1 -cython=0.23 -dateutil=1.5 -pytz=2013b diff --git a/ci/requirements-2.7_COMPAT.pip b/ci/requirements-2.7_COMPAT.pip deleted file mode 100644 index 9533a630d06a4..0000000000000 --- a/ci/requirements-2.7_COMPAT.pip +++ /dev/null @@ -1,2 +0,0 @@ -openpyxl -argparse diff --git a/ci/requirements-2.7_COMPAT.run b/ci/requirements-2.7_COMPAT.run deleted file mode 100644 index d27b6a72c2d15..0000000000000 --- 
a/ci/requirements-2.7_COMPAT.run +++ /dev/null @@ -1,16 +0,0 @@ -numpy=1.7.1 -dateutil=1.5 -pytz=2013b -scipy=0.11.0 -xlwt=0.7.5 -xlrd=0.9.2 -bottleneck=0.8.0 -numexpr=2.2.2 -pytables=3.0.0 -html5lib=1.0b2 -beautiful-soup=4.2.0 -psycopg2=2.5.1 -pymysql=0.6.0 -sqlalchemy=0.7.8 -xlsxwriter=0.4.6 -jinja2=2.8 diff --git a/ci/requirements-2.7_LOCALE.build b/ci/requirements-2.7_LOCALE.build deleted file mode 100644 index 4a37ce8fbe161..0000000000000 --- a/ci/requirements-2.7_LOCALE.build +++ /dev/null @@ -1,5 +0,0 @@ -python=2.7* -python-dateutil -pytz=2013b -numpy=1.8.2 -cython=0.23 diff --git a/ci/requirements-2.7_LOCALE.pip b/ci/requirements-2.7_LOCALE.pip deleted file mode 100644 index cf8e6b8b3d3a6..0000000000000 --- a/ci/requirements-2.7_LOCALE.pip +++ /dev/null @@ -1 +0,0 @@ -blosc diff --git a/ci/requirements-2.7_LOCALE.run b/ci/requirements-2.7_LOCALE.run deleted file mode 100644 index 5d7cc31b7d55e..0000000000000 --- a/ci/requirements-2.7_LOCALE.run +++ /dev/null @@ -1,14 +0,0 @@ -python-dateutil -pytz=2013b -numpy=1.8.2 -xlwt=0.7.5 -openpyxl=1.6.2 -xlsxwriter=0.4.6 -xlrd=0.9.2 -bottleneck=0.8.0 -matplotlib=1.3.1 -sqlalchemy=0.8.1 -html5lib=1.0b2 -lxml=3.2.1 -scipy -beautiful-soup=4.2.1 diff --git a/ci/requirements-2.7_SLOW.build b/ci/requirements-2.7_SLOW.build deleted file mode 100644 index 0f4a2c6792e6b..0000000000000 --- a/ci/requirements-2.7_SLOW.build +++ /dev/null @@ -1,5 +0,0 @@ -python=2.7* -python-dateutil -pytz -numpy=1.8.2 -cython diff --git a/ci/requirements-2.7_SLOW.run b/ci/requirements-2.7_SLOW.run deleted file mode 100644 index c2d2a14285ad6..0000000000000 --- a/ci/requirements-2.7_SLOW.run +++ /dev/null @@ -1,20 +0,0 @@ -python-dateutil -pytz -numpy=1.8.2 -matplotlib=1.3.1 -scipy -patsy -xlwt -openpyxl -xlsxwriter -xlrd -numexpr -pytables -sqlalchemy -lxml -s3fs -bottleneck -psycopg2 -pymysql -html5lib -beautiful-soup diff --git a/ci/requirements-2.7_WIN.run b/ci/requirements-2.7_WIN.run deleted file mode 100644 index 
f953682f52d45..0000000000000 --- a/ci/requirements-2.7_WIN.run +++ /dev/null @@ -1,18 +0,0 @@ -dateutil -pytz -numpy=1.10* -xlwt -numexpr -pytables==3.2.2 -matplotlib -openpyxl -xlrd -sqlalchemy -lxml=3.2.1 -scipy -xlsxwriter -s3fs -bottleneck -html5lib -beautiful-soup -jinja2=2.8 diff --git a/ci/requirements-3.4.build b/ci/requirements-3.4.build deleted file mode 100644 index e8a957f70d40e..0000000000000 --- a/ci/requirements-3.4.build +++ /dev/null @@ -1,4 +0,0 @@ -python=3.4* -numpy=1.8.1 -cython=0.24.1 -libgfortran=1.0 diff --git a/ci/requirements-3.4.pip b/ci/requirements-3.4.pip deleted file mode 100644 index 4e5fe52d56cf1..0000000000000 --- a/ci/requirements-3.4.pip +++ /dev/null @@ -1,2 +0,0 @@ -python-dateutil==2.2 -blosc diff --git a/ci/requirements-3.4.run b/ci/requirements-3.4.run deleted file mode 100644 index 3e12adae7dd9f..0000000000000 --- a/ci/requirements-3.4.run +++ /dev/null @@ -1,18 +0,0 @@ -pytz=2015.7 -numpy=1.8.1 -openpyxl -xlsxwriter -xlrd -xlwt -html5lib -patsy -beautiful-soup -scipy -numexpr -pytables -lxml -sqlalchemy -bottleneck -pymysql=0.6.3 -psycopg2 -jinja2=2.8 diff --git a/ci/requirements-3.4_SLOW.build b/ci/requirements-3.4_SLOW.build deleted file mode 100644 index 88212053af472..0000000000000 --- a/ci/requirements-3.4_SLOW.build +++ /dev/null @@ -1,6 +0,0 @@ -python=3.4* -python-dateutil -pytz -nomkl -numpy=1.10* -cython diff --git a/ci/requirements-3.4_SLOW.run b/ci/requirements-3.4_SLOW.run deleted file mode 100644 index 90156f62c6e71..0000000000000 --- a/ci/requirements-3.4_SLOW.run +++ /dev/null @@ -1,20 +0,0 @@ -python-dateutil -pytz -numpy=1.10* -openpyxl -xlsxwriter -xlrd -xlwt -html5lib -patsy -beautiful-soup -scipy -numexpr=2.4.6 -pytables -matplotlib -lxml -sqlalchemy -bottleneck -pymysql -psycopg2 -jinja2=2.8 diff --git a/ci/requirements-3.4_SLOW.sh b/ci/requirements-3.4_SLOW.sh deleted file mode 100644 index 24f1e042ed69e..0000000000000 --- a/ci/requirements-3.4_SLOW.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash - 
-source activate pandas - -echo "install 34_slow" - -conda install -n pandas -c conda-forge matplotlib diff --git a/ci/requirements-3.5.build b/ci/requirements-3.5.build deleted file mode 100644 index 76227e106e1fd..0000000000000 --- a/ci/requirements-3.5.build +++ /dev/null @@ -1,6 +0,0 @@ -python=3.5* -python-dateutil -pytz -nomkl -numpy=1.11.3 -cython diff --git a/ci/requirements-3.5.pip b/ci/requirements-3.5.pip deleted file mode 100644 index 6e4f7b65f9728..0000000000000 --- a/ci/requirements-3.5.pip +++ /dev/null @@ -1,2 +0,0 @@ -xarray==0.9.1 -pandas-gbq diff --git a/ci/requirements-3.5.run b/ci/requirements-3.5.run deleted file mode 100644 index 43e6814ed6c8e..0000000000000 --- a/ci/requirements-3.5.run +++ /dev/null @@ -1,21 +0,0 @@ -python-dateutil -pytz -numpy=1.11.3 -openpyxl -xlsxwriter -xlrd -xlwt -scipy -numexpr -pytables -html5lib -lxml -matplotlib -jinja2 -bottleneck -sqlalchemy -pymysql -psycopg2 -s3fs -beautifulsoup4 -ipython diff --git a/ci/requirements-3.5.sh b/ci/requirements-3.5.sh deleted file mode 100644 index d0f0b81802dc6..0000000000000 --- a/ci/requirements-3.5.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash - -source activate pandas - -echo "install 35" - -conda install -n pandas -c conda-forge feather-format diff --git a/ci/requirements-3.5_ASCII.build b/ci/requirements-3.5_ASCII.build deleted file mode 100644 index f7befe3b31865..0000000000000 --- a/ci/requirements-3.5_ASCII.build +++ /dev/null @@ -1,6 +0,0 @@ -python=3.5* -python-dateutil -pytz -nomkl -numpy -cython diff --git a/ci/requirements-3.5_ASCII.run b/ci/requirements-3.5_ASCII.run deleted file mode 100644 index b9d543f557d06..0000000000000 --- a/ci/requirements-3.5_ASCII.run +++ /dev/null @@ -1,3 +0,0 @@ -python-dateutil -pytz -numpy diff --git a/ci/requirements-3.5_DOC.build b/ci/requirements-3.5_DOC.build deleted file mode 100644 index 73aeb3192242f..0000000000000 --- a/ci/requirements-3.5_DOC.build +++ /dev/null @@ -1,5 +0,0 @@ -python=3.5* -python-dateutil -pytz -numpy 
-cython diff --git a/ci/requirements-3.5_DOC.run b/ci/requirements-3.5_DOC.run deleted file mode 100644 index 644a16f51f4b6..0000000000000 --- a/ci/requirements-3.5_DOC.run +++ /dev/null @@ -1,21 +0,0 @@ -ipython -ipykernel -sphinx -nbconvert -nbformat -notebook -matplotlib -scipy -lxml -beautifulsoup4 -html5lib -pytables -openpyxl=1.8.5 -xlrd -xlwt -xlsxwriter -sqlalchemy -numexpr -bottleneck -statsmodels -pyqt=4.11.4 diff --git a/ci/requirements-3.5_DOC.sh b/ci/requirements-3.5_DOC.sh deleted file mode 100644 index 1a5d4643edcf2..0000000000000 --- a/ci/requirements-3.5_DOC.sh +++ /dev/null @@ -1,11 +0,0 @@ -#!/bin/bash - -source activate pandas - -echo "[install DOC_BUILD deps]" - -pip install pandas-gbq - -conda install -n pandas -c conda-forge feather-format - -conda install -n pandas -c r r rpy2 --yes diff --git a/ci/requirements-3.5_OSX.build b/ci/requirements-3.5_OSX.build deleted file mode 100644 index f5bc01b67a20a..0000000000000 --- a/ci/requirements-3.5_OSX.build +++ /dev/null @@ -1,4 +0,0 @@ -python=3.5* -nomkl -numpy=1.10.4 -cython diff --git a/ci/requirements-3.5_OSX.pip b/ci/requirements-3.5_OSX.pip deleted file mode 100644 index d1fc1fe24a079..0000000000000 --- a/ci/requirements-3.5_OSX.pip +++ /dev/null @@ -1 +0,0 @@ -python-dateutil==2.5.3 diff --git a/ci/requirements-3.5_OSX.run b/ci/requirements-3.5_OSX.run deleted file mode 100644 index 1d83474d10f2f..0000000000000 --- a/ci/requirements-3.5_OSX.run +++ /dev/null @@ -1,16 +0,0 @@ -pytz -numpy=1.10.4 -openpyxl -xlsxwriter -xlrd -xlwt -numexpr -pytables -html5lib -lxml -matplotlib -jinja2 -bottleneck -xarray -s3fs -beautifulsoup4 diff --git a/ci/requirements-3.5_OSX.sh b/ci/requirements-3.5_OSX.sh deleted file mode 100644 index cfbd2882a8a2d..0000000000000 --- a/ci/requirements-3.5_OSX.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash - -source activate pandas - -echo "install 35_OSX" - -conda install -n pandas -c conda-forge feather-format diff --git a/ci/requirements-3.6.build 
b/ci/requirements-3.6.build deleted file mode 100644 index 1c4b46aea3865..0000000000000 --- a/ci/requirements-3.6.build +++ /dev/null @@ -1,6 +0,0 @@ -python=3.6* -python-dateutil -pytz -nomkl -numpy -cython diff --git a/ci/requirements-3.6.run b/ci/requirements-3.6.run deleted file mode 100644 index 41c9680ce1b7e..0000000000000 --- a/ci/requirements-3.6.run +++ /dev/null @@ -1,22 +0,0 @@ -python-dateutil -pytz -numpy -scipy -openpyxl -xlsxwriter -xlrd -xlwt -numexpr -pytables -matplotlib -lxml -html5lib -jinja2 -sqlalchemy -pymysql -feather-format -# psycopg2 (not avail on defaults ATM) -beautifulsoup4 -s3fs -xarray -ipython diff --git a/ci/requirements-3.6_NUMPY_DEV.build b/ci/requirements-3.6_NUMPY_DEV.build deleted file mode 100644 index 738366867a217..0000000000000 --- a/ci/requirements-3.6_NUMPY_DEV.build +++ /dev/null @@ -1,4 +0,0 @@ -python=3.6* -python-dateutil -pytz -cython diff --git a/ci/requirements-3.6_NUMPY_DEV.build.sh b/ci/requirements-3.6_NUMPY_DEV.build.sh deleted file mode 100644 index 4af1307f26a18..0000000000000 --- a/ci/requirements-3.6_NUMPY_DEV.build.sh +++ /dev/null @@ -1,14 +0,0 @@ -#!/bin/bash - -source activate pandas - -echo "install numpy master wheel" - -# remove the system installed numpy -pip uninstall numpy -y - -# install numpy wheel from master -PRE_WHEELS="https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com" -pip install --pre --upgrade --timeout=60 -f $PRE_WHEELS numpy scipy - -true diff --git a/ci/requirements-3.6_NUMPY_DEV.run b/ci/requirements-3.6_NUMPY_DEV.run deleted file mode 100644 index 0aa987baefb1d..0000000000000 --- a/ci/requirements-3.6_NUMPY_DEV.run +++ /dev/null @@ -1,2 +0,0 @@ -python-dateutil -pytz diff --git a/ci/requirements-3.6_WIN.run b/ci/requirements-3.6_WIN.run deleted file mode 100644 index 840d2867e9297..0000000000000 --- a/ci/requirements-3.6_WIN.run +++ /dev/null @@ -1,13 +0,0 @@ -python-dateutil -pytz -numpy=1.12* -openpyxl -xlsxwriter -xlrd -xlwt -scipy -feather-format 
-numexpr -pytables -matplotlib -blosc diff --git a/ci/requirements_all.txt b/ci/requirements_all.txt deleted file mode 100644 index 4ff80a478f247..0000000000000 --- a/ci/requirements_all.txt +++ /dev/null @@ -1,26 +0,0 @@ -pytest -pytest-cov -pytest-xdist -flake8 -sphinx -ipython -python-dateutil -pytz -openpyxl -xlsxwriter -xlrd -xlwt -html5lib -patsy -beautiful-soup -numpy -cython -scipy -numexpr -pytables -matplotlib -lxml -sqlalchemy -bottleneck -pymysql -Jinja2 diff --git a/ci/requirements_dev.txt b/ci/requirements_dev.txt deleted file mode 100644 index 1e051802ec9f8..0000000000000 --- a/ci/requirements_dev.txt +++ /dev/null @@ -1,7 +0,0 @@ -python-dateutil -pytz -numpy -cython -pytest -pytest-cov -flake8 diff --git a/ci/run_build_docs.sh b/ci/run_build_docs.sh deleted file mode 100755 index 2909b9619552e..0000000000000 --- a/ci/run_build_docs.sh +++ /dev/null @@ -1,10 +0,0 @@ -#!/bin/bash - -echo "inside $0" - -"$TRAVIS_BUILD_DIR"/ci/build_docs.sh 2>&1 - -# wait until subprocesses finish (build_docs.sh) -wait - -exit 0 diff --git a/ci/run_circle.sh b/ci/run_circle.sh deleted file mode 100755 index 0e46d28ab6fc4..0000000000000 --- a/ci/run_circle.sh +++ /dev/null @@ -1,9 +0,0 @@ -#!/usr/bin/env bash - -echo "[running tests]" -export PATH="$MINICONDA_DIR/bin:$PATH" - -source activate pandas - -echo "pytest --junitxml=$CIRCLE_TEST_REPORTS/reports/junit.xml $@ pandas" -pytest --junitxml=$CIRCLE_TEST_REPORTS/reports/junit.xml $@ pandas diff --git a/ci/run_tests.sh b/ci/run_tests.sh new file mode 100755 index 0000000000000..ee46da9f52eab --- /dev/null +++ b/ci/run_tests.sh @@ -0,0 +1,58 @@ +#!/bin/bash + +set -e + +if [ "$DOC" ]; then + echo "We are not running pytest as this is a doc-build" + exit 0 +fi + +# Workaround for pytest-xdist flaky collection order +# https://github.com/pytest-dev/pytest/issues/920 +# https://github.com/pytest-dev/pytest/issues/1075 +export PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))') + +if [ -n 
"$LOCALE_OVERRIDE" ]; then + export LC_ALL="$LOCALE_OVERRIDE" + export LANG="$LOCALE_OVERRIDE" + PANDAS_LOCALE=`python -c 'import pandas; pandas.get_option("display.encoding")'` + if [[ "$LOCALE_OVERIDE" != "$PANDAS_LOCALE" ]]; then + echo "pandas could not detect the locale. System locale: $LOCALE_OVERRIDE, pandas detected: $PANDAS_LOCALE" + # TODO Not really aborting the tests until https://github.com/pandas-dev/pandas/issues/23923 is fixed + # exit 1 + fi +fi +if [[ "not network" == *"$PATTERN"* ]]; then + export http_proxy= https_proxy=; +fi + + +if [ -n "$PATTERN" ]; then + PATTERN=" and $PATTERN" +fi + +for TYPE in single multiple +do + if [ "$COVERAGE" ]; then + COVERAGE_FNAME="/tmp/coc-$TYPE.xml" + COVERAGE="-s --cov=pandas --cov-report=xml:$COVERAGE_FNAME" + fi + + TYPE_PATTERN=$TYPE + NUM_JOBS=1 + if [[ "$TYPE_PATTERN" == "multiple" ]]; then + TYPE_PATTERN="not single" + NUM_JOBS=2 + fi + + PYTEST_CMD="pytest -m \"$TYPE_PATTERN$PATTERN\" -n $NUM_JOBS -s --strict --durations=10 --junitxml=test-data-$TYPE.xml $TEST_ARGS $COVERAGE pandas" + echo $PYTEST_CMD + # if no tests are found (the case of "single and slow"), pytest exits with code 5, and would make the script fail, if not for the below code + sh -c "$PYTEST_CMD; ret=\$?; [ \$ret = 5 ] && exit 0 || exit \$ret" + + if [[ "$COVERAGE" && $? 
== 0 ]]; then + echo "uploading coverage for $TYPE tests" + echo "bash <(curl -s https://codecov.io/bash) -Z -c -F $TYPE -f $COVERAGE_FNAME" + bash <(curl -s https://codecov.io/bash) -Z -c -F $TYPE -f $COVERAGE_FNAME + fi +done diff --git a/ci/script_multi.sh b/ci/script_multi.sh deleted file mode 100755 index 88ecaf344a410..0000000000000 --- a/ci/script_multi.sh +++ /dev/null @@ -1,36 +0,0 @@ -#!/bin/bash - -echo "[script multi]" - -source activate pandas - -if [ -n "$LOCALE_OVERRIDE" ]; then - export LC_ALL="$LOCALE_OVERRIDE"; - echo "Setting LC_ALL to $LOCALE_OVERRIDE" - - pycmd='import pandas; print("pandas detected console encoding: %s" % pandas.get_option("display.encoding"))' - python -c "$pycmd" -fi - -# Workaround for pytest-xdist flaky collection order -# https://github.com/pytest-dev/pytest/issues/920 -# https://github.com/pytest-dev/pytest/issues/1075 -export PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))') -echo PYTHONHASHSEED=$PYTHONHASHSEED - -if [ "$BUILD_TEST" ]; then - cd /tmp - python -c "import pandas; pandas.test(['-n 2'])" -elif [ "$DOC" ]; then - echo "We are not running pytest as this is a doc-build" -elif [ "$COVERAGE" ]; then - echo pytest -s -n 2 -m "not single" --cov=pandas --cov-report xml:/tmp/cov-multiple.xml --junitxml=/tmp/multiple.xml $TEST_ARGS pandas - pytest -s -n 2 -m "not single" --cov=pandas --cov-report xml:/tmp/cov-multiple.xml --junitxml=/tmp/multiple.xml $TEST_ARGS pandas -else - echo pytest -n 2 -m "not single" --junitxml=/tmp/multiple.xml $TEST_ARGS pandas - pytest -n 2 -m "not single" --junitxml=/tmp/multiple.xml $TEST_ARGS pandas # TODO: doctest -fi - -RET="$?" 
- -exit "$RET" diff --git a/ci/script_single.sh b/ci/script_single.sh deleted file mode 100755 index db637679f0e0f..0000000000000 --- a/ci/script_single.sh +++ /dev/null @@ -1,29 +0,0 @@ -#!/bin/bash - -echo "[script_single]" - -source activate pandas - -if [ -n "$LOCALE_OVERRIDE" ]; then - export LC_ALL="$LOCALE_OVERRIDE"; - echo "Setting LC_ALL to $LOCALE_OVERRIDE" - - pycmd='import pandas; print("pandas detected console encoding: %s" % pandas.get_option("display.encoding"))' - python -c "$pycmd" -fi - -if [ "$BUILD_TEST" ]; then - echo "We are not running pytest as this is a build test." -elif [ "$DOC" ]; then - echo "We are not running pytest as this is a doc-build" -elif [ "$COVERAGE" ]; then - echo pytest -s -m "single" --cov=pandas --cov-report xml:/tmp/cov-single.xml --junitxml=/tmp/single.xml $TEST_ARGS pandas - pytest -s -m "single" --cov=pandas --cov-report xml:/tmp/cov-single.xml --junitxml=/tmp/single.xml $TEST_ARGS pandas -else - echo pytest -m "single" --junitxml=/tmp/single.xml $TEST_ARGS pandas - pytest -m "single" --junitxml=/tmp/single.xml $TEST_ARGS pandas # TODO: doctest -fi - -RET="$?" 
- -exit "$RET" diff --git a/ci/show_circle.sh b/ci/show_circle.sh deleted file mode 100755 index bfaa65c1d84f2..0000000000000 --- a/ci/show_circle.sh +++ /dev/null @@ -1,8 +0,0 @@ -#!/usr/bin/env bash - -echo "[installed versions]" - -export PATH="$MINICONDA_DIR/bin:$PATH" -source activate pandas - -python -c "import pandas; pandas.show_versions();" diff --git a/ci/upload_coverage.sh b/ci/upload_coverage.sh deleted file mode 100755 index a7ef2fa908079..0000000000000 --- a/ci/upload_coverage.sh +++ /dev/null @@ -1,12 +0,0 @@ -#!/bin/bash - -if [ -z "$COVERAGE" ]; then - echo "coverage is not selected for this build" - exit 0 -fi - -source activate pandas - -echo "uploading coverage" -bash <(curl -s https://codecov.io/bash) -Z -c -F single -f /tmp/cov-single.xml -bash <(curl -s https://codecov.io/bash) -Z -c -F multiple -f /tmp/cov-multiple.xml diff --git a/circle.yml b/circle.yml deleted file mode 100644 index fa2da0680f388..0000000000000 --- a/circle.yml +++ /dev/null @@ -1,38 +0,0 @@ -machine: - environment: - # these are globally set - MINICONDA_DIR: /home/ubuntu/miniconda3 - - -database: - override: - - ./ci/install_db_circle.sh - - -checkout: - post: - # since circleci does a shallow fetch - # we need to populate our tags - - git fetch --depth=1000 - - -dependencies: - override: - - > - case $CIRCLE_NODE_INDEX in - 0) - sudo apt-get install language-pack-it && ./ci/install_circle.sh JOB="2.7_COMPAT" LOCALE_OVERRIDE="it_IT.UTF-8" ;; - 1) - sudo apt-get install language-pack-zh-hans && ./ci/install_circle.sh JOB="3.4_SLOW" LOCALE_OVERRIDE="zh_CN.UTF-8" ;; - 2) - sudo apt-get install language-pack-zh-hans && ./ci/install_circle.sh JOB="3.4" LOCALE_OVERRIDE="zh_CN.UTF-8" ;; - 3) - ./ci/install_circle.sh JOB="3.5_ASCII" LOCALE_OVERRIDE="C" ;; - esac - - ./ci/show_circle.sh - - -test: - override: - - case $CIRCLE_NODE_INDEX in 0) ./ci/run_circle.sh --skip-slow --skip-network ;; 1) ./ci/run_circle.sh --only-slow --skip-network ;; 2) ./ci/run_circle.sh --skip-slow 
--skip-network ;; 3) ./ci/run_circle.sh --skip-slow --skip-network ;; esac: - parallel: true diff --git a/codecov.yml b/codecov.yml index b4552563deeaa..512bc2e82a736 100644 --- a/codecov.yml +++ b/codecov.yml @@ -5,7 +5,9 @@ coverage: status: project: default: + enabled: no target: '82' patch: default: + enabled: no target: '50' diff --git a/conda.recipe/meta.yaml b/conda.recipe/meta.yaml index 2aee11772896f..f92090fecccf3 100644 --- a/conda.recipe/meta.yaml +++ b/conda.recipe/meta.yaml @@ -1,9 +1,9 @@ package: name: pandas - version: {{ GIT_DESCRIBE_TAG|replace("v","") }} + version: {{ environ.get('GIT_DESCRIBE_TAG','').replace('v', '', 1) }} build: - number: {{ GIT_DESCRIBE_NUMBER|int }} + number: {{ environ.get('GIT_DESCRIBE_NUMBER', 0) }} {% if GIT_DESCRIBE_NUMBER|int == 0 %}string: np{{ CONDA_NPY }}py{{ CONDA_PY }}_0 {% else %}string: np{{ CONDA_NPY }}py{{ CONDA_PY }}_{{ GIT_BUILD_STR }}{% endif %} @@ -12,22 +12,28 @@ source: requirements: build: + - {{ compiler('c') }} + - {{ compiler('cxx') }} + host: - python + - pip - cython - - numpy x.x - - setuptools + - numpy + - setuptools >=3.3 + - python-dateutil >=2.5.0 - pytz - - python-dateutil - run: - - python - - numpy x.x - - python-dateutil + - python {{ python }} + - {{ pin_compatible('numpy') }} + - python-dateutil >=2.5.0 - pytz test: - imports: - - pandas + requires: + - pytest + commands: + - python -c "import pandas; pandas.test()" + about: home: http://pandas.pydata.org diff --git a/doc/README.rst b/doc/README.rst index a3733846d9ed1..5423e7419d03b 100644 --- a/doc/README.rst +++ b/doc/README.rst @@ -1,169 +1 @@ -.. _contributing.docs: - -Contributing to the documentation -================================= - -If you're not the developer type, contributing to the documentation is still -of huge value. You don't even have to be an expert on -*pandas* to do so! Something as simple as rewriting small passages for clarity -as you reference the docs is a simple but effective way to contribute. 
The -next person to read that passage will be in your debt! - -Actually, there are sections of the docs that are worse off by being written -by experts. If something in the docs doesn't make sense to you, updating the -relevant section after you figure it out is a simple way to ensure it will -help the next person. - -.. contents:: Table of contents: - :local: - - -About the pandas documentation ------------------------------- - -The documentation is written in **reStructuredText**, which is almost like writing -in plain English, and built using `Sphinx `__. The -Sphinx Documentation has an excellent `introduction to reST -`__. Review the Sphinx docs to perform more -complex changes to the documentation as well. - -Some other important things to know about the docs: - -- The pandas documentation consists of two parts: the docstrings in the code - itself and the docs in this folder ``pandas/doc/``. - - The docstrings provide a clear explanation of the usage of the individual - functions, while the documentation in this folder consists of tutorial-like - overviews per topic together with some other information (what's new, - installation, etc). - -- The docstrings follow the **Numpy Docstring Standard** which is used widely - in the Scientific Python community. This standard specifies the format of - the different sections of the docstring. See `this document - `_ - for a detailed explanation, or look at some of the existing functions to - extend it in a similar manner. - -- The tutorials make heavy use of the `ipython directive - `_ sphinx extension. - This directive lets you put code in the documentation which will be run - during the doc build. For example: - - :: - - .. ipython:: python - - x = 2 - x**3 - - will be rendered as - - :: - - In [1]: x = 2 - - In [2]: x**3 - Out[2]: 8 - - This means that almost all code examples in the docs are always run (and the - output saved) during the doc build. 
This way, they will always be up to date, - but it makes the doc building a bit more complex. - - -How to build the pandas documentation -------------------------------------- - -Requirements -^^^^^^^^^^^^ - -To build the pandas docs there are some extra requirements: you will need to -have ``sphinx`` and ``ipython`` installed. `numpydoc -`_ is used to parse the docstrings that -follow the Numpy Docstring Standard (see above), but you don't need to install -this because a local copy of ``numpydoc`` is included in the pandas source -code. - -Furthermore, it is recommended to have all `optional dependencies -`_ -installed. This is not needed, but be aware that you will see some error -messages. Because all the code in the documentation is executed during the doc -build, the examples using this optional dependencies will generate errors. -Run ``pd.show_versions()`` to get an overview of the installed version of all -dependencies. - -.. warning:: - - Sphinx version >= 1.2.2 or the older 1.1.3 is required. - -Building pandas -^^^^^^^^^^^^^^^ - -For a step-by-step overview on how to set up your environment, to work with -the pandas code and git, see `the developer pages -`_. -When you start to work on some docs, be sure to update your code to the latest -development version ('master'):: - - git fetch upstream - git rebase upstream/master - -Often it will be necessary to rebuild the C extension after updating:: - - python setup.py build_ext --inplace - -Building the documentation -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -So how do you build the docs? Navigate to your local folder -``pandas/doc/`` directory in the console and run:: - - python make.py html - -And then you can find the html output in the folder ``pandas/doc/build/html/``. - -The first time it will take quite a while, because it has to run all the code -examples in the documentation and build all generated docstring pages. -In subsequent evocations, sphinx will try to only build the pages that have -been modified. 
- -If you want to do a full clean build, do:: - - python make.py clean - python make.py build - - -Starting with 0.13.1 you can tell ``make.py`` to compile only a single section -of the docs, greatly reducing the turn-around time for checking your changes. -You will be prompted to delete `.rst` files that aren't required, since the -last committed version can always be restored from git. - -:: - - #omit autosummary and API section - python make.py clean - python make.py --no-api - - # compile the docs with only a single - # section, that which is in indexing.rst - python make.py clean - python make.py --single indexing - -For comparison, a full doc build may take 10 minutes. a ``-no-api`` build -may take 3 minutes and a single section may take 15 seconds. - -Where to start? ---------------- - -There are a number of issues listed under `Docs -`_ -and `Good as first PR -`_ -where you could start out. - -Or maybe you have an idea of your own, by using pandas, looking for something -in the documentation and thinking 'this can be improved', let's do something -about that! - -Feel free to ask questions on `mailing list -`_ or submit an -issue on Github. +See `contributing.rst `_ in this repo. diff --git a/doc/_templates/api_redirect.html b/doc/_templates/api_redirect.html index 24bdd8363830f..c04a8b58ce544 100644 --- a/doc/_templates/api_redirect.html +++ b/doc/_templates/api_redirect.html @@ -1,15 +1,10 @@ -{% set pgn = pagename.split('.') -%} -{% if pgn[-2][0].isupper() -%} - {% set redirect = ["pandas", pgn[-2], pgn[-1], 'html']|join('.') -%} -{% else -%} - {% set redirect = ["pandas", pgn[-1], 'html']|join('.') -%} -{% endif -%} +{% set redirect = redirects[pagename.split("/")[-1]] %} - + This API page has moved -

This API page has moved here.


This API page has moved here.

- \ No newline at end of file + diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet.pdf b/doc/cheatsheet/Pandas_Cheat_Sheet.pdf index d504926d22580..48da05d053b96 100644 Binary files a/doc/cheatsheet/Pandas_Cheat_Sheet.pdf and b/doc/cheatsheet/Pandas_Cheat_Sheet.pdf differ diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet.pptx b/doc/cheatsheet/Pandas_Cheat_Sheet.pptx index 76ae8f1e39d4e..039b3898fa301 100644 Binary files a/doc/cheatsheet/Pandas_Cheat_Sheet.pptx and b/doc/cheatsheet/Pandas_Cheat_Sheet.pptx differ diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pdf b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pdf new file mode 100644 index 0000000000000..cf1e40e627f33 Binary files /dev/null and b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pdf differ diff --git a/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pptx b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pptx new file mode 100644 index 0000000000000..564d92ddbb56a Binary files /dev/null and b/doc/cheatsheet/Pandas_Cheat_Sheet_JA.pptx differ diff --git a/doc/cheatsheet/README.txt b/doc/cheatsheet/README.txt index e2f6ec042e9cc..d32fe5bcd05a6 100644 --- a/doc/cheatsheet/README.txt +++ b/doc/cheatsheet/README.txt @@ -2,3 +2,7 @@ The Pandas Cheat Sheet was created using Microsoft Powerpoint 2013. To create the PDF version, within Powerpoint, simply do a "Save As" and pick "PDF' as the format. +This cheat sheet was inspired by the RstudioData Wrangling Cheatsheet[1], written by Irv Lustig, Princeton Consultants[2]. + +[1]: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf +[2]: http://www.princetonoptimization.com/ diff --git a/doc/make.py b/doc/make.py index 30cd2ad8b61c9..6ffbd3ef86e68 100755 --- a/doc/make.py +++ b/doc/make.py @@ -1,476 +1,341 @@ #!/usr/bin/env python - """ Python script for building documentation. To build the docs you must have all optional dependencies for pandas installed. See the installation instructions for a list of these. 
-Note: currently latex builds do not work because of table formats that are not -supported in the latex generation. - -2014-01-30: Latex has some issues but 'latex_forced' works ok for 0.13.0-400 or so - Usage ----- -python make.py clean -python make.py html + $ python make.py clean + $ python make.py html + $ python make.py latex """ -from __future__ import print_function - -import io -import glob # noqa +import importlib +import sys import os import shutil -import sys -from contextlib import contextmanager - -import sphinx # noqa +import csv +import subprocess import argparse -import jinja2 # noqa - -os.environ['PYTHONPATH'] = '..' - -SPHINX_BUILD = 'sphinxbuild' - - -def upload_dev(user='pandas'): - 'push a copy to the pydata dev directory' - if os.system('cd build/html; rsync -avz . {0}@pandas.pydata.org' - ':/usr/share/nginx/pandas/pandas-docs/dev/ -essh'.format(user)): - raise SystemExit('Upload to Pydata Dev failed') - - -def upload_dev_pdf(user='pandas'): - 'push a copy to the pydata dev directory' - if os.system('cd build/latex; scp pandas.pdf {0}@pandas.pydata.org' - ':/usr/share/nginx/pandas/pandas-docs/dev/'.format(user)): - raise SystemExit('PDF upload to Pydata Dev failed') - - -def upload_stable(user='pandas'): - 'push a copy to the pydata stable directory' - if os.system('cd build/html; rsync -avz . 
{0}@pandas.pydata.org' - ':/usr/share/nginx/pandas/pandas-docs/stable/ -essh'.format(user)): - raise SystemExit('Upload to stable failed') - - -def upload_stable_pdf(user='pandas'): - 'push a copy to the pydata dev directory' - if os.system('cd build/latex; scp pandas.pdf {0}@pandas.pydata.org' - ':/usr/share/nginx/pandas/pandas-docs/stable/'.format(user)): - raise SystemExit('PDF upload to stable failed') - - -def upload_prev(ver, doc_root='./', user='pandas'): - 'push a copy of older release to appropriate version directory' - local_dir = doc_root + 'build/html' - remote_dir = '/usr/share/nginx/pandas/pandas-docs/version/%s/' % ver - cmd = 'cd %s; rsync -avz . %s@pandas.pydata.org:%s -essh' - cmd = cmd % (local_dir, user, remote_dir) - print(cmd) - if os.system(cmd): - raise SystemExit( - 'Upload to %s from %s failed' % (remote_dir, local_dir)) - - local_dir = doc_root + 'build/latex' - pdf_cmd = 'cd %s; scp pandas.pdf %s@pandas.pydata.org:%s' - pdf_cmd = pdf_cmd % (local_dir, user, remote_dir) - if os.system(pdf_cmd): - raise SystemExit('Upload PDF to %s from %s failed' % (ver, doc_root)) - -def build_pandas(): - os.chdir('..') - os.system('python setup.py clean') - os.system('python setup.py build_ext --inplace') - os.chdir('doc') - -def build_prev(ver): - if os.system('git checkout v%s' % ver) != 1: - os.chdir('..') - os.system('python setup.py clean') - os.system('python setup.py build_ext --inplace') - os.chdir('doc') - os.system('python make.py clean') - os.system('python make.py html') - os.system('python make.py latex') - os.system('git checkout master') - - -def clean(): - if os.path.exists('build'): - shutil.rmtree('build') - - if os.path.exists('source/generated'): - shutil.rmtree('source/generated') - +import webbrowser +import docutils +import docutils.parsers.rst -@contextmanager -def cleanup_nb(nb): - try: - yield - finally: - try: - os.remove(nb + '.executed') - except OSError: - pass +DOC_PATH = os.path.dirname(os.path.abspath(__file__)) 
+SOURCE_PATH = os.path.join(DOC_PATH, 'source') +BUILD_PATH = os.path.join(DOC_PATH, 'build') +REDIRECTS_FILE = os.path.join(DOC_PATH, 'redirects.csv') -def get_kernel(): - """Find the kernel name for your python version""" - return 'python%s' % sys.version_info.major - -def execute_nb(src, dst, allow_errors=False, timeout=1000, kernel_name=''): - """ - Execute notebook in `src` and write the output to `dst` - - Parameters - ---------- - src, dst: str - path to notebook - allow_errors: bool - timeout: int - kernel_name: str - defualts to value set in notebook metadata - - Returns - ------- - dst: str +class DocBuilder: """ - import nbformat - from nbconvert.preprocessors import ExecutePreprocessor - - with io.open(src, encoding='utf-8') as f: - nb = nbformat.read(f, as_version=4) - - ep = ExecutePreprocessor(allow_errors=allow_errors, - timeout=timeout, - kernel_name=kernel_name) - ep.preprocess(nb, resources={}) + Class to wrap the different commands of this script. - with io.open(dst, 'wt', encoding='utf-8') as f: - nbformat.write(nb, f) - return dst - - -def convert_nb(src, dst, to='html', template_file='basic'): - """ - Convert a notebook `src`. - - Parameters - ---------- - src, dst: str - filepaths - to: {'rst', 'html'} - format to export to - template_file: str - name of template file to use. Default 'basic' + All public methods of this class can be called as parameters of the + script. 
""" - from nbconvert import HTMLExporter, RSTExporter - - dispatch = {'rst': RSTExporter, 'html': HTMLExporter} - exporter = dispatch[to.lower()](template_file=template_file) - - (body, resources) = exporter.from_filename(src) - with io.open(dst, 'wt', encoding='utf-8') as f: - f.write(body) - return dst - - -def html(): - check_build() - - notebooks = [ - 'source/html-styling.ipynb', - ] + def __init__(self, num_jobs=0, include_api=True, single_doc=None, + verbosity=0, warnings_are_errors=False): + self.num_jobs = num_jobs + self.verbosity = verbosity + self.warnings_are_errors = warnings_are_errors + + if single_doc: + single_doc = self._process_single_doc(single_doc) + include_api = False + os.environ['SPHINX_PATTERN'] = single_doc + elif not include_api: + os.environ['SPHINX_PATTERN'] = '-api' + + self.single_doc_html = None + if single_doc and single_doc.endswith('.rst'): + self.single_doc_html = os.path.splitext(single_doc)[0] + '.html' + elif single_doc: + self.single_doc_html = 'reference/api/pandas.{}.html'.format( + single_doc) + + def _process_single_doc(self, single_doc): + """ + Make sure the provided value for --single is a path to an existing + .rst/.ipynb file, or a pandas object that can be imported. + + For example, categorial.rst or pandas.DataFrame.head. For the latter, + return the corresponding file path + (e.g. reference/api/pandas.DataFrame.head.rst). 
+ """ + base_name, extension = os.path.splitext(single_doc) + if extension in ('.rst', '.ipynb'): + if os.path.exists(os.path.join(SOURCE_PATH, single_doc)): + return single_doc + else: + raise FileNotFoundError('File {} not found'.format(single_doc)) - for nb in notebooks: - with cleanup_nb(nb): + elif single_doc.startswith('pandas.'): try: - print("Converting %s" % nb) - kernel_name = get_kernel() - executed = execute_nb(nb, nb + '.executed', allow_errors=True, - kernel_name=kernel_name) - convert_nb(executed, nb.rstrip('.ipynb') + '.html') - except (ImportError, IndexError) as e: - print(e) - print("Failed to convert %s" % nb) - - if os.system('sphinx-build -P -b html -d build/doctrees ' - 'source build/html'): - raise SystemExit("Building HTML failed.") - try: - # remove stale file - os.remove('source/html-styling.html') - os.remove('build/html/pandas.zip') - except: - pass - - -def zip_html(): - try: - print("\nZipping up HTML docs...") - # just in case the wonky build box doesn't have zip - # don't fail this. - os.system('cd build; rm -f html/pandas.zip; zip html/pandas.zip -r -q html/* ') - print("\n") - except: - pass - -def latex(): - check_build() - if sys.platform != 'win32': - # LaTeX format. - if os.system('sphinx-build -j 2 -b latex -d build/doctrees ' - 'source build/latex'): - raise SystemExit("Building LaTeX failed.") - # Produce pdf. - - os.chdir('build/latex') - - # Call the makefile produced by sphinx... - if os.system('make'): - print("Rendering LaTeX failed.") - print("You may still be able to get a usable PDF file by going into 'build/latex'") - print("and executing 'pdflatex pandas.tex' for the requisite number of passes.") - print("Or using the 'latex_forced' target") - raise SystemExit - - os.chdir('../..') - else: - print('latex build has not been tested on windows') - -def latex_forced(): - check_build() - if sys.platform != 'win32': - # LaTeX format. 
- if os.system('sphinx-build -j 2 -b latex -d build/doctrees ' - 'source build/latex'): - raise SystemExit("Building LaTeX failed.") - # Produce pdf. - - os.chdir('build/latex') - - # Manually call pdflatex, 3 passes should ensure latex fixes up - # all the required cross-references and such. - os.system('pdflatex -interaction=nonstopmode pandas.tex') - os.system('pdflatex -interaction=nonstopmode pandas.tex') - os.system('pdflatex -interaction=nonstopmode pandas.tex') - raise SystemExit("You should check the file 'build/latex/pandas.pdf' for problems.") - - os.chdir('../..') - else: - print('latex build has not been tested on windows') - - -def check_build(): - build_dirs = [ - 'build', 'build/doctrees', 'build/html', - 'build/latex', 'build/plots', 'build/_static', - 'build/_templates'] - for d in build_dirs: - try: - os.mkdir(d) - except OSError: - pass - - -def all(): - # clean() - html() - - -def auto_dev_build(debug=False): - msg = '' - try: - step = 'clean' - clean() - step = 'html' - html() - step = 'upload dev' - upload_dev() - if not debug: - sendmail(step) - - step = 'latex' - latex() - step = 'upload pdf' - upload_dev_pdf() - if not debug: - sendmail(step) - except (Exception, SystemExit) as inst: - msg = str(inst) + '\n' - sendmail(step, '[ERROR] ' + msg) - - -def sendmail(step=None, err_msg=None): - from_name, to_name = _get_config() - - if step is None: - step = '' - - if err_msg is None or '[ERROR]' not in err_msg: - msgstr = 'Daily docs %s completed successfully' % step - subject = "DOC: %s successful" % step - else: - msgstr = err_msg - subject = "DOC: %s failed" % step - - import smtplib - from email.MIMEText import MIMEText - msg = MIMEText(msgstr) - msg['Subject'] = subject - msg['From'] = from_name - msg['To'] = to_name - - server_str, port, login, pwd = _get_credentials() - server = smtplib.SMTP(server_str, port) - server.ehlo() - server.starttls() - server.ehlo() - - server.login(login, pwd) - try: - server.sendmail(from_name, to_name, 
msg.as_string()) - finally: - server.close() - - -def _get_dir(subdir=None): - import getpass - USERNAME = getpass.getuser() - if sys.platform == 'darwin': - HOME = '/Users/%s' % USERNAME - else: - HOME = '/home/%s' % USERNAME - - if subdir is None: - subdir = '/code/scripts/config' - conf_dir = '%s/%s' % (HOME, subdir) - return conf_dir - - -def _get_credentials(): - tmp_dir = _get_dir() - cred = '%s/credentials' % tmp_dir - with open(cred, 'r') as fh: - server, port, un, domain = fh.read().split(',') - port = int(port) - login = un + '@' + domain + '.com' - - import base64 - with open('%s/cron_email_pwd' % tmp_dir, 'r') as fh: - pwd = base64.b64decode(fh.read()) - - return server, port, login, pwd - - -def _get_config(): - tmp_dir = _get_dir() - with open('%s/addresses' % tmp_dir, 'r') as fh: - from_name, to_name = fh.read().split(',') - return from_name, to_name - -funcd = { - 'html': html, - 'zip_html': zip_html, - 'upload_dev': upload_dev, - 'upload_stable': upload_stable, - 'upload_dev_pdf': upload_dev_pdf, - 'upload_stable_pdf': upload_stable_pdf, - 'latex': latex, - 'latex_forced': latex_forced, - 'clean': clean, - 'auto_dev': auto_dev_build, - 'auto_debug': lambda: auto_dev_build(True), - 'build_pandas': build_pandas, - 'all': all, -} - -small_docs = False - -# current_dir = os.getcwd() -# os.chdir(os.path.dirname(os.path.join(current_dir, __file__))) - -import argparse -argparser = argparse.ArgumentParser(description=""" -pandas documentation builder -""".strip()) - -# argparser.add_argument('-arg_name', '--arg_name', -# metavar='label for arg help', -# type=str|etc, -# nargs='N|*|?|+|argparse.REMAINDER', -# required=False, -# #choices='abc', -# help='help string', -# action='store|store_true') - -# args = argparser.parse_args() - -#print args.accumulate(args.integers) - -def generate_index(api=True, single=False, **kwds): - from jinja2 import Template - with open("source/index.rst.template") as f: - t = Template(f.read()) - - with 
open("source/index.rst","w") as f: - f.write(t.render(api=api,single=single,**kwds)) + obj = pandas # noqa: F821 + for name in single_doc.split('.'): + obj = getattr(obj, name) + except AttributeError: + raise ImportError('Could not import {}'.format(single_doc)) + else: + return single_doc[len('pandas.'):] + else: + raise ValueError(('--single={} not understood. Value should be a ' + 'valid path to a .rst or .ipynb file, or a ' + 'valid pandas object (e.g. categorical.rst or ' + 'pandas.DataFrame.head)').format(single_doc)) + + @staticmethod + def _run_os(*args): + """ + Execute a command as a OS terminal. + + Parameters + ---------- + *args : list of str + Command and parameters to be executed + + Examples + -------- + >>> DocBuilder()._run_os('python', '--version') + """ + subprocess.check_call(args, stdout=sys.stdout, stderr=sys.stderr) + + def _sphinx_build(self, kind): + """ + Call sphinx to build documentation. + + Attribute `num_jobs` from the class is used. + + Parameters + ---------- + kind : {'html', 'latex'} + + Examples + -------- + >>> DocBuilder(num_jobs=4)._sphinx_build('html') + """ + if kind not in ('html', 'latex'): + raise ValueError('kind must be html or latex, ' + 'not {}'.format(kind)) + + cmd = ['sphinx-build', '-b', kind] + if self.num_jobs: + cmd += ['-j', str(self.num_jobs)] + if self.warnings_are_errors: + cmd += ['-W', '--keep-going'] + if self.verbosity: + cmd.append('-{}'.format('v' * self.verbosity)) + cmd += ['-d', os.path.join(BUILD_PATH, 'doctrees'), + SOURCE_PATH, os.path.join(BUILD_PATH, kind)] + return subprocess.call(cmd) + + def _open_browser(self, single_doc_html): + """ + Open a browser tab showing single + """ + url = os.path.join('file://', DOC_PATH, 'build', 'html', + single_doc_html) + webbrowser.open(url, new=2) + + def _get_page_title(self, page): + """ + Open the rst file `page` and extract its title. 
+ """ + fname = os.path.join(SOURCE_PATH, '{}.rst'.format(page)) + option_parser = docutils.frontend.OptionParser( + components=(docutils.parsers.rst.Parser,)) + doc = docutils.utils.new_document( + '', + option_parser.get_default_values()) + with open(fname) as f: + data = f.read() + + parser = docutils.parsers.rst.Parser() + # do not generate any warning when parsing the rst + with open(os.devnull, 'a') as f: + doc.reporter.stream = f + parser.parse(data, doc) + + section = next(node for node in doc.children + if isinstance(node, docutils.nodes.section)) + title = next(node for node in section.children + if isinstance(node, docutils.nodes.title)) + + return title.astext() + + def _add_redirects(self): + """ + Create in the build directory an html file with a redirect, + for every row in REDIRECTS_FILE. + """ + html = ''' + + + + + +

+ The page has been moved to <a href="{url}">{title}</a>

+ + + ''' + with open(REDIRECTS_FILE) as mapping_fd: + reader = csv.reader(mapping_fd) + for row in reader: + if not row or row[0].strip().startswith('#'): + continue + + path = os.path.join(BUILD_PATH, + 'html', + *row[0].split('/')) + '.html' + + try: + title = self._get_page_title(row[1]) + except Exception: + # the file can be an ipynb and not an rst, or docutils + # may not be able to read the rst because it has some + # sphinx specific stuff + title = 'this page' + + if os.path.exists(path): + raise RuntimeError(( + 'Redirection would overwrite an existing file: ' + '{}').format(path)) + + with open(path, 'w') as moved_page_fd: + moved_page_fd.write( + html.format(url='{}.html'.format(row[1]), + title=title)) + + def html(self): + """ + Build HTML documentation. + """ + ret_code = self._sphinx_build('html') + zip_fname = os.path.join(BUILD_PATH, 'html', 'pandas.zip') + if os.path.exists(zip_fname): + os.remove(zip_fname) + + if self.single_doc_html is not None: + self._open_browser(self.single_doc_html) + else: + self._add_redirects() + return ret_code + + def latex(self, force=False): + """ + Build PDF documentation. + """ + if sys.platform == 'win32': + sys.stderr.write('latex build has not been tested on windows\n') + else: + ret_code = self._sphinx_build('latex') + os.chdir(os.path.join(BUILD_PATH, 'latex')) + if force: + for i in range(3): + self._run_os('pdflatex', + '-interaction=nonstopmode', + 'pandas.tex') + raise SystemExit('You should check the file ' + '"build/latex/pandas.pdf" for problems.') + else: + self._run_os('make') + return ret_code + + def latex_forced(self): + """ + Build PDF documentation with retries to find missing references. + """ + return self.latex(force=True) + + @staticmethod + def clean(): + """ + Clean documentation generated files. 
+ """ + shutil.rmtree(BUILD_PATH, ignore_errors=True) + shutil.rmtree(os.path.join(SOURCE_PATH, 'reference', 'api'), + ignore_errors=True) + + def zip_html(self): + """ + Compress HTML documentation into a zip file. + """ + zip_fname = os.path.join(BUILD_PATH, 'html', 'pandas.zip') + if os.path.exists(zip_fname): + os.remove(zip_fname) + dirname = os.path.join(BUILD_PATH, 'html') + fnames = os.listdir(dirname) + os.chdir(dirname) + self._run_os('zip', + zip_fname, + '-r', + '-q', + *fnames) -import argparse -argparser = argparse.ArgumentParser(description="pandas documentation builder", - epilog="Targets : %s" % funcd.keys()) - -argparser.add_argument('--no-api', - default=False, - help='Ommit api and autosummary', - action='store_true') -argparser.add_argument('--single', - metavar='FILENAME', - type=str, - default=False, - help='filename of section to compile, e.g. "indexing"') -argparser.add_argument('--user', - type=str, - default=False, - help='Username to connect to the pydata server') def main(): - args, unknown = argparser.parse_known_args() - sys.argv = [sys.argv[0]] + unknown - if args.single: - args.single = os.path.basename(args.single).split(".rst")[0] - - if 'clean' in unknown: - args.single=False - - generate_index(api=not args.no_api and not args.single, single=args.single) - - if len(sys.argv) > 2: - ftype = sys.argv[1] - ver = sys.argv[2] - - if ftype == 'build_previous': - build_prev(ver, user=args.user) - if ftype == 'upload_previous': - upload_prev(ver, user=args.user) - elif len(sys.argv) == 2: - for arg in sys.argv[1:]: - func = funcd.get(arg) - if func is None: - raise SystemExit('Do not know how to handle %s; valid args are %s' % ( - arg, list(funcd.keys()))) - if args.user: - func(user=args.user) - else: - func() - else: - small_docs = False - all() -# os.chdir(current_dir) + cmds = [method for method in dir(DocBuilder) if not method.startswith('_')] + + argparser = argparse.ArgumentParser( + description='pandas documentation builder', + 
epilog='Commands: {}'.format(','.join(cmds))) + argparser.add_argument('command', + nargs='?', + default='html', + help='command to run: {}'.format(', '.join(cmds))) + argparser.add_argument('--num-jobs', + type=int, + default=0, + help='number of jobs used by sphinx-build') + argparser.add_argument('--no-api', + default=False, + help='omit api and autosummary', + action='store_true') + argparser.add_argument('--single', + metavar='FILENAME', + type=str, + default=None, + help=('filename (relative to the "source" folder)' + ' of section or method name to compile, e.g. ' + '"development/contributing.rst",' + ' "ecosystem.rst", "pandas.DataFrame.join"')) + argparser.add_argument('--python-path', + type=str, + default=os.path.dirname(DOC_PATH), + help='path') + argparser.add_argument('-v', action='count', dest='verbosity', default=0, + help=('increase verbosity (can be repeated), ' + 'passed to the sphinx build command')) + argparser.add_argument('--warnings-are-errors', '-W', + action='store_true', + help='fail if warnings are raised') + args = argparser.parse_args() + + if args.command not in cmds: + raise ValueError('Unknown command {}. Available options: {}'.format( + args.command, ', '.join(cmds))) + + # Below we update both os.environ and sys.path. The former is used by + # external libraries (namely Sphinx) to compile this module and resolve + # the import of `python_path` correctly. The latter is used to resolve + # the import within the module, injecting it into the global namespace + os.environ['PYTHONPATH'] = args.python_path + sys.path.insert(0, args.python_path) + globals()['pandas'] = importlib.import_module('pandas') + + # Set the matplotlib backend to the non-interactive Agg backend for all + # child processes. 
+ os.environ['MPLBACKEND'] = 'module://matplotlib.backends.backend_agg' + + builder = DocBuilder(args.num_jobs, not args.no_api, args.single, + args.verbosity, args.warnings_are_errors) + return getattr(builder, args.command)() + if __name__ == '__main__': - import sys sys.exit(main()) diff --git a/doc/plots/stats/moment_plots.py b/doc/plots/stats/moment_plots.py deleted file mode 100644 index 9e3a902592c6b..0000000000000 --- a/doc/plots/stats/moment_plots.py +++ /dev/null @@ -1,30 +0,0 @@ -import numpy as np - -import matplotlib.pyplot as plt -import pandas.util.testing as t -import pandas.stats.moments as m - - -def test_series(n=1000): - t.N = n - s = t.makeTimeSeries() - return s - - -def plot_timeseries(*args, **kwds): - n = len(args) - - fig, axes = plt.subplots(n, 1, figsize=kwds.get('size', (10, 5)), - sharex=True) - titles = kwds.get('titles', None) - - for k in range(1, n + 1): - ax = axes[k - 1] - ts = args[k - 1] - ax.plot(ts.index, ts.values) - - if titles: - ax.set_title(titles[k - 1]) - - fig.autofmt_xdate() - fig.subplots_adjust(bottom=0.10, top=0.95) diff --git a/doc/plots/stats/moments_ewma.py b/doc/plots/stats/moments_ewma.py deleted file mode 100644 index 3e521ed60bb8f..0000000000000 --- a/doc/plots/stats/moments_ewma.py +++ /dev/null @@ -1,15 +0,0 @@ -import matplotlib.pyplot as plt -import pandas.util.testing as t -import pandas.stats.moments as m - -t.N = 200 -s = t.makeTimeSeries().cumsum() - -plt.figure(figsize=(10, 5)) -plt.plot(s.index, s.values) -plt.plot(s.index, m.ewma(s, 20, min_periods=1).values) -f = plt.gcf() -f.autofmt_xdate() - -plt.show() -plt.close('all') diff --git a/doc/plots/stats/moments_ewmvol.py b/doc/plots/stats/moments_ewmvol.py deleted file mode 100644 index 093f62868fc4e..0000000000000 --- a/doc/plots/stats/moments_ewmvol.py +++ /dev/null @@ -1,23 +0,0 @@ -import matplotlib.pyplot as plt -import pandas.util.testing as t -import pandas.stats.moments as m - -t.N = 500 -ts = t.makeTimeSeries() -ts[::100] = 20 - -s = 
ts.cumsum() - - -plt.figure(figsize=(10, 5)) -plt.plot(s.index, m.ewmvol(s, span=50, min_periods=1).values, color='b') -plt.plot(s.index, m.rolling_std(s, 50, min_periods=1).values, color='r') - -plt.title('Exp-weighted std with shocks') -plt.legend(('Exp-weighted', 'Equal-weighted')) - -f = plt.gcf() -f.autofmt_xdate() - -plt.show() -plt.close('all') diff --git a/doc/plots/stats/moments_expw.py b/doc/plots/stats/moments_expw.py deleted file mode 100644 index 5fff419b3a940..0000000000000 --- a/doc/plots/stats/moments_expw.py +++ /dev/null @@ -1,35 +0,0 @@ -from moment_plots import * - -np.random.seed(1) - -ts = test_series(500) * 10 - -# ts[::100] = 20 - -s = ts.cumsum() - -fig, axes = plt.subplots(3, 1, figsize=(8, 10), sharex=True) - -ax0, ax1, ax2 = axes - -ax0.plot(s.index, s.values) -ax0.set_title('time series') - -ax1.plot(s.index, m.ewma(s, span=50, min_periods=1).values, color='b') -ax1.plot(s.index, m.rolling_mean(s, 50, min_periods=1).values, color='r') -ax1.set_title('rolling_mean vs. ewma') - -line1 = ax2.plot( - s.index, m.ewmstd(s, span=50, min_periods=1).values, color='b') -line2 = ax2.plot( - s.index, m.rolling_std(s, 50, min_periods=1).values, color='r') -ax2.set_title('rolling_std vs. 
ewmstd') - -fig.legend((line1, line2), - ('Exp-weighted', 'Equal-weighted'), - loc='upper right') -fig.autofmt_xdate() -fig.subplots_adjust(bottom=0.10, top=0.95) - -plt.show() -plt.close('all') diff --git a/doc/plots/stats/moments_rolling.py b/doc/plots/stats/moments_rolling.py deleted file mode 100644 index 30a6c5f53e20c..0000000000000 --- a/doc/plots/stats/moments_rolling.py +++ /dev/null @@ -1,24 +0,0 @@ -from moment_plots import * - -ts = test_series() -s = ts.cumsum() - -s[20:50] = np.NaN -s[120:150] = np.NaN -plot_timeseries(s, - m.rolling_count(s, 50), - m.rolling_sum(s, 50, min_periods=10), - m.rolling_mean(s, 50, min_periods=10), - m.rolling_std(s, 50, min_periods=10), - m.rolling_skew(s, 50, min_periods=10), - m.rolling_kurt(s, 50, min_periods=10), - size=(10, 12), - titles=('time series', - 'rolling_count', - 'rolling_sum', - 'rolling_mean', - 'rolling_std', - 'rolling_skew', - 'rolling_kurt')) -plt.show() -plt.close('all') diff --git a/doc/plots/stats/moments_rolling_binary.py b/doc/plots/stats/moments_rolling_binary.py deleted file mode 100644 index ab6b7b1c8ff49..0000000000000 --- a/doc/plots/stats/moments_rolling_binary.py +++ /dev/null @@ -1,30 +0,0 @@ -from moment_plots import * - -np.random.seed(1) - -ts = test_series() -s = ts.cumsum() -ts2 = test_series() -s2 = ts2.cumsum() - -s[20:50] = np.NaN -s[120:150] = np.NaN -fig, axes = plt.subplots(3, 1, figsize=(8, 10), sharex=True) - -ax0, ax1, ax2 = axes - -ax0.plot(s.index, s.values) -ax0.plot(s2.index, s2.values) -ax0.set_title('time series') - -ax1.plot(s.index, m.rolling_corr(s, s2, 50, min_periods=1).values) -ax1.set_title('rolling_corr') - -ax2.plot(s.index, m.rolling_cov(s, s2, 50, min_periods=1).values) -ax2.set_title('rolling_cov') - -fig.autofmt_xdate() -fig.subplots_adjust(bottom=0.10, top=0.95) - -plt.show() -plt.close('all') diff --git a/doc/redirects.csv b/doc/redirects.csv new file mode 100644 index 0000000000000..a7886779c97d5 --- /dev/null +++ b/doc/redirects.csv @@ -0,0 +1,1581 @@ 
+# This file should contain all the redirects in the documentation +# in the format `<old_path>,<new_path>` + +# whatsnew +whatsnew,whatsnew/index +release,whatsnew/index + +# getting started +10min,getting_started/10min +basics,getting_started/basics +comparison_with_r,getting_started/comparison/comparison_with_r +comparison_with_sql,getting_started/comparison/comparison_with_sql +comparison_with_sas,getting_started/comparison/comparison_with_sas +comparison_with_stata,getting_started/comparison/comparison_with_stata +dsintro,getting_started/dsintro +overview,getting_started/overview +tutorials,getting_started/tutorials + +# user guide +advanced,user_guide/advanced +categorical,user_guide/categorical +computation,user_guide/computation +cookbook,user_guide/cookbook +enhancingperf,user_guide/enhancingperf +gotchas,user_guide/gotchas +groupby,user_guide/groupby +indexing,user_guide/indexing +integer_na,user_guide/integer_na +io,user_guide/io +merging,user_guide/merging +missing_data,user_guide/missing_data +options,user_guide/options +reshaping,user_guide/reshaping +sparse,user_guide/sparse +style,user_guide/style +text,user_guide/text +timedeltas,user_guide/timedeltas +timeseries,user_guide/timeseries +visualization,user_guide/visualization + +# development +contributing,development/contributing +contributing_docstring,development/contributing_docstring +developer,development/developer +extending,development/extending +internals,development/internals + +# api +api,reference/index +generated/pandas.api.extensions.ExtensionArray.argsort,../reference/api/pandas.api.extensions.ExtensionArray.argsort +generated/pandas.api.extensions.ExtensionArray.astype,../reference/api/pandas.api.extensions.ExtensionArray.astype +generated/pandas.api.extensions.ExtensionArray.copy,../reference/api/pandas.api.extensions.ExtensionArray.copy +generated/pandas.api.extensions.ExtensionArray.dropna,../reference/api/pandas.api.extensions.ExtensionArray.dropna 
+generated/pandas.api.extensions.ExtensionArray.dtype,../reference/api/pandas.api.extensions.ExtensionArray.dtype +generated/pandas.api.extensions.ExtensionArray.factorize,../reference/api/pandas.api.extensions.ExtensionArray.factorize +generated/pandas.api.extensions.ExtensionArray.fillna,../reference/api/pandas.api.extensions.ExtensionArray.fillna +generated/pandas.api.extensions.ExtensionArray,../reference/api/pandas.api.extensions.ExtensionArray +generated/pandas.api.extensions.ExtensionArray.isna,../reference/api/pandas.api.extensions.ExtensionArray.isna +generated/pandas.api.extensions.ExtensionArray.nbytes,../reference/api/pandas.api.extensions.ExtensionArray.nbytes +generated/pandas.api.extensions.ExtensionArray.ndim,../reference/api/pandas.api.extensions.ExtensionArray.ndim +generated/pandas.api.extensions.ExtensionArray.shape,../reference/api/pandas.api.extensions.ExtensionArray.shape +generated/pandas.api.extensions.ExtensionArray.take,../reference/api/pandas.api.extensions.ExtensionArray.take +generated/pandas.api.extensions.ExtensionArray.unique,../reference/api/pandas.api.extensions.ExtensionArray.unique +generated/pandas.api.extensions.ExtensionDtype.construct_array_type,../reference/api/pandas.api.extensions.ExtensionDtype.construct_array_type +generated/pandas.api.extensions.ExtensionDtype.construct_from_string,../reference/api/pandas.api.extensions.ExtensionDtype.construct_from_string +generated/pandas.api.extensions.ExtensionDtype,../reference/api/pandas.api.extensions.ExtensionDtype +generated/pandas.api.extensions.ExtensionDtype.is_dtype,../reference/api/pandas.api.extensions.ExtensionDtype.is_dtype +generated/pandas.api.extensions.ExtensionDtype.kind,../reference/api/pandas.api.extensions.ExtensionDtype.kind +generated/pandas.api.extensions.ExtensionDtype.name,../reference/api/pandas.api.extensions.ExtensionDtype.name +generated/pandas.api.extensions.ExtensionDtype.names,../reference/api/pandas.api.extensions.ExtensionDtype.names 
+generated/pandas.api.extensions.ExtensionDtype.na_value,../reference/api/pandas.api.extensions.ExtensionDtype.na_value +generated/pandas.api.extensions.ExtensionDtype.type,../reference/api/pandas.api.extensions.ExtensionDtype.type +generated/pandas.api.extensions.register_dataframe_accessor,../reference/api/pandas.api.extensions.register_dataframe_accessor +generated/pandas.api.extensions.register_extension_dtype,../reference/api/pandas.api.extensions.register_extension_dtype +generated/pandas.api.extensions.register_index_accessor,../reference/api/pandas.api.extensions.register_index_accessor +generated/pandas.api.extensions.register_series_accessor,../reference/api/pandas.api.extensions.register_series_accessor +generated/pandas.api.types.infer_dtype,../reference/api/pandas.api.types.infer_dtype +generated/pandas.api.types.is_bool_dtype,../reference/api/pandas.api.types.is_bool_dtype +generated/pandas.api.types.is_bool,../reference/api/pandas.api.types.is_bool +generated/pandas.api.types.is_categorical_dtype,../reference/api/pandas.api.types.is_categorical_dtype +generated/pandas.api.types.is_categorical,../reference/api/pandas.api.types.is_categorical +generated/pandas.api.types.is_complex_dtype,../reference/api/pandas.api.types.is_complex_dtype +generated/pandas.api.types.is_complex,../reference/api/pandas.api.types.is_complex +generated/pandas.api.types.is_datetime64_any_dtype,../reference/api/pandas.api.types.is_datetime64_any_dtype +generated/pandas.api.types.is_datetime64_dtype,../reference/api/pandas.api.types.is_datetime64_dtype +generated/pandas.api.types.is_datetime64_ns_dtype,../reference/api/pandas.api.types.is_datetime64_ns_dtype +generated/pandas.api.types.is_datetime64tz_dtype,../reference/api/pandas.api.types.is_datetime64tz_dtype +generated/pandas.api.types.is_datetimetz,../reference/api/pandas.api.types.is_datetimetz +generated/pandas.api.types.is_dict_like,../reference/api/pandas.api.types.is_dict_like 
+generated/pandas.api.types.is_extension_array_dtype,../reference/api/pandas.api.types.is_extension_array_dtype +generated/pandas.api.types.is_extension_type,../reference/api/pandas.api.types.is_extension_type +generated/pandas.api.types.is_file_like,../reference/api/pandas.api.types.is_file_like +generated/pandas.api.types.is_float_dtype,../reference/api/pandas.api.types.is_float_dtype +generated/pandas.api.types.is_float,../reference/api/pandas.api.types.is_float +generated/pandas.api.types.is_hashable,../reference/api/pandas.api.types.is_hashable +generated/pandas.api.types.is_int64_dtype,../reference/api/pandas.api.types.is_int64_dtype +generated/pandas.api.types.is_integer_dtype,../reference/api/pandas.api.types.is_integer_dtype +generated/pandas.api.types.is_integer,../reference/api/pandas.api.types.is_integer +generated/pandas.api.types.is_interval_dtype,../reference/api/pandas.api.types.is_interval_dtype +generated/pandas.api.types.is_interval,../reference/api/pandas.api.types.is_interval +generated/pandas.api.types.is_iterator,../reference/api/pandas.api.types.is_iterator +generated/pandas.api.types.is_list_like,../reference/api/pandas.api.types.is_list_like +generated/pandas.api.types.is_named_tuple,../reference/api/pandas.api.types.is_named_tuple +generated/pandas.api.types.is_number,../reference/api/pandas.api.types.is_number +generated/pandas.api.types.is_numeric_dtype,../reference/api/pandas.api.types.is_numeric_dtype +generated/pandas.api.types.is_object_dtype,../reference/api/pandas.api.types.is_object_dtype +generated/pandas.api.types.is_period_dtype,../reference/api/pandas.api.types.is_period_dtype +generated/pandas.api.types.is_period,../reference/api/pandas.api.types.is_period +generated/pandas.api.types.is_re_compilable,../reference/api/pandas.api.types.is_re_compilable +generated/pandas.api.types.is_re,../reference/api/pandas.api.types.is_re +generated/pandas.api.types.is_scalar,../reference/api/pandas.api.types.is_scalar 
+generated/pandas.api.types.is_signed_integer_dtype,../reference/api/pandas.api.types.is_signed_integer_dtype +generated/pandas.api.types.is_sparse,../reference/api/pandas.api.types.is_sparse +generated/pandas.api.types.is_string_dtype,../reference/api/pandas.api.types.is_string_dtype +generated/pandas.api.types.is_timedelta64_dtype,../reference/api/pandas.api.types.is_timedelta64_dtype +generated/pandas.api.types.is_timedelta64_ns_dtype,../reference/api/pandas.api.types.is_timedelta64_ns_dtype +generated/pandas.api.types.is_unsigned_integer_dtype,../reference/api/pandas.api.types.is_unsigned_integer_dtype +generated/pandas.api.types.pandas_dtype,../reference/api/pandas.api.types.pandas_dtype +generated/pandas.api.types.union_categoricals,../reference/api/pandas.api.types.union_categoricals +generated/pandas.bdate_range,../reference/api/pandas.bdate_range +generated/pandas.Categorical.__array__,../reference/api/pandas.Categorical.__array__ +generated/pandas.Categorical.categories,../reference/api/pandas.Categorical.categories +generated/pandas.Categorical.codes,../reference/api/pandas.Categorical.codes +generated/pandas.CategoricalDtype.categories,../reference/api/pandas.CategoricalDtype.categories +generated/pandas.Categorical.dtype,../reference/api/pandas.Categorical.dtype +generated/pandas.CategoricalDtype,../reference/api/pandas.CategoricalDtype +generated/pandas.CategoricalDtype.ordered,../reference/api/pandas.CategoricalDtype.ordered +generated/pandas.Categorical.from_codes,../reference/api/pandas.Categorical.from_codes +generated/pandas.Categorical,../reference/api/pandas.Categorical +generated/pandas.CategoricalIndex.add_categories,../reference/api/pandas.CategoricalIndex.add_categories +generated/pandas.CategoricalIndex.as_ordered,../reference/api/pandas.CategoricalIndex.as_ordered +generated/pandas.CategoricalIndex.as_unordered,../reference/api/pandas.CategoricalIndex.as_unordered 
+generated/pandas.CategoricalIndex.categories,../reference/api/pandas.CategoricalIndex.categories +generated/pandas.CategoricalIndex.codes,../reference/api/pandas.CategoricalIndex.codes +generated/pandas.CategoricalIndex.equals,../reference/api/pandas.CategoricalIndex.equals +generated/pandas.CategoricalIndex,../reference/api/pandas.CategoricalIndex +generated/pandas.CategoricalIndex.map,../reference/api/pandas.CategoricalIndex.map +generated/pandas.CategoricalIndex.ordered,../reference/api/pandas.CategoricalIndex.ordered +generated/pandas.CategoricalIndex.remove_categories,../reference/api/pandas.CategoricalIndex.remove_categories +generated/pandas.CategoricalIndex.remove_unused_categories,../reference/api/pandas.CategoricalIndex.remove_unused_categories +generated/pandas.CategoricalIndex.rename_categories,../reference/api/pandas.CategoricalIndex.rename_categories +generated/pandas.CategoricalIndex.reorder_categories,../reference/api/pandas.CategoricalIndex.reorder_categories +generated/pandas.CategoricalIndex.set_categories,../reference/api/pandas.CategoricalIndex.set_categories +generated/pandas.Categorical.ordered,../reference/api/pandas.Categorical.ordered +generated/pandas.concat,../reference/api/pandas.concat +generated/pandas.core.groupby.DataFrameGroupBy.all,../reference/api/pandas.core.groupby.DataFrameGroupBy.all +generated/pandas.core.groupby.DataFrameGroupBy.any,../reference/api/pandas.core.groupby.DataFrameGroupBy.any +generated/pandas.core.groupby.DataFrameGroupBy.bfill,../reference/api/pandas.core.groupby.DataFrameGroupBy.bfill +generated/pandas.core.groupby.DataFrameGroupBy.boxplot,../reference/api/pandas.core.groupby.DataFrameGroupBy.boxplot +generated/pandas.core.groupby.DataFrameGroupBy.corr,../reference/api/pandas.core.groupby.DataFrameGroupBy.corr +generated/pandas.core.groupby.DataFrameGroupBy.corrwith,../reference/api/pandas.core.groupby.DataFrameGroupBy.corrwith 
+generated/pandas.core.groupby.DataFrameGroupBy.count,../reference/api/pandas.core.groupby.DataFrameGroupBy.count +generated/pandas.core.groupby.DataFrameGroupBy.cov,../reference/api/pandas.core.groupby.DataFrameGroupBy.cov +generated/pandas.core.groupby.DataFrameGroupBy.cummax,../reference/api/pandas.core.groupby.DataFrameGroupBy.cummax +generated/pandas.core.groupby.DataFrameGroupBy.cummin,../reference/api/pandas.core.groupby.DataFrameGroupBy.cummin +generated/pandas.core.groupby.DataFrameGroupBy.cumprod,../reference/api/pandas.core.groupby.DataFrameGroupBy.cumprod +generated/pandas.core.groupby.DataFrameGroupBy.cumsum,../reference/api/pandas.core.groupby.DataFrameGroupBy.cumsum +generated/pandas.core.groupby.DataFrameGroupBy.describe,../reference/api/pandas.core.groupby.DataFrameGroupBy.describe +generated/pandas.core.groupby.DataFrameGroupBy.diff,../reference/api/pandas.core.groupby.DataFrameGroupBy.diff +generated/pandas.core.groupby.DataFrameGroupBy.ffill,../reference/api/pandas.core.groupby.DataFrameGroupBy.ffill +generated/pandas.core.groupby.DataFrameGroupBy.fillna,../reference/api/pandas.core.groupby.DataFrameGroupBy.fillna +generated/pandas.core.groupby.DataFrameGroupBy.filter,../reference/api/pandas.core.groupby.DataFrameGroupBy.filter +generated/pandas.core.groupby.DataFrameGroupBy.hist,../reference/api/pandas.core.groupby.DataFrameGroupBy.hist +generated/pandas.core.groupby.DataFrameGroupBy.idxmax,../reference/api/pandas.core.groupby.DataFrameGroupBy.idxmax +generated/pandas.core.groupby.DataFrameGroupBy.idxmin,../reference/api/pandas.core.groupby.DataFrameGroupBy.idxmin +generated/pandas.core.groupby.DataFrameGroupBy.mad,../reference/api/pandas.core.groupby.DataFrameGroupBy.mad +generated/pandas.core.groupby.DataFrameGroupBy.pct_change,../reference/api/pandas.core.groupby.DataFrameGroupBy.pct_change +generated/pandas.core.groupby.DataFrameGroupBy.plot,../reference/api/pandas.core.groupby.DataFrameGroupBy.plot 
+generated/pandas.core.groupby.DataFrameGroupBy.quantile,../reference/api/pandas.core.groupby.DataFrameGroupBy.quantile +generated/pandas.core.groupby.DataFrameGroupBy.rank,../reference/api/pandas.core.groupby.DataFrameGroupBy.rank +generated/pandas.core.groupby.DataFrameGroupBy.resample,../reference/api/pandas.core.groupby.DataFrameGroupBy.resample +generated/pandas.core.groupby.DataFrameGroupBy.shift,../reference/api/pandas.core.groupby.DataFrameGroupBy.shift +generated/pandas.core.groupby.DataFrameGroupBy.size,../reference/api/pandas.core.groupby.DataFrameGroupBy.size +generated/pandas.core.groupby.DataFrameGroupBy.skew,../reference/api/pandas.core.groupby.DataFrameGroupBy.skew +generated/pandas.core.groupby.DataFrameGroupBy.take,../reference/api/pandas.core.groupby.DataFrameGroupBy.take +generated/pandas.core.groupby.DataFrameGroupBy.tshift,../reference/api/pandas.core.groupby.DataFrameGroupBy.tshift +generated/pandas.core.groupby.GroupBy.agg,../reference/api/pandas.core.groupby.GroupBy.agg +generated/pandas.core.groupby.GroupBy.aggregate,../reference/api/pandas.core.groupby.GroupBy.aggregate +generated/pandas.core.groupby.GroupBy.all,../reference/api/pandas.core.groupby.GroupBy.all +generated/pandas.core.groupby.GroupBy.any,../reference/api/pandas.core.groupby.GroupBy.any +generated/pandas.core.groupby.GroupBy.apply,../reference/api/pandas.core.groupby.GroupBy.apply +generated/pandas.core.groupby.GroupBy.bfill,../reference/api/pandas.core.groupby.GroupBy.bfill +generated/pandas.core.groupby.GroupBy.count,../reference/api/pandas.core.groupby.GroupBy.count +generated/pandas.core.groupby.GroupBy.cumcount,../reference/api/pandas.core.groupby.GroupBy.cumcount +generated/pandas.core.groupby.GroupBy.ffill,../reference/api/pandas.core.groupby.GroupBy.ffill +generated/pandas.core.groupby.GroupBy.first,../reference/api/pandas.core.groupby.GroupBy.first +generated/pandas.core.groupby.GroupBy.get_group,../reference/api/pandas.core.groupby.GroupBy.get_group 
+generated/pandas.core.groupby.GroupBy.groups,../reference/api/pandas.core.groupby.GroupBy.groups +generated/pandas.core.groupby.GroupBy.head,../reference/api/pandas.core.groupby.GroupBy.head +generated/pandas.core.groupby.GroupBy.indices,../reference/api/pandas.core.groupby.GroupBy.indices +generated/pandas.core.groupby.GroupBy.__iter__,../reference/api/pandas.core.groupby.GroupBy.__iter__ +generated/pandas.core.groupby.GroupBy.last,../reference/api/pandas.core.groupby.GroupBy.last +generated/pandas.core.groupby.GroupBy.max,../reference/api/pandas.core.groupby.GroupBy.max +generated/pandas.core.groupby.GroupBy.mean,../reference/api/pandas.core.groupby.GroupBy.mean +generated/pandas.core.groupby.GroupBy.median,../reference/api/pandas.core.groupby.GroupBy.median +generated/pandas.core.groupby.GroupBy.min,../reference/api/pandas.core.groupby.GroupBy.min +generated/pandas.core.groupby.GroupBy.ngroup,../reference/api/pandas.core.groupby.GroupBy.ngroup +generated/pandas.core.groupby.GroupBy.nth,../reference/api/pandas.core.groupby.GroupBy.nth +generated/pandas.core.groupby.GroupBy.ohlc,../reference/api/pandas.core.groupby.GroupBy.ohlc +generated/pandas.core.groupby.GroupBy.pct_change,../reference/api/pandas.core.groupby.GroupBy.pct_change +generated/pandas.core.groupby.GroupBy.pipe,../reference/api/pandas.core.groupby.GroupBy.pipe +generated/pandas.core.groupby.GroupBy.prod,../reference/api/pandas.core.groupby.GroupBy.prod +generated/pandas.core.groupby.GroupBy.rank,../reference/api/pandas.core.groupby.GroupBy.rank +generated/pandas.core.groupby.GroupBy.sem,../reference/api/pandas.core.groupby.GroupBy.sem +generated/pandas.core.groupby.GroupBy.size,../reference/api/pandas.core.groupby.GroupBy.size +generated/pandas.core.groupby.GroupBy.std,../reference/api/pandas.core.groupby.GroupBy.std +generated/pandas.core.groupby.GroupBy.sum,../reference/api/pandas.core.groupby.GroupBy.sum +generated/pandas.core.groupby.GroupBy.tail,../reference/api/pandas.core.groupby.GroupBy.tail 
+generated/pandas.core.groupby.GroupBy.transform,../reference/api/pandas.core.groupby.GroupBy.transform +generated/pandas.core.groupby.GroupBy.var,../reference/api/pandas.core.groupby.GroupBy.var +generated/pandas.core.groupby.SeriesGroupBy.is_monotonic_decreasing,../reference/api/pandas.core.groupby.SeriesGroupBy.is_monotonic_decreasing +generated/pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing,../reference/api/pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing +generated/pandas.core.groupby.SeriesGroupBy.nlargest,../reference/api/pandas.core.groupby.SeriesGroupBy.nlargest +generated/pandas.core.groupby.SeriesGroupBy.nsmallest,../reference/api/pandas.core.groupby.SeriesGroupBy.nsmallest +generated/pandas.core.groupby.SeriesGroupBy.nunique,../reference/api/pandas.core.groupby.SeriesGroupBy.nunique +generated/pandas.core.groupby.SeriesGroupBy.unique,../reference/api/pandas.core.groupby.SeriesGroupBy.unique +generated/pandas.core.groupby.SeriesGroupBy.value_counts,../reference/api/pandas.core.groupby.SeriesGroupBy.value_counts +generated/pandas.core.resample.Resampler.aggregate,../reference/api/pandas.core.resample.Resampler.aggregate +generated/pandas.core.resample.Resampler.apply,../reference/api/pandas.core.resample.Resampler.apply +generated/pandas.core.resample.Resampler.asfreq,../reference/api/pandas.core.resample.Resampler.asfreq +generated/pandas.core.resample.Resampler.backfill,../reference/api/pandas.core.resample.Resampler.backfill +generated/pandas.core.resample.Resampler.bfill,../reference/api/pandas.core.resample.Resampler.bfill +generated/pandas.core.resample.Resampler.count,../reference/api/pandas.core.resample.Resampler.count +generated/pandas.core.resample.Resampler.ffill,../reference/api/pandas.core.resample.Resampler.ffill +generated/pandas.core.resample.Resampler.fillna,../reference/api/pandas.core.resample.Resampler.fillna +generated/pandas.core.resample.Resampler.first,../reference/api/pandas.core.resample.Resampler.first 
+generated/pandas.core.resample.Resampler.get_group,../reference/api/pandas.core.resample.Resampler.get_group +generated/pandas.core.resample.Resampler.groups,../reference/api/pandas.core.resample.Resampler.groups +generated/pandas.core.resample.Resampler.indices,../reference/api/pandas.core.resample.Resampler.indices +generated/pandas.core.resample.Resampler.interpolate,../reference/api/pandas.core.resample.Resampler.interpolate +generated/pandas.core.resample.Resampler.__iter__,../reference/api/pandas.core.resample.Resampler.__iter__ +generated/pandas.core.resample.Resampler.last,../reference/api/pandas.core.resample.Resampler.last +generated/pandas.core.resample.Resampler.max,../reference/api/pandas.core.resample.Resampler.max +generated/pandas.core.resample.Resampler.mean,../reference/api/pandas.core.resample.Resampler.mean +generated/pandas.core.resample.Resampler.median,../reference/api/pandas.core.resample.Resampler.median +generated/pandas.core.resample.Resampler.min,../reference/api/pandas.core.resample.Resampler.min +generated/pandas.core.resample.Resampler.nearest,../reference/api/pandas.core.resample.Resampler.nearest +generated/pandas.core.resample.Resampler.nunique,../reference/api/pandas.core.resample.Resampler.nunique +generated/pandas.core.resample.Resampler.ohlc,../reference/api/pandas.core.resample.Resampler.ohlc +generated/pandas.core.resample.Resampler.pad,../reference/api/pandas.core.resample.Resampler.pad +generated/pandas.core.resample.Resampler.pipe,../reference/api/pandas.core.resample.Resampler.pipe +generated/pandas.core.resample.Resampler.prod,../reference/api/pandas.core.resample.Resampler.prod +generated/pandas.core.resample.Resampler.quantile,../reference/api/pandas.core.resample.Resampler.quantile +generated/pandas.core.resample.Resampler.sem,../reference/api/pandas.core.resample.Resampler.sem +generated/pandas.core.resample.Resampler.size,../reference/api/pandas.core.resample.Resampler.size 
+generated/pandas.core.resample.Resampler.std,../reference/api/pandas.core.resample.Resampler.std +generated/pandas.core.resample.Resampler.sum,../reference/api/pandas.core.resample.Resampler.sum +generated/pandas.core.resample.Resampler.transform,../reference/api/pandas.core.resample.Resampler.transform +generated/pandas.core.resample.Resampler.var,../reference/api/pandas.core.resample.Resampler.var +generated/pandas.core.window.EWM.corr,../reference/api/pandas.core.window.EWM.corr +generated/pandas.core.window.EWM.cov,../reference/api/pandas.core.window.EWM.cov +generated/pandas.core.window.EWM.mean,../reference/api/pandas.core.window.EWM.mean +generated/pandas.core.window.EWM.std,../reference/api/pandas.core.window.EWM.std +generated/pandas.core.window.EWM.var,../reference/api/pandas.core.window.EWM.var +generated/pandas.core.window.Expanding.aggregate,../reference/api/pandas.core.window.Expanding.aggregate +generated/pandas.core.window.Expanding.apply,../reference/api/pandas.core.window.Expanding.apply +generated/pandas.core.window.Expanding.corr,../reference/api/pandas.core.window.Expanding.corr +generated/pandas.core.window.Expanding.count,../reference/api/pandas.core.window.Expanding.count +generated/pandas.core.window.Expanding.cov,../reference/api/pandas.core.window.Expanding.cov +generated/pandas.core.window.Expanding.kurt,../reference/api/pandas.core.window.Expanding.kurt +generated/pandas.core.window.Expanding.max,../reference/api/pandas.core.window.Expanding.max +generated/pandas.core.window.Expanding.mean,../reference/api/pandas.core.window.Expanding.mean +generated/pandas.core.window.Expanding.median,../reference/api/pandas.core.window.Expanding.median +generated/pandas.core.window.Expanding.min,../reference/api/pandas.core.window.Expanding.min +generated/pandas.core.window.Expanding.quantile,../reference/api/pandas.core.window.Expanding.quantile +generated/pandas.core.window.Expanding.skew,../reference/api/pandas.core.window.Expanding.skew 
+generated/pandas.core.window.Expanding.std,../reference/api/pandas.core.window.Expanding.std +generated/pandas.core.window.Expanding.sum,../reference/api/pandas.core.window.Expanding.sum +generated/pandas.core.window.Expanding.var,../reference/api/pandas.core.window.Expanding.var +generated/pandas.core.window.Rolling.aggregate,../reference/api/pandas.core.window.Rolling.aggregate +generated/pandas.core.window.Rolling.apply,../reference/api/pandas.core.window.Rolling.apply +generated/pandas.core.window.Rolling.corr,../reference/api/pandas.core.window.Rolling.corr +generated/pandas.core.window.Rolling.count,../reference/api/pandas.core.window.Rolling.count +generated/pandas.core.window.Rolling.cov,../reference/api/pandas.core.window.Rolling.cov +generated/pandas.core.window.Rolling.kurt,../reference/api/pandas.core.window.Rolling.kurt +generated/pandas.core.window.Rolling.max,../reference/api/pandas.core.window.Rolling.max +generated/pandas.core.window.Rolling.mean,../reference/api/pandas.core.window.Rolling.mean +generated/pandas.core.window.Rolling.median,../reference/api/pandas.core.window.Rolling.median +generated/pandas.core.window.Rolling.min,../reference/api/pandas.core.window.Rolling.min +generated/pandas.core.window.Rolling.quantile,../reference/api/pandas.core.window.Rolling.quantile +generated/pandas.core.window.Rolling.skew,../reference/api/pandas.core.window.Rolling.skew +generated/pandas.core.window.Rolling.std,../reference/api/pandas.core.window.Rolling.std +generated/pandas.core.window.Rolling.sum,../reference/api/pandas.core.window.Rolling.sum +generated/pandas.core.window.Rolling.var,../reference/api/pandas.core.window.Rolling.var +generated/pandas.core.window.Window.mean,../reference/api/pandas.core.window.Window.mean +generated/pandas.core.window.Window.sum,../reference/api/pandas.core.window.Window.sum +generated/pandas.crosstab,../reference/api/pandas.crosstab +generated/pandas.cut,../reference/api/pandas.cut 
+generated/pandas.DataFrame.abs,../reference/api/pandas.DataFrame.abs +generated/pandas.DataFrame.add,../reference/api/pandas.DataFrame.add +generated/pandas.DataFrame.add_prefix,../reference/api/pandas.DataFrame.add_prefix +generated/pandas.DataFrame.add_suffix,../reference/api/pandas.DataFrame.add_suffix +generated/pandas.DataFrame.agg,../reference/api/pandas.DataFrame.agg +generated/pandas.DataFrame.aggregate,../reference/api/pandas.DataFrame.aggregate +generated/pandas.DataFrame.align,../reference/api/pandas.DataFrame.align +generated/pandas.DataFrame.all,../reference/api/pandas.DataFrame.all +generated/pandas.DataFrame.any,../reference/api/pandas.DataFrame.any +generated/pandas.DataFrame.append,../reference/api/pandas.DataFrame.append +generated/pandas.DataFrame.apply,../reference/api/pandas.DataFrame.apply +generated/pandas.DataFrame.applymap,../reference/api/pandas.DataFrame.applymap +generated/pandas.DataFrame.as_blocks,../reference/api/pandas.DataFrame.as_blocks +generated/pandas.DataFrame.asfreq,../reference/api/pandas.DataFrame.asfreq +generated/pandas.DataFrame.as_matrix,../reference/api/pandas.DataFrame.as_matrix +generated/pandas.DataFrame.asof,../reference/api/pandas.DataFrame.asof +generated/pandas.DataFrame.assign,../reference/api/pandas.DataFrame.assign +generated/pandas.DataFrame.astype,../reference/api/pandas.DataFrame.astype +generated/pandas.DataFrame.at,../reference/api/pandas.DataFrame.at +generated/pandas.DataFrame.at_time,../reference/api/pandas.DataFrame.at_time +generated/pandas.DataFrame.axes,../reference/api/pandas.DataFrame.axes +generated/pandas.DataFrame.between_time,../reference/api/pandas.DataFrame.between_time +generated/pandas.DataFrame.bfill,../reference/api/pandas.DataFrame.bfill +generated/pandas.DataFrame.blocks,../reference/api/pandas.DataFrame.blocks +generated/pandas.DataFrame.bool,../reference/api/pandas.DataFrame.bool +generated/pandas.DataFrame.boxplot,../reference/api/pandas.DataFrame.boxplot 
+generated/pandas.DataFrame.clip,../reference/api/pandas.DataFrame.clip +generated/pandas.DataFrame.clip_lower,../reference/api/pandas.DataFrame.clip_lower +generated/pandas.DataFrame.clip_upper,../reference/api/pandas.DataFrame.clip_upper +generated/pandas.DataFrame.columns,../reference/api/pandas.DataFrame.columns +generated/pandas.DataFrame.combine_first,../reference/api/pandas.DataFrame.combine_first +generated/pandas.DataFrame.combine,../reference/api/pandas.DataFrame.combine +generated/pandas.DataFrame.compound,../reference/api/pandas.DataFrame.compound +generated/pandas.DataFrame.convert_objects,../reference/api/pandas.DataFrame.convert_objects +generated/pandas.DataFrame.copy,../reference/api/pandas.DataFrame.copy +generated/pandas.DataFrame.corr,../reference/api/pandas.DataFrame.corr +generated/pandas.DataFrame.corrwith,../reference/api/pandas.DataFrame.corrwith +generated/pandas.DataFrame.count,../reference/api/pandas.DataFrame.count +generated/pandas.DataFrame.cov,../reference/api/pandas.DataFrame.cov +generated/pandas.DataFrame.cummax,../reference/api/pandas.DataFrame.cummax +generated/pandas.DataFrame.cummin,../reference/api/pandas.DataFrame.cummin +generated/pandas.DataFrame.cumprod,../reference/api/pandas.DataFrame.cumprod +generated/pandas.DataFrame.cumsum,../reference/api/pandas.DataFrame.cumsum +generated/pandas.DataFrame.describe,../reference/api/pandas.DataFrame.describe +generated/pandas.DataFrame.diff,../reference/api/pandas.DataFrame.diff +generated/pandas.DataFrame.div,../reference/api/pandas.DataFrame.div +generated/pandas.DataFrame.divide,../reference/api/pandas.DataFrame.divide +generated/pandas.DataFrame.dot,../reference/api/pandas.DataFrame.dot +generated/pandas.DataFrame.drop_duplicates,../reference/api/pandas.DataFrame.drop_duplicates +generated/pandas.DataFrame.drop,../reference/api/pandas.DataFrame.drop +generated/pandas.DataFrame.droplevel,../reference/api/pandas.DataFrame.droplevel 
+generated/pandas.DataFrame.dropna,../reference/api/pandas.DataFrame.dropna
+generated/pandas.DataFrame.dtypes,../reference/api/pandas.DataFrame.dtypes
+generated/pandas.DataFrame.duplicated,../reference/api/pandas.DataFrame.duplicated
+generated/pandas.DataFrame.empty,../reference/api/pandas.DataFrame.empty
+generated/pandas.DataFrame.eq,../reference/api/pandas.DataFrame.eq
+generated/pandas.DataFrame.equals,../reference/api/pandas.DataFrame.equals
+generated/pandas.DataFrame.eval,../reference/api/pandas.DataFrame.eval
+generated/pandas.DataFrame.ewm,../reference/api/pandas.DataFrame.ewm
+generated/pandas.DataFrame.expanding,../reference/api/pandas.DataFrame.expanding
+generated/pandas.DataFrame.ffill,../reference/api/pandas.DataFrame.ffill
+generated/pandas.DataFrame.fillna,../reference/api/pandas.DataFrame.fillna
+generated/pandas.DataFrame.filter,../reference/api/pandas.DataFrame.filter
+generated/pandas.DataFrame.first,../reference/api/pandas.DataFrame.first
+generated/pandas.DataFrame.first_valid_index,../reference/api/pandas.DataFrame.first_valid_index
+generated/pandas.DataFrame.floordiv,../reference/api/pandas.DataFrame.floordiv
+generated/pandas.DataFrame.from_csv,../reference/api/pandas.DataFrame.from_csv
+generated/pandas.DataFrame.from_dict,../reference/api/pandas.DataFrame.from_dict
+generated/pandas.DataFrame.from_items,../reference/api/pandas.DataFrame.from_items
+generated/pandas.DataFrame.from_records,../reference/api/pandas.DataFrame.from_records
+generated/pandas.DataFrame.ftypes,../reference/api/pandas.DataFrame.ftypes
+generated/pandas.DataFrame.ge,../reference/api/pandas.DataFrame.ge
+generated/pandas.DataFrame.get_dtype_counts,../reference/api/pandas.DataFrame.get_dtype_counts
+generated/pandas.DataFrame.get_ftype_counts,../reference/api/pandas.DataFrame.get_ftype_counts
+generated/pandas.DataFrame.get,../reference/api/pandas.DataFrame.get
+generated/pandas.DataFrame.get_value,../reference/api/pandas.DataFrame.get_value
+generated/pandas.DataFrame.get_values,../reference/api/pandas.DataFrame.get_values
+generated/pandas.DataFrame.groupby,../reference/api/pandas.DataFrame.groupby
+generated/pandas.DataFrame.gt,../reference/api/pandas.DataFrame.gt
+generated/pandas.DataFrame.head,../reference/api/pandas.DataFrame.head
+generated/pandas.DataFrame.hist,../reference/api/pandas.DataFrame.hist
+generated/pandas.DataFrame,../reference/api/pandas.DataFrame
+generated/pandas.DataFrame.iat,../reference/api/pandas.DataFrame.iat
+generated/pandas.DataFrame.idxmax,../reference/api/pandas.DataFrame.idxmax
+generated/pandas.DataFrame.idxmin,../reference/api/pandas.DataFrame.idxmin
+generated/pandas.DataFrame.iloc,../reference/api/pandas.DataFrame.iloc
+generated/pandas.DataFrame.index,../reference/api/pandas.DataFrame.index
+generated/pandas.DataFrame.infer_objects,../reference/api/pandas.DataFrame.infer_objects
+generated/pandas.DataFrame.info,../reference/api/pandas.DataFrame.info
+generated/pandas.DataFrame.insert,../reference/api/pandas.DataFrame.insert
+generated/pandas.DataFrame.interpolate,../reference/api/pandas.DataFrame.interpolate
+generated/pandas.DataFrame.is_copy,../reference/api/pandas.DataFrame.is_copy
+generated/pandas.DataFrame.isin,../reference/api/pandas.DataFrame.isin
+generated/pandas.DataFrame.isna,../reference/api/pandas.DataFrame.isna
+generated/pandas.DataFrame.isnull,../reference/api/pandas.DataFrame.isnull
+generated/pandas.DataFrame.items,../reference/api/pandas.DataFrame.items
+generated/pandas.DataFrame.__iter__,../reference/api/pandas.DataFrame.__iter__
+generated/pandas.DataFrame.iteritems,../reference/api/pandas.DataFrame.iteritems
+generated/pandas.DataFrame.iterrows,../reference/api/pandas.DataFrame.iterrows
+generated/pandas.DataFrame.itertuples,../reference/api/pandas.DataFrame.itertuples
+generated/pandas.DataFrame.ix,../reference/api/pandas.DataFrame.ix
+generated/pandas.DataFrame.join,../reference/api/pandas.DataFrame.join
+generated/pandas.DataFrame.keys,../reference/api/pandas.DataFrame.keys
+generated/pandas.DataFrame.kurt,../reference/api/pandas.DataFrame.kurt
+generated/pandas.DataFrame.kurtosis,../reference/api/pandas.DataFrame.kurtosis
+generated/pandas.DataFrame.last,../reference/api/pandas.DataFrame.last
+generated/pandas.DataFrame.last_valid_index,../reference/api/pandas.DataFrame.last_valid_index
+generated/pandas.DataFrame.le,../reference/api/pandas.DataFrame.le
+generated/pandas.DataFrame.loc,../reference/api/pandas.DataFrame.loc
+generated/pandas.DataFrame.lookup,../reference/api/pandas.DataFrame.lookup
+generated/pandas.DataFrame.lt,../reference/api/pandas.DataFrame.lt
+generated/pandas.DataFrame.mad,../reference/api/pandas.DataFrame.mad
+generated/pandas.DataFrame.mask,../reference/api/pandas.DataFrame.mask
+generated/pandas.DataFrame.max,../reference/api/pandas.DataFrame.max
+generated/pandas.DataFrame.mean,../reference/api/pandas.DataFrame.mean
+generated/pandas.DataFrame.median,../reference/api/pandas.DataFrame.median
+generated/pandas.DataFrame.melt,../reference/api/pandas.DataFrame.melt
+generated/pandas.DataFrame.memory_usage,../reference/api/pandas.DataFrame.memory_usage
+generated/pandas.DataFrame.merge,../reference/api/pandas.DataFrame.merge
+generated/pandas.DataFrame.min,../reference/api/pandas.DataFrame.min
+generated/pandas.DataFrame.mode,../reference/api/pandas.DataFrame.mode
+generated/pandas.DataFrame.mod,../reference/api/pandas.DataFrame.mod
+generated/pandas.DataFrame.mul,../reference/api/pandas.DataFrame.mul
+generated/pandas.DataFrame.multiply,../reference/api/pandas.DataFrame.multiply
+generated/pandas.DataFrame.ndim,../reference/api/pandas.DataFrame.ndim
+generated/pandas.DataFrame.ne,../reference/api/pandas.DataFrame.ne
+generated/pandas.DataFrame.nlargest,../reference/api/pandas.DataFrame.nlargest
+generated/pandas.DataFrame.notna,../reference/api/pandas.DataFrame.notna
+generated/pandas.DataFrame.notnull,../reference/api/pandas.DataFrame.notnull
+generated/pandas.DataFrame.nsmallest,../reference/api/pandas.DataFrame.nsmallest
+generated/pandas.DataFrame.nunique,../reference/api/pandas.DataFrame.nunique
+generated/pandas.DataFrame.pct_change,../reference/api/pandas.DataFrame.pct_change
+generated/pandas.DataFrame.pipe,../reference/api/pandas.DataFrame.pipe
+generated/pandas.DataFrame.pivot,../reference/api/pandas.DataFrame.pivot
+generated/pandas.DataFrame.pivot_table,../reference/api/pandas.DataFrame.pivot_table
+generated/pandas.DataFrame.plot.barh,../reference/api/pandas.DataFrame.plot.barh
+generated/pandas.DataFrame.plot.bar,../reference/api/pandas.DataFrame.plot.bar
+generated/pandas.DataFrame.plot.box,../reference/api/pandas.DataFrame.plot.box
+generated/pandas.DataFrame.plot.density,../reference/api/pandas.DataFrame.plot.density
+generated/pandas.DataFrame.plot.hexbin,../reference/api/pandas.DataFrame.plot.hexbin
+generated/pandas.DataFrame.plot.hist,../reference/api/pandas.DataFrame.plot.hist
+generated/pandas.DataFrame.plot,../reference/api/pandas.DataFrame.plot
+generated/pandas.DataFrame.plot.kde,../reference/api/pandas.DataFrame.plot.kde
+generated/pandas.DataFrame.plot.line,../reference/api/pandas.DataFrame.plot.line
+generated/pandas.DataFrame.plot.pie,../reference/api/pandas.DataFrame.plot.pie
+generated/pandas.DataFrame.plot.scatter,../reference/api/pandas.DataFrame.plot.scatter
+generated/pandas.DataFrame.pop,../reference/api/pandas.DataFrame.pop
+generated/pandas.DataFrame.pow,../reference/api/pandas.DataFrame.pow
+generated/pandas.DataFrame.prod,../reference/api/pandas.DataFrame.prod
+generated/pandas.DataFrame.product,../reference/api/pandas.DataFrame.product
+generated/pandas.DataFrame.quantile,../reference/api/pandas.DataFrame.quantile
+generated/pandas.DataFrame.query,../reference/api/pandas.DataFrame.query
+generated/pandas.DataFrame.radd,../reference/api/pandas.DataFrame.radd
+generated/pandas.DataFrame.rank,../reference/api/pandas.DataFrame.rank
+generated/pandas.DataFrame.rdiv,../reference/api/pandas.DataFrame.rdiv
+generated/pandas.DataFrame.reindex_axis,../reference/api/pandas.DataFrame.reindex_axis
+generated/pandas.DataFrame.reindex,../reference/api/pandas.DataFrame.reindex
+generated/pandas.DataFrame.reindex_like,../reference/api/pandas.DataFrame.reindex_like
+generated/pandas.DataFrame.rename_axis,../reference/api/pandas.DataFrame.rename_axis
+generated/pandas.DataFrame.rename,../reference/api/pandas.DataFrame.rename
+generated/pandas.DataFrame.reorder_levels,../reference/api/pandas.DataFrame.reorder_levels
+generated/pandas.DataFrame.replace,../reference/api/pandas.DataFrame.replace
+generated/pandas.DataFrame.resample,../reference/api/pandas.DataFrame.resample
+generated/pandas.DataFrame.reset_index,../reference/api/pandas.DataFrame.reset_index
+generated/pandas.DataFrame.rfloordiv,../reference/api/pandas.DataFrame.rfloordiv
+generated/pandas.DataFrame.rmod,../reference/api/pandas.DataFrame.rmod
+generated/pandas.DataFrame.rmul,../reference/api/pandas.DataFrame.rmul
+generated/pandas.DataFrame.rolling,../reference/api/pandas.DataFrame.rolling
+generated/pandas.DataFrame.round,../reference/api/pandas.DataFrame.round
+generated/pandas.DataFrame.rpow,../reference/api/pandas.DataFrame.rpow
+generated/pandas.DataFrame.rsub,../reference/api/pandas.DataFrame.rsub
+generated/pandas.DataFrame.rtruediv,../reference/api/pandas.DataFrame.rtruediv
+generated/pandas.DataFrame.sample,../reference/api/pandas.DataFrame.sample
+generated/pandas.DataFrame.select_dtypes,../reference/api/pandas.DataFrame.select_dtypes
+generated/pandas.DataFrame.select,../reference/api/pandas.DataFrame.select
+generated/pandas.DataFrame.sem,../reference/api/pandas.DataFrame.sem
+generated/pandas.DataFrame.set_axis,../reference/api/pandas.DataFrame.set_axis
+generated/pandas.DataFrame.set_index,../reference/api/pandas.DataFrame.set_index
+generated/pandas.DataFrame.set_value,../reference/api/pandas.DataFrame.set_value
+generated/pandas.DataFrame.shape,../reference/api/pandas.DataFrame.shape
+generated/pandas.DataFrame.shift,../reference/api/pandas.DataFrame.shift
+generated/pandas.DataFrame.size,../reference/api/pandas.DataFrame.size
+generated/pandas.DataFrame.skew,../reference/api/pandas.DataFrame.skew
+generated/pandas.DataFrame.slice_shift,../reference/api/pandas.DataFrame.slice_shift
+generated/pandas.DataFrame.sort_index,../reference/api/pandas.DataFrame.sort_index
+generated/pandas.DataFrame.sort_values,../reference/api/pandas.DataFrame.sort_values
+generated/pandas.DataFrame.squeeze,../reference/api/pandas.DataFrame.squeeze
+generated/pandas.DataFrame.stack,../reference/api/pandas.DataFrame.stack
+generated/pandas.DataFrame.std,../reference/api/pandas.DataFrame.std
+generated/pandas.DataFrame.style,../reference/api/pandas.DataFrame.style
+generated/pandas.DataFrame.sub,../reference/api/pandas.DataFrame.sub
+generated/pandas.DataFrame.subtract,../reference/api/pandas.DataFrame.subtract
+generated/pandas.DataFrame.sum,../reference/api/pandas.DataFrame.sum
+generated/pandas.DataFrame.swapaxes,../reference/api/pandas.DataFrame.swapaxes
+generated/pandas.DataFrame.swaplevel,../reference/api/pandas.DataFrame.swaplevel
+generated/pandas.DataFrame.tail,../reference/api/pandas.DataFrame.tail
+generated/pandas.DataFrame.take,../reference/api/pandas.DataFrame.take
+generated/pandas.DataFrame.T,../reference/api/pandas.DataFrame.T
+generated/pandas.DataFrame.timetuple,../reference/api/pandas.DataFrame.timetuple
+generated/pandas.DataFrame.to_clipboard,../reference/api/pandas.DataFrame.to_clipboard
+generated/pandas.DataFrame.to_csv,../reference/api/pandas.DataFrame.to_csv
+generated/pandas.DataFrame.to_dense,../reference/api/pandas.DataFrame.to_dense
+generated/pandas.DataFrame.to_dict,../reference/api/pandas.DataFrame.to_dict
+generated/pandas.DataFrame.to_excel,../reference/api/pandas.DataFrame.to_excel
+generated/pandas.DataFrame.to_feather,../reference/api/pandas.DataFrame.to_feather
+generated/pandas.DataFrame.to_gbq,../reference/api/pandas.DataFrame.to_gbq
+generated/pandas.DataFrame.to_hdf,../reference/api/pandas.DataFrame.to_hdf
+generated/pandas.DataFrame.to,../reference/api/pandas.DataFrame.to
+generated/pandas.DataFrame.to_json,../reference/api/pandas.DataFrame.to_json
+generated/pandas.DataFrame.to_latex,../reference/api/pandas.DataFrame.to_latex
+generated/pandas.DataFrame.to_msgpack,../reference/api/pandas.DataFrame.to_msgpack
+generated/pandas.DataFrame.to_numpy,../reference/api/pandas.DataFrame.to_numpy
+generated/pandas.DataFrame.to_panel,../reference/api/pandas.DataFrame.to_panel
+generated/pandas.DataFrame.to_parquet,../reference/api/pandas.DataFrame.to_parquet
+generated/pandas.DataFrame.to_period,../reference/api/pandas.DataFrame.to_period
+generated/pandas.DataFrame.to_pickle,../reference/api/pandas.DataFrame.to_pickle
+generated/pandas.DataFrame.to_records,../reference/api/pandas.DataFrame.to_records
+generated/pandas.DataFrame.to_sparse,../reference/api/pandas.DataFrame.to_sparse
+generated/pandas.DataFrame.to_sql,../reference/api/pandas.DataFrame.to_sql
+generated/pandas.DataFrame.to_stata,../reference/api/pandas.DataFrame.to_stata

-   import numpy as np
-   import pandas as pd
-   import os
-   np.random.seed(123456)
-   np.set_printoptions(precision=4, suppress=True)
-   import matplotlib
-   matplotlib.style.use('ggplot')
-   pd.options.display.max_rows = 15
-
-   #### portions of this were borrowed from the
-   #### Pandas cheatsheet
-   #### created during the PyData Workshop-Sprint 2012
-   #### Hannah Chen, Henry Chow, Eric Cox, Robert Mauriello
-
-
-********************
-10 Minutes to pandas
-********************
-
-This is a short introduction to pandas, geared mainly for new users.
-You can see more complex recipes in the :ref:`Cookbook`
-
-Customarily, we import as follows:
-
-.. ipython:: python
-
-   import pandas as pd
-   import numpy as np
-   import matplotlib.pyplot as plt
-
-Object Creation
----------------
-
-See the :ref:`Data Structure Intro section `
-
-Creating a :class:`Series` by passing a list of values, letting pandas create
-a default integer index:
-
-.. ipython:: python
-
-   s = pd.Series([1,3,5,np.nan,6,8])
-   s
-
-Creating a :class:`DataFrame` by passing a numpy array, with a datetime index
-and labeled columns:
-
-.. ipython:: python
-
-   dates = pd.date_range('20130101', periods=6)
-   dates
-   df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
-   df
-
-Creating a ``DataFrame`` by passing a dict of objects that can be converted to series-like.
-
-.. ipython:: python
-
-   df2 = pd.DataFrame({ 'A' : 1.,
-                        'B' : pd.Timestamp('20130102'),
-                        'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
-                        'D' : np.array([3] * 4,dtype='int32'),
-                        'E' : pd.Categorical(["test","train","test","train"]),
-                        'F' : 'foo' })
-   df2
-
-Having specific :ref:`dtypes `
-
-.. ipython:: python
-
-   df2.dtypes
-
-If you're using IPython, tab completion for column names (as well as public
-attributes) is automatically enabled.

As you can see, the columns ``A``, ``B``, ``C``, and ``D`` are automatically
tab completed. ``E`` is there as well; the rest of the attributes have been
truncated for brevity.

It is by
default not included in computations.

This
returns a copy of the data.

Note that pattern-matching in `str` generally uses `regular
expressions `__ by default (and in
some cases always uses them).

See the :ref:`Database style joining `

See the :ref:`Appending `

This is extremely common in, but not limited to,
financial applications. In the following example, we convert a quarterly
frequency with year ending in November to 9am of the end of the month following
the quarter end:

For full docs, see the
:ref:`categorical introduction ` and the :ref:`API documentation `.

Use a.empty, a.any() or a.all().
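
The advice to use ``a.empty``, ``a.any()`` or ``a.all()`` comes from the error pandas raises when a ``Series`` is used directly in a boolean context. The following is a minimal sketch of that situation; the Series ``a`` and its values are illustrative assumptions, not taken from the original text.

```python
import pandas as pd

# Hypothetical example data; any multi-element boolean Series behaves the same way.
a = pd.Series([False, True, False])

# Using the Series directly as a condition raises ValueError, because pandas
# cannot tell which reduction (any, all, emptiness) is intended.
message = ""
try:
    if a:
        pass
except ValueError as exc:
    message = str(exc)

is_empty = a.empty   # True only when the Series has no elements
any_true = a.any()   # True when at least one element is truthy
all_true = a.all()   # True only when every element is truthy

print(message)
print(is_empty, any_true, all_true)
```

Choosing one of the three explicit reductions states the intent that a bare ``if a:`` leaves ambiguous.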

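
The quarterly conversion described earlier (a quarterly frequency with the year ending in November, converted to 9am in the month following the quarter end) can be sketched roughly as follows. The variable names ``prng`` and ``ts``, the date range and the random values are assumptions made for illustration. The lowercase ``"h"`` alias is the spelling used by recent pandas releases; older releases document it as ``"H"``.

```python
import numpy as np
import pandas as pd

# Quarterly periods whose fiscal year ends in November (hypothetical date range).
prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), index=prng)

# 1. asfreq("M", "e"): take the last month of each quarter.
# 2. + 1:              move to the month after the quarter end.
# 3. asfreq("h", "s"): take the first hour of that month.
# 4. + 9:              advance nine hours, giving 09:00 on the first day of that month.
ts.index = (prng.asfreq("M", "e") + 1).asfreq("h", "s") + 9

print(ts.head())
```

Each arithmetic step operates in the frequency of the index at that point, which is why ``+ 1`` adds a month and ``+ 9`` adds hours.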