Merge pull request scikit-learn#1 from scikit-learn/master

Merging changes from the main repository
hongshaoyang · Apr 11, 2020 · 3b79637
2 parents 4346c82 + 0a93fc9
commit 3b79637
Showing 119 changed files with 2,298 additions and 1,029 deletions.
diff --git a/Makefile b/Makefile
@@ -67,4 +67,4 @@ code-analysis:
 	pylint -E -i y sklearn/ -d E1103,E0611,E1101
 
 flake8-diff:
-	./build_tools/circle/linting.sh
+	git diff upstream/master -u -- "*.py" | flake8 --diff
diff --git a/README.rst b/README.rst
@@ -31,12 +31,12 @@ SciPy and is distributed under the 3-Clause BSD license.
 
 The project was started in 2007 by David Cournapeau as a Google Summer
 of Code project, and since then many volunteers have contributed. See
-the `About us <http://scikit-learn.org/dev/about.html#authors>`__ page
+the `About us <https://scikit-learn.org/dev/about.html#authors>`__ page
 for a list of core contributors.
 
 It is currently maintained by a team of volunteers.
 
-Website: http://scikit-learn.org
+Website: https://scikit-learn.org
 
 
 Installation
@@ -73,21 +73,21 @@ or ``conda``::
 
     conda install scikit-learn
 
-The documentation includes more detailed `installation instructions <http://scikit-learn.org/stable/install.html>`_.
+The documentation includes more detailed `installation instructions <https://scikit-learn.org/stable/install.html>`_.
 
 
 Changelog
 ---------
 
-See the `changelog <http://scikit-learn.org/dev/whats_new.html>`__
+See the `changelog <https://scikit-learn.org/dev/whats_new.html>`__
 for a history of notable changes to scikit-learn.
 
 Development
 -----------
 
 We welcome new contributors of all experience levels. The scikit-learn
 community goals are to be helpful, welcoming, and effective. The
-`Development Guide <http://scikit-learn.org/stable/developers/index.html>`_
+`Development Guide <https://scikit-learn.org/stable/developers/index.html>`_
 has detailed information about contributing code, documentation, tests, and
 more. We've included some basic information in this README.
 
@@ -120,7 +120,7 @@ source directory (you will need to have ``pytest`` >= 3.3.0 installed)::
 
     pytest sklearn
 
-See the web page http://scikit-learn.org/dev/developers/advanced_installation.html#testing
+See the web page https://scikit-learn.org/dev/developers/advanced_installation.html#testing
 for more information.
 
     Random number generation can be controlled during testing by setting
@@ -131,15 +131,15 @@ Submitting a Pull Request
 
 Before opening a Pull Request, have a look at the
 full Contributing page to make sure your code complies
-with our guidelines: http://scikit-learn.org/stable/developers/index.html
+with our guidelines: https://scikit-learn.org/stable/developers/index.html
 
 
 Project History
 ---------------
 
 The project was started in 2007 by David Cournapeau as a Google Summer
 of Code project, and since then many volunteers have contributed. See
-the `About us <http://scikit-learn.org/dev/about.html#authors>`__ page
+the `About us <https://scikit-learn.org/dev/about.html#authors>`__ page
 for a list of core contributors.
 
 The project is currently maintained by a team of volunteers.
@@ -153,19 +153,19 @@ Help and Support
 Documentation
 ~~~~~~~~~~~~~
 
-- HTML documentation (stable release): http://scikit-learn.org
-- HTML documentation (development version): http://scikit-learn.org/dev/
-- FAQ: http://scikit-learn.org/stable/faq.html
+- HTML documentation (stable release): https://scikit-learn.org
+- HTML documentation (development version): https://scikit-learn.org/dev/
+- FAQ: https://scikit-learn.org/stable/faq.html
 
 Communication
 ~~~~~~~~~~~~~
 
 - Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn
 - IRC channel: ``#scikit-learn`` at ``webchat.freenode.net``
 - Stack Overflow: https://stackoverflow.com/questions/tagged/scikit-learn
-- Website: http://scikit-learn.org
+- Website: https://scikit-learn.org
 
 Citation
 ~~~~~~~~
 
-If you use scikit-learn in a scientific publication, we would appreciate citations: http://scikit-learn.org/stable/about.html#citing-scikit-learn
+If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn
diff --git a/azure-pipelines.yml b/azure-pipelines.yml
@@ -17,18 +17,31 @@ jobs:
       displayName: Add conda to PATH
     - bash: sudo chown -R $USER $CONDA
       displayName: Take ownership of conda installation
-    - bash: conda create --name flake8_env --yes flake8
+    - bash: |
+        conda create --name flake8_env --yes python=3.8
+        conda activate flake8_env
+        pip install flake8 mypy==0.770
       displayName: Install flake8
     - bash: |
         if [[ $BUILD_SOURCEVERSIONMESSAGE =~ \[lint\ skip\] ]]; then
           # skip linting
           echo "Skipping linting"
           exit 0
         else
-          source activate flake8_env
+          conda activate flake8_env
           ./build_tools/circle/linting.sh
         fi
       displayName: Run linting
+    - bash: |
+        if [[ $BUILD_SOURCEVERSIONMESSAGE =~ \[lint\ skip\] ]]; then
+          # skip linting
+          echo "Skipping linting"
+          exit 0
+        else
+          conda activate flake8_env
+          mypy sklearn/ --ignore-missing-imports
+        fi
+      displayName: Run mypy
     - bash: |
         if [[ $BUILD_SOURCEVERSIONMESSAGE =~ \[scipy-dev\] ]] || \
            [[ $BUILD_REASON == "Schedule" ]]; then

diff --git a/benchmarks/bench_hist_gradient_boosting_higgsboson.py b/benchmarks/bench_hist_gradient_boosting_higgsboson.py
@@ -25,12 +25,14 @@
 parser.add_argument('--learning-rate', type=float, default=1.)
 parser.add_argument('--subsample', type=int, default=None)
 parser.add_argument('--max-bins', type=int, default=255)
+parser.add_argument('--no-predict', action="store_true", default=False)
+parser.add_argument('--cache-loc', type=str, default='/tmp')
 args = parser.parse_args()
 
 HERE = os.path.dirname(__file__)
 URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/00280/"
        "HIGGS.csv.gz")
-m = Memory(location='/tmp', mmap_mode='r')
+m = Memory(location=args.cache_loc, mmap_mode='r')
 
 n_leaf_nodes = args.n_leaf_nodes
 n_trees = args.n_trees
@@ -56,6 +58,27 @@ def load_data():
     return df
 
 
+def fit(est, data_train, target_train, libname):
+    print(f"Fitting a {libname} model...")
+    tic = time()
+    est.fit(data_train, target_train)
+    toc = time()
+    print(f"fitted in {toc - tic:.3f}s")
+
+
+def predict(est, data_test, target_test):
+    if args.no_predict:
+        return
+    tic = time()
+    predicted_test = est.predict(data_test)
+    predicted_proba_test = est.predict_proba(data_test)
+    toc = time()
+    roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
+    acc = accuracy_score(target_test, predicted_test)
+    print(f"predicted in {toc - tic:.3f}s, "
+          f"ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")
+
+
 df = load_data()
 target = df.values[:, 0]
 data = np.ascontiguousarray(df.values[:, 1:])
@@ -68,56 +91,28 @@ def load_data():
 n_samples, n_features = data_train.shape
 print(f"Training set with {n_samples} records with {n_features} features.")
 
-print("Fitting a sklearn model...")
-tic = time()
 est = HistGradientBoostingClassifier(loss='binary_crossentropy',
                                      learning_rate=lr,
                                      max_iter=n_trees,
                                      max_bins=max_bins,
                                      max_leaf_nodes=n_leaf_nodes,
-                                     n_iter_no_change=None,
+                                     early_stopping=False,
                                      random_state=0,
                                      verbose=1)
-est.fit(data_train, target_train)
-toc = time()
-predicted_test = est.predict(data_test)
-predicted_proba_test = est.predict_proba(data_test)
-roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
-acc = accuracy_score(target_test, predicted_test)
-print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")
+fit(est, data_train, target_train, 'sklearn')
+predict(est, data_test, target_test)
 
 if args.lightgbm:
-    print("Fitting a LightGBM model...")
-    tic = time()
-    lightgbm_est = get_equivalent_estimator(est, lib='lightgbm')
-    lightgbm_est.fit(data_train, target_train)
-    toc = time()
-    predicted_test = lightgbm_est.predict(data_test)
-    predicted_proba_test = lightgbm_est.predict_proba(data_test)
-    roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
-    acc = accuracy_score(target_test, predicted_test)
-    print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")
+    est = get_equivalent_estimator(est, lib='lightgbm')
+    fit(est, data_train, target_train, 'lightgbm')
+    predict(est, data_test, target_test)
 
 if args.xgboost:
-    print("Fitting an XGBoost model...")
-    tic = time()
-    xgboost_est = get_equivalent_estimator(est, lib='xgboost')
-    xgboost_est.fit(data_train, target_train)
-    toc = time()
-    predicted_test = xgboost_est.predict(data_test)
-    predicted_proba_test = xgboost_est.predict_proba(data_test)
-    roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
-    acc = accuracy_score(target_test, predicted_test)
-    print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")
+    est = get_equivalent_estimator(est, lib='xgboost')
+    fit(est, data_train, target_train, 'xgboost')
+    predict(est, data_test, target_test)
 
 if args.catboost:
-    print("Fitting a Catboost model...")
-    tic = time()
-    catboost_est = get_equivalent_estimator(est, lib='catboost')
-    catboost_est.fit(data_train, target_train)
-    toc = time()
-    predicted_test = catboost_est.predict(data_test)
-    predicted_proba_test = catboost_est.predict_proba(data_test)
-    roc_auc = roc_auc_score(target_test, predicted_proba_test[:, 1])
-    acc = accuracy_score(target_test, predicted_test)
-    print(f"done in {toc - tic:.3f}s, ROC AUC: {roc_auc:.4f}, ACC: {acc :.4f}")
+    est = get_equivalent_estimator(est, lib='catboost')
+    fit(est, data_train, target_train, 'catboost')
+    predict(est, data_test, target_test)
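The two options added to the benchmark's argument parser can be sketched in isolation. This is a minimal illustration that assumes only standard-library `argparse` behaviour; the `parse` helper and the `/scratch/cache` path are hypothetical and not part of the benchmark script.

```python
import argparse


def parse(argv):
    """Hypothetical helper mirroring the two options added in this diff."""
    parser = argparse.ArgumentParser()
    # store_true: the attribute stays False unless the flag is passed.
    parser.add_argument('--no-predict', action="store_true", default=False)
    # Dashes in the option name become underscores in the attribute name.
    parser.add_argument('--cache-loc', type=str, default='/tmp')
    return parser.parse_args(argv)


defaults = parse([])
custom = parse(['--no-predict', '--cache-loc', '/scratch/cache'])
print(defaults.no_predict, defaults.cache_loc)
print(custom.no_predict, custom.cache_loc)
```

With `--no-predict` set, the new `predict` helper in the diff returns early, so only fitting time is reported.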
diff --git a/build_tools/azure/install.sh b/build_tools/azure/install.sh
@@ -97,6 +97,10 @@ elif [[ "$DISTRIB" == "conda-pip-latest" ]]; then
     make_conda "python=$PYTHON_VERSION"
     python -m pip install -U pip
     python -m pip install pytest==$PYTEST_VERSION pytest-cov pytest-xdist
+
+    # TODO: Remove pin when https://github.com/python-pillow/Pillow/issues/4518 gets fixed
+    python -m pip install "pillow>=4.3.0,!=7.1.0,!=7.1.1"
+
     python -m pip install pandas matplotlib pyamg scikit-image
     # do not install dependencies for lightgbm since it requires scikit-learn
     python -m pip install lightgbm --no-deps

diff --git a/conftest.py b/conftest.py
@@ -87,6 +87,11 @@ def pytest_collection_modifyitems(config, items):
 def pytest_configure(config):
     import sys
     sys._is_pytest_session = True
+    # declare our custom markers to avoid PytestUnknownMarkWarning
+    config.addinivalue_line(
+        "markers",
+        "network: mark a test for execution if network available."
+    )
 
 
 def pytest_unconfigure(config):

diff --git a/doc/about.rst b/doc/about.rst
@@ -13,7 +13,7 @@ this project as part of his thesis.
 In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent
 Michel of INRIA took leadership of the project and made the first public
 release, February the 1st 2010. Since then, several releases have appeared
-following a ~3 month cycle, and a thriving international community has
+following a ~ 3-month cycle, and a thriving international community has
 been leading the development.
 
 Governance
@@ -520,7 +520,7 @@ budget of the project [#f1]_.
 
 .. rubric:: Notes
 
-.. [#f1] Regarding the organization budget in particular, we might use some of
+.. [#f1] Regarding the organization budget, in particular, we might use some of
          the donated funds to pay for other project expenses such as DNS,
          hosting or continuous integration services.
 

diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst
@@ -181,12 +181,12 @@ Contributing code
   If in doubt about duplicated work, or if you want to work on a non-trivial
   feature, it's recommended to first open an issue in
   the `issue tracker <https://github.com/scikit-learn/scikit-learn/issues>`_
-  to get some feedbacks from core developers. 
-  
-  One easy way to find an issue to work on is by applying the "help wanted" 
-  label in your search. This lists all the issues that have been unclaimed 
-  so far. In order to claim an issue for yourself, please comment exactly 
-  ``take`` on it for the CI to automatically assign the issue to you.  
+  to get some feedbacks from core developers.
+
+  One easy way to find an issue to work on is by applying the "help wanted"
+  label in your search. This lists all the issues that have been unclaimed
+  so far. In order to claim an issue for yourself, please comment exactly
+  ``take`` on it for the CI to automatically assign the issue to you.
 
 How to contribute
 -----------------
@@ -215,7 +215,7 @@ how to set up your git repository:
 
 4. Install the development dependencies::
 
-       $ pip install cython pytest pytest-cov flake8
+       $ pip install cython pytest pytest-cov flake8 mypy
 
 5. Install scikit-learn in editable mode::
 
@@ -224,6 +224,8 @@ how to set up your git repository:
    for more details about advanced installation, see the
    :ref:`install_bleeding_edge` section.
 
+.. _upstream:
+
 6. Add the ``upstream`` remote. This saves a reference to the main
    scikit-learn repository, which you can use to keep your repository
    synchronized with the latest changes::
@@ -356,13 +358,17 @@ complies with the following rules before marking a PR as ``[MRG]``. The
    non-regression tests should fail for the code base in the master branch
    and pass for the PR code.
 
-5. **Make sure that your PR does not add PEP8 violations**. On a Unix-like
-   system, you can run `make flake8-diff`. `flake8 path_to_file`, would work
-   for any system, but please avoid reformatting parts of the file that your
-   pull request doesn't change, as it distracts from code review.
+5. **Make sure that your PR does not add PEP8 violations**. To check the
+   code that you changed, you can run the following command (see
+   :ref:`above <upstream>` to set up the upstream remote)::
+
+        git diff upstream/master -u -- "*.py" | flake8 --diff
+
+   or `make flake8-diff`, which should work on Unix-like systems.
 
 6. Follow the :ref:`coding-guidelines`.
 
+
 7. When applicable, use the validation tools and scripts in the
    ``sklearn.utils`` submodule.  A list of utility routines available
    for developers can be found in the :ref:`developers-utils` page.
@@ -408,6 +414,18 @@ You can check for common programming errors with the following tools:
 
   see also :ref:`testing_coverage`
 
+* A moderate use of type annotations is encouraged but is not mandatory. See
+  `mypy quickstart <https://mypy.readthedocs.io/en/latest/getting_started.html>`_
+  for an introduction, as well as `pandas contributing documentation
+  <https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#type-hints>`_
+  for style guidelines. Whether you add type annotations or not::
+
+    mypy --ignore-missing-imports sklearn
+
+  must not produce new errors in your pull request. Using a `# type: ignore` annotation can be a workaround for a few cases that are not supported by mypy, in particular:
+   - when importing C or Cython modules
+   - on properties with decorators
+
 Bonus points for contributions that include a performance analysis with
 a benchmark script and profiling output (please report on the mailing
 list or on the GitHub issue).
@@ -662,7 +680,7 @@ In general have the following in mind:
     4. 1D or 2D data can be a subset of
        ``{array-like, ndarray, sparse matrix, dataframe}``. Note that ``array-like``
        can also be a ``list``, while ``ndarray`` is explicitly only a ``numpy.ndarray``.
-    5. When specifying the data type of a list, use ``of`` as a delimiter: 
+    5. When specifying the data type of a list, use ``of`` as a delimiter:
        ``list of int``.
     6. When specifying the dtype of an ndarray, use e.g. ``dtype=np.int32``
        after defining the shape:
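
To illustrate the type-annotation guidance added to this file, here is a minimal sketch of a moderately annotated helper. The function `clip_values` and its parameters are hypothetical and do not exist in scikit-learn; only the standard `typing` module is assumed.

```python
from typing import List, Optional


def clip_values(values: List[float],
                upper: Optional[float] = None) -> List[float]:
    """Return a copy of ``values`` with entries capped at ``upper``."""
    if upper is None:
        # No cap requested: return an unchanged copy.
        return list(values)
    return [min(v, upper) for v in values]


print(clip_values([0.5, 3.0, 7.25], upper=2.0))
```

Running mypy on a module containing this function should flag a call such as `clip_values("abc")`, since a `str` is not a `List[float]`.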

diff --git a/doc/developers/maintainer.rst b/doc/developers/maintainer.rst
@@ -289,6 +289,14 @@ submodule/subpackage of the public subpackage, e.g.
 ``sklearn/impute/_iterative.py``. This is needed so that pickles still work
 in the future when the features aren't experimental anymore
 
+To avoid type checker (e.g. mypy) errors, a direct import of experimental
+estimators should be done in the parent module, protected by the
+``if typing.TYPE_CHECKING`` check. See `sklearn/ensemble/__init__.py
+<https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/__init__.py>`_,
+or `sklearn/impute/__init__.py
+<https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/impute/__init__.py>`_
+for an example.
+
 Please also write basic tests following those in
 `test_enable_hist_gradient_boosting.py
 <https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/experimental/tests/test_enable_hist_gradient_boosting.py>`_.
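
The `typing.TYPE_CHECKING` guard described in the added paragraph can be sketched as follows. This is an illustration under stated assumptions, not the code in `sklearn/ensemble/__init__.py`: `fractions.Fraction` stands in for an experimental estimator, and `describe` is a hypothetical function.

```python
import fractions
import typing

if typing.TYPE_CHECKING:
    # Seen by static type checkers such as mypy, but skipped at runtime
    # because typing.TYPE_CHECKING is False when the program executes.
    from fractions import Fraction  # stand-in for an experimental estimator


def describe(value: "Fraction") -> str:
    # The quoted annotation is not evaluated at runtime, so the name
    # only needs to resolve for the type checker.
    return f"{value.numerator}/{value.denominator}"


print(typing.TYPE_CHECKING)
print(describe(fractions.Fraction(1, 2)))
```

The guarded import lets the type checker resolve the name without triggering the import, and any side effects it carries, at runtime.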