Mention intel conda channel in installation doc #14247
Conversation
Side note: the Enthought Canopy distribution does not seem up to date: it only ships Python 3.5 and scikit-learn 0.20.2 as far as I can tell. Maybe we should stop mentioning it in this installation documentation.
Also note: when the monkeypatches are applied, a notice on stderr informs the user about the active DAAL solvers:
Does daal support the same arguments as scikit-learn now and provide the same results? Last time I checked there were major issues with compatibility. I don't see the benefit of monkey-patching sklearn with alternative implementations instead of just providing them.
Last time I checked, the benchmarks were totally incompatible comparisons and the daal team couldn't tell me what the algorithms were actually doing (whether k-means does random restarts, for example), so I'm a bit hesitant to trust them.
Can you explain what the numbers in the benchmark mean?
Yes, it does. Patched sklearn passes sklearn's test suite with the exception of a few tests specified in
That doesn't mean that the results are the same.
Numbers in https://intelpython.github.io/scikit-learn_bench/ show ratios of the execution time of Python code to the execution time of native C++ code that also uses DAAL. The intent is to show native efficiency, i.e. how efficiently machine resources are being used, at least out of the box. DAAL's performance derives from efficient parallelization, cache-friendly blocking and effective use of vector instructions.
Sorry, I don't understand this sentence.
@amueller While passing test suite tests does not imply the results are the same, every effort is made to ensure correctness of DAAL's implementation, and therefore DAAL's result is close to scikit-learn's result, within boundaries implied by the solver's tolerance. Sometimes results can differ. For example, for the k-means algorithm, when dealing with a vacuous cluster after Lloyd's update, DAAL and sklearn may choose different points when there are several equidistant candidates. See intel/scikit-learn-intelex#69
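To illustrate the kind of ambiguity described above, here is a toy sketch (not DAAL's or scikit-learn's actual re-seeding code; `farthest_points` is an invented helper): when an empty cluster must be re-seeded with the point farthest from a center and several points are equidistant, different, equally valid implementations may pick different candidates.

```python
def farthest_points(points, center):
    """Return all points tied for the maximum distance from `center`.

    Toy 1-D illustration: a tie here means an implementation has to
    break it somehow, and two correct implementations may break it
    differently.
    """
    dists = [abs(p - center) for p in points]
    m = max(dists)
    return [p for p, d in zip(points, dists) if d == m]

candidates = farthest_points([0.0, 2.0, -2.0, 1.0], center=0.0)
print(candidates)  # two equidistant candidates: [2.0, -2.0]
```

Both 2.0 and -2.0 are defensible choices, which is why bitwise-identical results across implementations are not guaranteed even when both are correct.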
I meant to say that we write a Python script that solves an ML problem using scikit-learn, then write C++ code that solves the same problem using DAAL (we check for agreement of these results, which are saved in NPY files). The ratio shows
Ah, so "1" here means the speed of your C++ implementation, and this is the speedup factor of using your Python wrappers vs using standard Python libraries. SVM.predict has a number greater than 1, which would mean wrapping makes it faster, right? So is that within the measurement error? It would be good to have error bars, maybe?
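A minimal sketch of how such error bars could be obtained: repeat each timing, then propagate the spread into the ratio with first-order error propagation. The workloads below are toy stand-ins, not the actual benchmark code.

```python
import statistics
import time

def time_call(fn, repeats=5):
    """Time fn() several times; return (mean, stdev) of the samples in seconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

# Toy workloads standing in for the "native C++" and "python" timings
# measured by the benchmark; replace with the real benchmark calls.
native_mean, native_sd = time_call(lambda: sum(range(100_000)))
python_mean, python_sd = time_call(lambda: sum(range(200_000)))

ratio = python_mean / native_mean
# First-order error propagation for a ratio of independent measurements:
# relative errors add in quadrature.
ratio_err = ratio * ((python_sd / python_mean) ** 2 +
                     (native_sd / native_mean) ** 2) ** 0.5
print(f"ratio = {ratio:.2f} +/- {ratio_err:.2f}")
```

With an error bar attached, a ratio like 1.05 +/- 0.10 would be indistinguishable from 1, answering the "within the measurement error?" question directly.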
A quick note to clarify this. Canopy is not the distribution. The distribution is managed by edm (Enthought Deployment Manager), available here (doc). EDM can be used standalone or embedded in Canopy. EDM supports various flavours of Python (2.7, 3.5, 3.6). As for versions of scikit-learn, the distribution is targeting quarterly updates for core packages like scikit-learn, rather than continuous updates of packages (knowing we always need a fully working dependency set).
doc/install.rst (outdated):

    Intel maintains a dedicated conda channel that ships scikit-learn::

        conda install -c intel scikit-learn
Does this mean that `conda install scikit-learn` installs a different codebase when pulling from the intel conda channel?
This would be a bit problematic from the point of view of communication with the community. Basically, it would mean that a package called "scikit-learn" is not the scikit-learn codebase.
The scikit-learn code in the intel channel is the same as the original, apart from scikit-learn's `__init__`, which is changed to trigger the monkeypatching from the daal4py package (with a user notification on stderr).
On PyPI, the package is called "intel-scikit-learn" (https://pypi.org/project/intel-scikit-learn/#description). I think it would be clearer if the same were done for conda, so you would do `conda install intel-scikit-learn`.
That would be a possible solution. Is there a way in conda to make sure that scikit-learn is not installed when intel-scikit-learn is installed? What about projects with a dependency on scikit-learn?
I agree that this is very confusing and will result in headaches on our side and probably the user side.
Indeed this behavior sounds extremely surprising and headache inducing. Changing a conda channel should not provide a version of the code that was never a scikit-learn release, in my opinion.
An update on this discussion: Intel has changed to disable monkey-patching by default. In other words: when installing scikit-learn from the Intel channel, unmodified scikit-learn code is run by default, and the user needs to take a specific action to enable the patches. Hence I think that:
- We can safely point to the Intel channel
- We should briefly explain how it works here

@oleksandr-pavlyk, is there a webpage that we can point to (for instance a README somewhere) that describes this process? It might be useful for users.
Thanks @dpinte for the update.
As a user, I personally would really not like it if I started working on a server, or deployed my code on a server, and the solvers were different from what I expect. So from that perspective, I think we should make it very explicit and more like an opt-in for users to switch the backend solvers. PyDaal already makes it pretty easy with:

```python
import daal4py.sklearn
daal4py.sklearn.patch_sklearn()
```

As a user, I would rather be sure that when I deploy, it's sklearn, and not Intel's sklearn, running my code, and yet have the possibility of monkeypatching daal4py into sklearn with the above two lines of code if possible. From sklearn's side, I think it only makes sense to include the above two lines, or something like: "You can make use of Intel's PyDaal backend, as explained in the docs: link", and the code.
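For scripts that should run both with and without the accelerator installed, the opt-in call can be guarded behind an import check. This is a sketch: `maybe_patch_sklearn` is an invented wrapper around the `patch_sklearn` call shown above, not part of any package.

```python
def maybe_patch_sklearn():
    """Apply the daal4py patches if the package is available.

    Returns True if the patches were applied, False if daal4py is not
    installed. This keeps the opt-in explicit while letting the same
    script run unmodified on machines without the accelerator.
    """
    try:
        import daal4py.sklearn  # optional accelerator package
    except ImportError:
        return False
    daal4py.sklearn.patch_sklearn()
    return True

patched = maybe_patch_sklearn()
print("daal4py patches active:", patched)
```

Either way, subsequent `from sklearn... import ...` statements work unchanged; only the backing solvers differ.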
I agree with @adrinjalali that the explicit import of
I agree with @adrinjalali and @jorisvandenbossche; there would ideally be only one advertised way to turn on these patches, and an explicit import seems to be the clearest way for users.
Thank you for your feedback. Our understanding was that the scikit-learn consortium recognized the existence of vendor builds that include scikit-learn. Vendors package a Python stack, including dependencies of scikit-learn, which can influence its behavior (such as the choice of libraries (MKL or BLAS or NAG), versions of libraries, the compiler used, etc.). The purpose of our vendor build is to provide users with good performance on Intel architecture. Ideally this would be done by contributing to the respective open source projects, but that takes time, so we patch, hoping to deliver performance to end-users quicker, while working with communities to upstream changes, or open-sourcing packages like daal4py, mkl_fft, mkl-service, mkl_random, etc. With daal4py, we are making every effort to maintain correctness in a transparent fashion, by developing, building and testing daal4py as an open source project, using the same CI systems that scikit-learn itself uses. C++ sources of DAAL itself are also publicly available at github.com/Intel/DAAL. Patching the code explicitly informs the user that it is being done by printing a message to the standard error stream:

This informs the user of the patching even if he or she was not involved in the decision of which version to install. The change under review explicitly states that scikit-learn's algorithms are modified:

Perhaps "version of scikit-learn" should rather read "binary build of scikit-learn". Hence users are advised to exercise caution and run tests, both those coming with the package as well as their own. This is advised with every numerical software update, regardless of its provenance, as well as when changing hardware (moving from a laptop to a server). This is where support for automatic monkey-patching with

In conclusion, I'd like to point out that serving a patched scikit-learn on conda channels follows suit of the NumPy project, where both the 'defaults' and the 'intel' conda channels serve a significantly patched NumPy. The reason for the defaults channel serving patched NumPy has to do with the default choice of Intel(R) MKL as the BLAS/LAPACK implementation, so further use of MKL to optimize universal functions by calling MKL VML functions, or FFT by calling

I think it is very important that the default conda channel continues providing unpatched scikit-learn binaries. Intel currently modifies scikit-learn's

Perhaps within the conda ecosystem there is room for two (or more) flavors of scikit-learn packages: a patched one that depends on

As is the case with NumPy, conda packages may have different names, but they'd still provide identically named Python packages.
@oleksandr-pavlyk thanks for the detailed review. I understand how being able to monkey-patch the library and use your backend can be useful for users. What I do not understand from your discussion is why it is necessary to have/advertise a different sklearn package instead of telling people how to activate the monkey-patch with two lines of code, which seems to be the consensus and the preferred way among our contributors, as you see in this thread. To be clear, my point is not that you shouldn't have the monkey-patched version of the library in your channel; I'm just saying that from our perspective, it seems better for us and our users to have them explicitly activate the monkey-patch with a two-liner.
The main conda channel serves binaries of scikit-learn compiled from unchanged community-released sources. It takes a user's active decision to install scikit-learn from an alternative channel, such as the Intel channel, which serves packages for the Intel(R) Distribution for Python*. The value proposition of the distribution is "improved out of the box performance with no code changes", which is why Python packages on the Intel channel are patched.
> The value proposition of the distribution is "improved out of the box performance with no code changes", which is why Python packages on the Intel channel are patched.

Distributing patched code without changing the name is not transparent to the user. It is common practice for major software projects to forbid this:

- Distributing a browser called "Firefox" requires a written agreement from the Mozilla foundation: https://www.mozilla.org/en-US/foundation/trademarks/distribution-policy/
- Google Chrome is a trademark owned by Google, which distributes its own build; Chromium browsers are modified builds: https://en.wikipedia.org/wiki/Chromium_(web_browser)#Licensing
It looks like we have a consensus on how we want this to go forward. Do we still have open concerns/questions?
We are in discussion with Intel to see if something can be done on their side so that the difference between scikit-learn and the Intel module is more clear. Let's leave this open. |
A related discussion for the larger context of conda-forge (posting here for reference |
Please be advised that the most recent scikit-learn conda package on the Intel channel no longer applies any patches by default. When installing the conda package:

Importing scikit-learn no longer loads daal4py by default:
NumPy has `_distributor_init.py`: https://github.com/numpy/numpy/blob/master/numpy/_distributor_init.py This PR adds one for scikit-learn. The intent is to create a place for distributor-specific logic, e.g. to implement behavior like in scikit-learn#14247 (comment)
Alright, I have updated this PR to be rebased on top of the current master. As this distribution of scikit-learn no longer enables the daal solvers by default, but requires the user to set an environment variable or make an explicit code change, I think the previously expressed concerns about maintenance have been addressed. I decided not to put too many details on how to enable the daal4py solvers, but rather to link to the daal4py documentation, so that we do not need to change the scikit-learn documentation if the way to enable them changes in the future. Please let me know what you think.
Co-Authored-By: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
I'm happy with this now.
I am okay with the current state of this PR.
Should we merge this? I don't think the current version of the PR is too controversial.
LGTM
@adrinjalali Could you cherry-pick this PR for 0.22?
will do
Update the scikit-learn installation doc to mention the intel conda channel maintained by @oleksandr-pavlyk and his colleagues. This channel ships a scikit-learn version with the daal4py solvers.
https://intelpython.github.io/daal4py/sklearn.html
For information, the daal4py solvers often bring significant speed-ups on Xeon machines with many cores, in particular for K-Means and SVMs:
https://github.com/IntelPython/scikit-learn_bench
The speed difference for K-Means should be reduced once #11950 is merged. The speed difference for SVM might be reduced if we add multi-threading to the libsvm solvers, potentially via OpenMP.