
Mention intel conda channel in installation doc #14247

Merged
merged 5 commits into from Dec 2, 2019

Conversation

ogrisel
Member

@ogrisel ogrisel commented Jul 3, 2019

Update the scikit-learn installation doc to mention the intel conda channel maintained by @oleksandr-pavlyk and his colleagues. This channel ships a scikit-learn version with the daal4py solvers.

https://intelpython.github.io/daal4py/sklearn.html

For reference, the daal4py solvers often bring significant speed-ups on Xeon machines with many cores, in particular for K-Means and SVMs:

https://github.com/IntelPython/scikit-learn_bench

The speed difference for K-Means should be reduced once #11950 is merged. The speed difference for SVM might be reduced if we add multi-threading to the libsvm solvers potentially via openmp.

@ogrisel
Member Author

ogrisel commented Jul 3, 2019

Side note: the Enthought Canopy distribution does not seem up to date: it only ships Python 3.5 and scikit-learn 0.20.2 as far as I can tell.

Maybe we should stop mentioning it in this installation documentation.

@ogrisel
Member Author

ogrisel commented Jul 3, 2019

Also note: when the monkeypatch is applied, a notice on stderr informs the user about the active DAAL solvers:

>>> import sklearn
Intel(R) DAAL solvers for sklearn enabled: https://intelpython.github.io/daal4py/sklearn.html

@amueller
Member

amueller commented Jul 3, 2019

Does daal support the same arguments as scikit-learn now and provide the same results? Last time I checked there were major issues with compatibility.

I don't see the benefit of monkey-patching sklearn with alternative implementations instead of just providing them.

@amueller
Member

amueller commented Jul 3, 2019

Last time I checked, the benchmarks were totally incompatible comparisons and the DAAL team couldn't tell me what the algorithms were actually doing (whether k-means does random restarts, for example), so I'm a bit hesitant to trust them.

@amueller
Member

amueller commented Jul 3, 2019

Can you explain what the numbers in the benchmark mean?
https://intelpython.github.io/scikit-learn_bench/

@oleksandr-pavlyk
Contributor

Yes, it does. Patched sklearn passes sklearn's test suite, with the exception of a few tests specified in deselected_tests.yaml in daal4py, with references to issues filed either in sklearn or in daal4py.

@amueller
Member

amueller commented Jul 3, 2019

Patched sklearn passes sklearn's test suite tests

That doesn't mean that the results are the same

@oleksandr-pavlyk
Contributor

Numbers in https://intelpython.github.io/scikit-learn_bench/ show ratios of the execution time of Python code to the execution time of native C++ code that also uses DAAL.

The intent is to show native efficiency, i.e. how efficiently machine resources are being used, at least out of the box.

DAAL's performance derives from efficient parallelization, cache friendly blocking and effective use of vector instructions.

@amueller
Member

amueller commented Jul 3, 2019

Numbers in https://intelpython.github.io/scikit-learn_bench/ show ratios of the execution time of Python code to the execution time of native C++ code that also uses DAAL.

Sorry I don't understand this sentence.

@oleksandr-pavlyk
Contributor

@amueller daal4py.sklearn.cluster.KMeans correctly implements the n_init keyword, and defaults to the same value of 10.

While passing the test suite does not imply the results are identical, every effort is made to ensure the correctness of DAAL's implementation, and therefore DAAL's results are close to scikit-learn's, within the bounds implied by the solver's tolerance.

Sometimes results can differ. For example, in the k-means algorithm, when dealing with an empty cluster after Lloyd's update, DAAL and sklearn may choose different points when there are several equidistant candidates. See intel/scikit-learn-intelex#69

@oleksandr-pavlyk
Contributor

Numbers in https://intelpython.github.io/scikit-learn_bench/ show ratios of the execution time of Python code to the execution time of native C++ code that also uses DAAL.

Sorry I don't understand this sentence.

I meant to say that we write a Python script that solves an ML problem using scikit-learn, then write C++ code that solves the same problem using DAAL (we check for agreement of these results, which are saved in NPY files). The ratio shows time_native / time_python, both for patched execution and for execution in a Python stack installed from PyPI.
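For concreteness, the ratio described above can be sketched in a few lines of Python. The timings below are made-up numbers, purely to illustrate how the dashboard's figures would be computed:

```python
# Hypothetical wall-clock timings (seconds) for the same ML problem,
# used only to illustrate how the benchmark ratio is computed.
time_native = 1.8          # C++ program calling DAAL directly
time_python_patched = 2.0  # scikit-learn patched with daal4py
time_python_stock = 18.0   # stock scikit-learn installed from PyPI

# The dashboard reports time_native / time_python, so a value of 1.0
# means "as fast as the native C++ implementation".
ratio_patched = time_native / time_python_patched  # close to 1: near-native
ratio_stock = time_native / time_python_stock      # much below 1: far slower
```

With these illustrative numbers, the patched run scores 0.9 and the stock run 0.1, which matches the pattern discussed below: values near 1 mean the Python wrappers are close to the native C++ speed.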

@amueller
Member

amueller commented Jul 3, 2019

Ah so "1" here means the speed of your C++ implementation, and this is the speedup factor of using your python wrappers vs using standard python libraries.
That makes sense, but is a bit hard to understand.
Maybe you should change it to time_python / time_native, so then it would be how much slower the two implementations are than the native implementation?
I guess it depends a bit on whether you want to show that your wrappers are close to your C++ or that your wrappers are faster than standard python libraries.

SVM.predict has a number greater than 1, which would mean wrapping makes it faster, right? So that's within the measurement error? It would be good to have error bars maybe?

@dpinte

dpinte commented Jul 3, 2019

@ogrisel

Side note: the Enthought Canopy distribution does not seem up to date: it only ships Python 3.5 and scikit-learn 0.20.2 as far as I can tell.

A quick note to clarify this. Canopy is not the distribution. The distribution is managed by edm (Enthought Deployment Manager), available here (doc). EDM can be used standalone or embedded in Canopy. EDM supports various flavours of Python (2.7, 3.5, 3.6). As for versions of scikit-learn, the distribution targets quarterly updates for core packages like scikit-learn, rather than continuous updates of packages (knowing we always need a fully working dependency set). 0.21.x should be in soon.

doc/install.rst Outdated

Intel maintains a dedicated conda channel that ships scikit-learn::

conda install -c intel scikit-learn
Member

Does this mean that "conda install scikit-learn" installs a different codebase when pulling from the Intel conda channel?

This would be a bit problematic from a point of view of communication with the community. Basically, it would mean that a package called "scikit-learn" is not the scikit-learn codebase.

Member Author

@ogrisel ogrisel Jul 3, 2019

The scikit-learn code in the intel channel is the same as the original, apart from scikit-learn's __init__, which is changed to trigger the monkeypatching from the daal4py package (with a user notification on stderr).

Member

On PyPI, the package is called "intel-scikit-learn" (https://pypi.org/project/intel-scikit-learn/#description), I think that it would be clearer if the same was done for conda, so you would do conda install intel-scikit-learn

Member Author

@ogrisel ogrisel Jul 5, 2019

That would be a possible solution. Is there a way in conda to make sure that scikit-learn is not installed when intel-scikit-learn is installed? What about projects with a dependency on scikit-learn?

Member

I agree that this is very confusing and will result in headaches on our side and probably the user side.

Member

Indeed this behavior sounds extremely surprising and headache inducing. Changing a conda channel should not provide a version of the code that was never a scikit-learn release, in my opinion.

Member

An update on this discussion: Intel has changed its package to disable monkey-patching by default. In other terms: when installing scikit-learn from the Intel channel, the unmodified scikit-learn code is run by default and the user needs to take a specific action to enable the patches. Hence I think that:

  • We can safely point to the Intel channel
  • We should explain how it works briefly here

@oleksandr-pavlyk, is there a webpage that we can point to (for instance a README somewhere) that describes this process? It might be useful for users.

@ogrisel
Member Author

ogrisel commented Jul 3, 2019

Thanks @dpinte for the update.

@adrinjalali
Member

As a user, I personally would really not like it if I started working on a server, or deployed my code on a server, and the solvers were different than what I expect. So from that perspective, I think we should make it very explicit and more like an opt-in for users to switch the backend solvers. PyDaal already makes it pretty easy with:

import daal4py.sklearn
daal4py.sklearn.patch_sklearn()

As a user, I would rather be sure that when I deploy, it's sklearn, and not intel's sklearn running my code, and yet have the possibility of monkeypatching daal4py into sklearn with the above two lines of code if possible.

From sklearn's side, I think it only makes sense to include the above two lines, or something like "You can make use of Intel's PyDaal backend, as explained in the docs: link", together with the code.

@jorisvandenbossche
Member

I agree with @adrinjalali that the explicit import of daal4py.sklearn and the call to the patch method are clearer, and seem a better recommendation to put in the sklearn docs than installing a patched scikit-learn itself.

@vene
Member

vene commented Jul 6, 2019

I agree with @adrinjalali and @jorisvandenbossche ; there would ideally be only one advertised way to turn on these patches, and explicitly importing seems to be the clearest way for users.

@oleksandr-pavlyk
Contributor

Thank you for your feedback.

Our understanding was that the scikit-learn consortium recognized the existence of vendor builds that include scikit-learn. Vendors package a Python stack, including scikit-learn's dependencies, which can influence its behavior (such as the choice of libraries (MKL or BLAS or NAG), versions of libraries, the compiler used, etc.).

The purpose of our vendor build is to provide users with good performance on Intel architecture. Ideally this would be done by contributing to the respective open source projects, but that takes time, so we patch, hoping to deliver performance to end-users quicker, while working with the communities to upstream changes, or open-sourcing packages like daal4py, mkl_fft, mkl-service, mkl_random, etc.

With daal4py, we are making every effort to maintain correctness in a transparent fashion, by developing, building and testing daal4py as an open source project, using the same CI systems that scikit-learn itself uses. C++ sources of DAAL itself are also publicly available at github.com/Intel/DAAL.

The patching code explicitly informs the user that patching is being done by printing a message to the standard error stream:

In [2]: daal4py.sklearn.patch_sklearn()
Intel(R) DAAL solvers for sklearn enabled: https://intelpython.github.io/daal4py/sklearn.html

This informs the user of the patching even if they were not involved in the decision of which version to install.

The change under review explicitly states that scikit-learn's algorithms are modified:

This version of scikit-learn comes with alternative solvers for
some common estimators. Those solvers come from the DAAL C++ library
and are optimized for multi-core CPUs.

Perhaps "version of scikit-learn" should rather be "binary build of scikit-learn".

Hence users are advised to exercise caution and run tests, both those shipped with the package and their own. This is advisable with every numerical software update, regardless of its provenance, as well as when changing hardware (moving from a laptop to a server).

This is where support for automatic monkey-patching with -m daal4py is really handy. One easy test is to compare runs of python your_application.py with python -m daal4py your_application.py.
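The kind of agreement check suggested above can be sketched with plain NumPy: compare the saved outputs of the two runs within the solver's tolerance rather than bit-for-bit. The arrays and tolerance below are illustrative placeholders, not real benchmark output:

```python
import numpy as np

# Hypothetical cluster centers produced by two runs of the same K-Means
# problem: one with stock scikit-learn, one with the daal4py solvers.
# In a real check these would be loaded from the saved NPY files.
centers_stock = np.array([[0.10, 1.00], [4.02, 3.98]])
centers_daal = np.array([[0.10, 1.00], [4.02, 3.98]])

# Results are only expected to agree up to the solver's tolerance (and,
# as noted above, cluster labels may be permuted or tie-broken differently
# in real runs), so compare with an explicit tolerance, not exact equality.
tol = 1e-4
agree = np.allclose(centers_stock, centers_daal, atol=tol)
print("results agree within tolerance:", agree)
```

This is a minimal sketch; a production check would also align permuted clusters before comparing.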

In conclusion, I'd like to point out that serving a patched scikit-learn on conda channels follows the example of the NumPy project, where both the 'defaults' and the 'intel' conda channels serve a significantly patched NumPy.

The reason the defaults channel serves a patched NumPy has to do with the default choice of Intel(R) MKL as the BLAS/LAPACK implementation; given that, further use of MKL to optimize universal functions by calling MKL VML functions, or FFTs by calling mkl_fft, makes sense. The NumPy community has a dedicated 'Anaconda/Intel' label for issues related to these binaries. The defaults channel also serves an unpatched NumPy that uses OpenBLAS, which can be conveniently installed with conda install nomkl.

I think it is very important that the defaults conda channel continues providing unpatched scikit-learn binaries.

Intel currently modifies scikit-learn's sklearn/__init__.py to apply patches if daal4py is installed, unless the user asked to not do so:

--- a/sklearn/__init__.py
+++ b/sklearn/__init__.py
@@ -66,6 +66,16 @@ else:

     __check_build  # avoid flakes unused variable error

+    import os as _os
+    if _os.environ.get('USE_DAAL4PY_SKLEARN', True) in [True, '1', 'y', 'yes', 'Y', 'YES', 'Yes', 'true', 'True', 'TRUE']:
+        try:
+            from daal4py.sklearn import patch_sklearn
+            patch_sklearn()
+            del patch_sklearn
+        except ImportError:
+            pass
+    del _os
+
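The effect of that environment check can be illustrated in isolation (a self-contained sketch of the same membership test, outside of sklearn/__init__.py; the helper name is made up for illustration):

```python
import os

# Values accepted as "enabled", copied from the patch above. Note the
# default True: in this build, patching is on unless the user opts out.
TRUTHY = [True, '1', 'y', 'yes', 'Y', 'YES', 'Yes', 'true', 'True', 'TRUE']

def daal4py_patching_enabled(environ=os.environ):
    # environ.get returns a string when the variable is set; when unset,
    # the default True is returned, which is in TRUTHY.
    return environ.get('USE_DAAL4PY_SKLEARN', True) in TRUTHY

print(daal4py_patching_enabled({}))                              # unset -> enabled
print(daal4py_patching_enabled({'USE_DAAL4PY_SKLEARN': 'yes'}))  # enabled
print(daal4py_patching_enabled({'USE_DAAL4PY_SKLEARN': '0'}))    # disabled
```

Note that any unrecognized value (including '0' or 'no') falls outside the list and disables patching; only the listed spellings enable it.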

Perhaps within the conda ecosystem there is room for two (or more) flavors of scikit-learn packages: a patched one that depends on daal4py and an unpatched one, with the understanding that distribution vendors can choose which one gets served by default.

As is the case with NumPy, the conda packages may have different names, but they would still provide identically named Python packages.

@adrinjalali
Member

@oleksandr-pavlyk thanks for the detailed review.

I understand how being able to monkey-patch the library and use your backend can be useful for users. What I do not understand from your discussion is why it is necessary to have/advertise a different sklearn package instead of telling people how to activate the monkey-patch with two lines of code, which seems to be the consensus and the preferred way among our contributors, as you can see in this thread.

To be clear, my point is not that you shouldn't have the monkey-patched version of the library in your channel; I'm just saying that from our perspective, it seems better for us and our users to have them explicitly activate the monkey-patch with a two-liner.

@oleksandr-pavlyk
Contributor

The main conda channel serves binaries of scikit-learn compiled from unchanged community-released sources.

It takes a user's active decision to install scikit-learn from an alternative channel, such as Intel channel which serves packages for the Intel(R) Distribution for Python*.

The value proposition of the distribution is "improved out of the box performance with no code changes", which is why Python packages on Intel channel are patched.

@GaelVaroquaux
Member

GaelVaroquaux commented Jul 22, 2019 via email

@adrinjalali
Member

It looks like we have a consensus on how we want this to go forward. Do we still have open concerns/questions?

@GaelVaroquaux
Member

We are in discussion with Intel to see if something can be done on their side so that the difference between scikit-learn and the Intel module is more clear. Let's leave this open.

@GaelVaroquaux
Member

A related discussion for the larger context of conda-forge (posting here for reference: conda-forge/conda-forge.github.io#883)

@oleksandr-pavlyk
Contributor

oleksandr-pavlyk commented Nov 7, 2019

Please be advised that the most recent scikit-learn conda package on Intel channel no longer applies any patches by default.

When installing the conda package:

$ conda create -n t_sk_d4poff -c intel scikit-learn daal4py
<< elided >>
Preparing transaction: done
Verifying transaction: done
Executing transaction: - b"\n  INSTALLED PACKAGE OF SCIKIT-LEARN CAN BE ACCELERATED USING DAAL4PY. \n\n  PLEASE SET 'USE_DAAL4PY_SKLEARN' ENVIRONMENT VARIABLE TO 'YES' TO ENABLE THE ACCELERATION. \n\n  FOR EXAMPLE:\n\n      $ USE_DAAL4PY_SKLEARN=YES python\n\n\n"
done
#
# To activate this environment, use
#
#     $ conda activate t_sk_d4poff
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Importing scikit-learn no longer loads daal4py by default:

(t_sk_d4poff) [15:05:32 vmlin test_tmp]$ python -c "import sklearn"
(t_sk_d4poff) [15:05:34 vmlin test_tmp]$ USE_DAAL4PY_SKLEARN=YES python -c "import sklearn"
Intel(R) Data Analytics Acceleration Library (Intel(R) DAAL) solvers for sklearn enabled: https://intelpython.github.io/daal4py/sklearn.html

oleksandr-pavlyk added a commit to oleksandr-pavlyk/scikit-learn that referenced this pull request Nov 8, 2019
NumPy has `_distributor_init.py`:

https://github.com/numpy/numpy/blob/master/numpy/_distributor_init.py

This PR adds one for scikit-learn. The intent is to create a place
to add distributor specific logic, e.g. to implement behavior like in

scikit-learn#14247 (comment)
@ogrisel
Member Author

ogrisel commented Nov 22, 2019

Alright, I have updated this PR to be rebased on top of the current master.

As this distribution of scikit-learn no longer enables the daal4py solvers by default but requires the user to set an environment variable or make an explicit code change, I think the previously expressed maintenance concerns have been addressed.

I decided not to put too many details on how to enable the daal4py solvers but rather to link to the daal4py documentation, so that we do not need to change the scikit-learn documentation if the way to enable them changes in the future.

Please let me know what you think.

ogrisel and others added 2 commits November 22, 2019 16:45
Co-Authored-By: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Member

@adrinjalali adrinjalali left a comment

I'm happy with this now.

Member

@thomasjpfan thomasjpfan left a comment

I am okay with the current state of this PR.

@rth
Member

rth commented Dec 2, 2019

Should we merge this? I don't think the current version of the PR is too controversial.

@glemaitre
Member

@rth @ogrisel wanted to use this PR to check that the twitter sklearn_commit is working properly

Member

@NicolasHug NicolasHug left a comment

LGTM

@glemaitre glemaitre merged commit e28e0db into scikit-learn:master Dec 2, 2019
@glemaitre
Member

@adrinjalali Could you cherry-pick this PR for 0.22?

@adrinjalali
Member

will do

@ogrisel ogrisel deleted the intel-conda-channel branch December 2, 2019 15:55
adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Dec 2, 2019
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020