Mention intel conda channel in installation doc #14247
Conversation
Side note: the Enthought Canopy distribution does not seem up to date: it only ships Python 3.5 and scikit-learn 0.20.2 as far as I can tell. Maybe we should stop mentioning it in this installation documentation.
Also note: when the monkeypatches are applied, a notice on stderr informs the user about the active DAAL solvers:
Does daal support the same arguments as scikit-learn now and provide the same results? Last time I checked there were major issues with compatibility. I don't see the benefit of monkey-patching sklearn with alternative implementations instead of just providing them.
Last time I checked, the benchmarks were totally incompatible comparisons and the daal team couldn't tell me what the algorithms were actually doing (whether k-means does random restarts, for example), so I'm a bit hesitant to trust them.
Can you explain what the numbers in the benchmark mean?
Yes, it does. Patched sklearn passes sklearn's test suite with the exception of a few tests specified in
That doesn't mean that the results are the same.
Numbers in https://intelpython.github.io/scikit-learn_bench/ show ratios of the execution time of Python code to the execution time of native C++ code that also uses DAAL. The intent is to show native efficiency, i.e. how efficiently machine resources are being used, at least out of the box. DAAL's performance derives from efficient parallelization, cache-friendly blocking and effective use of vector instructions.
Sorry, I don't understand this sentence.
@amueller While passing test suite tests does not imply the results are the same, every effort is made to ensure correctness of DAAL's implementation, and therefore DAAL's result is close to scikit-learn's result, within boundaries implied by the solver's tolerance. Sometimes results can differ. For example, for the k-means algorithm, when dealing with a vacuous cluster after Lloyd's update, DAAL and sklearn may choose different points when there are several equidistant candidates. See intel/scikit-learn-intelex#69
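To illustrate the kind of ambiguity described above, here is a toy sketch (not DAAL's or scikit-learn's actual re-seeding code; `farthest_points` is an invented helper): when an empty cluster must be re-seeded with the point farthest from a center and several points are equidistant, different, equally valid implementations may pick different candidates.

```python
def farthest_points(points, center):
    """Return all points tied for the maximum distance from `center`.

    Toy 1-D illustration: a tie here means an implementation has to
    break it somehow, and two correct implementations may break it
    differently.
    """
    dists = [abs(p - center) for p in points]
    m = max(dists)
    return [p for p, d in zip(points, dists) if d == m]

candidates = farthest_points([0.0, 2.0, -2.0, 1.0], center=0.0)
print(candidates)  # two equidistant candidates: [2.0, -2.0]
```

Both 2.0 and -2.0 are defensible choices, which is why bitwise-identical results across implementations are not guaranteed even when both are correct.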
I meant to say that we write a Python script that solves an ML problem using scikit-learn, then write C++ code that solves the same problem using DAAL (we check for agreement of these results, which are saved in NPY files). The ratio shows
Ah, so "1" here means the speed of your C++ implementation, and this is the speedup factor of using your Python wrappers vs using standard Python libraries. SVM.predict has a number greater than 1, which would mean wrapping makes it faster, right? So is that within the measurement error? It would be good to have error bars, maybe?
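A minimal sketch of how such error bars could be obtained: repeat each timing, then propagate the spread into the ratio with first-order error propagation. The workloads below are toy stand-ins, not the actual benchmark code.

```python
import statistics
import time

def time_call(fn, repeats=5):
    """Time fn() several times; return (mean, stdev) of the samples in seconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

# Toy workloads standing in for the "native C++" and "python" timings
# measured by the benchmark; replace with the real benchmark calls.
native_mean, native_sd = time_call(lambda: sum(range(100_000)))
python_mean, python_sd = time_call(lambda: sum(range(200_000)))

ratio = python_mean / native_mean
# First-order error propagation for a ratio of independent measurements:
# relative errors add in quadrature.
ratio_err = ratio * ((python_sd / python_mean) ** 2 +
                     (native_sd / native_mean) ** 2) ** 0.5
print(f"ratio = {ratio:.2f} +/- {ratio_err:.2f}")
```

With an error bar attached, a ratio like 1.05 +/- 0.10 would be indistinguishable from 1, answering the "within the measurement error?" question directly.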
A quick note to clarify this. Canopy is not the distribution. The distribution is managed by edm (Enthought Deployment Manager), available here (doc). EDM can be used standalone or embedded in Canopy. EDM supports various flavours of Python (2.7, 3.5, 3.6). As for versions of scikit-learn, the distribution is targeting quarterly updates for core packages like scikit-learn, rather than continuous updates of packages (knowing we always need a fully working dependency set).
doc/install.rst (outdated):

    Intel maintains a dedicated conda channel that ships scikit-learn::

        conda install -c intel scikit-learn
Does this mean that `conda install scikit-learn` installs a different codebase when pulling from the intel conda channel?
This would be a bit problematic from the point of view of communication with the community. Basically, it would mean that a package called "scikit-learn" is not the scikit-learn codebase.
The scikit-learn code in the intel channel is the same as the original, apart from scikit-learn's `__init__`, which is changed to trigger the monkeypatching from the daal4py package (with a user notification on stderr).
On PyPI, the package is called "intel-scikit-learn" (https://pypi.org/project/intel-scikit-learn/#description). I think it would be clearer if the same were done for conda, so you would do `conda install intel-scikit-learn`.
That would be a possible solution. Is there a way in conda to make sure that scikit-learn is not installed when intel-scikit-learn is installed? What about projects with a dependency on scikit-learn?
I agree that this is very confusing and will result in headaches on our side and probably the user side.
Indeed this behavior sounds extremely surprising and headache inducing. Changing a conda channel should not provide a version of the code that was never a scikit-learn release, in my opinion.
An update on this discussion: Intel has changed to disable monkey-patching by default. In other words: when installing scikit-learn from the Intel channel, unmodified scikit-learn code is run by default, and the user needs to take a specific action to enable the patches. Hence I think that:
- We can safely point to the Intel channel
- We should briefly explain how it works here

@oleksandr-pavlyk, is there a webpage that we can point to (for instance a README somewhere) that describes this process? It might be useful for users.
Thanks @dpinte for the update.
As a user, I personally would really not like it if I started working on a server, or deployed my code on a server, and the solvers were different from what I expect. So from that perspective, I think we should make it very explicit and more like an opt-in for users to switch the backend solvers. PyDaal already makes it pretty easy with:

```python
import daal4py.sklearn
daal4py.sklearn.patch_sklearn()
```

As a user, I would rather be sure that when I deploy, it's sklearn, and not Intel's sklearn, running my code, and yet have the possibility of monkeypatching daal4py into sklearn with the above two lines of code if possible. From sklearn's side, I think it only makes sense to include the above two lines, or something like: "You can make use of Intel's PyDaal backend, as explained in the docs: link", and the code.
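For scripts that should run both with and without the accelerator installed, the opt-in call can be guarded behind an import check. This is a sketch: `maybe_patch_sklearn` is an invented wrapper around the `patch_sklearn` call shown above, not part of any package.

```python
def maybe_patch_sklearn():
    """Apply the daal4py patches if the package is available.

    Returns True if the patches were applied, False if daal4py is not
    installed. This keeps the opt-in explicit while letting the same
    script run unmodified on machines without the accelerator.
    """
    try:
        import daal4py.sklearn  # optional accelerator package
    except ImportError:
        return False
    daal4py.sklearn.patch_sklearn()
    return True

patched = maybe_patch_sklearn()
print("daal4py patches active:", patched)
```

Either way, subsequent `from sklearn... import ...` statements work unchanged; only the backing solvers differ.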
I agree with @adrinjalali that the explicit import of
I agree with @adrinjalali and @jorisvandenbossche; there would ideally be only one advertised way to turn on these patches, and an explicit import seems to be the clearest way for users.
Thank you for your feedback. Our understanding was that the scikit-learn consortium recognized the existence of vendor builds that include scikit-learn. Vendors package a Python stack, including dependencies of scikit-learn, which can influence its behavior (such as the choice of libraries (MKL or BLAS or NAG), versions of libraries, the compiler used, etc.). The purpose of our vendor build is to provide users with good performance on Intel architecture. Ideally this would be done by contributing to the respective open source projects, but that takes time, so we patch, hoping to deliver performance to end-users quicker, while working with communities to upstream changes, or open-sourcing packages like daal4py, mkl_fft, mkl-service, mkl_random, etc. With daal4py, we are making every effort to maintain correctness in a transparent fashion, by developing, building and testing daal4py as an open source project, using the same CI systems that scikit-learn itself uses. C++ sources of DAAL itself are also publicly available at github.com/Intel/DAAL. Patching the code explicitly informs the user that it is being done by printing a message to the standard error stream:

This informs the user of the patching even if he or she was not involved in the decision of which version to install. The change under review explicitly states that scikit-learn's algorithms are modified:

Perhaps "version of scikit-learn" should rather read "binary build of scikit-learn". Hence users are advised to exercise caution and run tests, both those coming with the package as well as their own. This is advised with every numerical software update, regardless of its provenance, as well as when changing hardware (moving from a laptop to a server). This is where support for automatic monkey-patching with

In conclusion, I'd like to point out that serving a patched scikit-learn on conda channels follows suit of the NumPy project, where both the 'defaults' and the 'intel' conda channels serve a significantly patched NumPy. The reason for the defaults channel serving patched NumPy has to do with the default choice of Intel(R) MKL as the BLAS/LAPACK implementation, so further use of MKL to optimize universal functions by calling MKL VML functions, or FFT by calling

I think it is very important that the default conda channel continues providing unpatched scikit-learn binaries. Intel currently modifies scikit-learn's

Perhaps within the conda ecosystem there is room for two (or more) flavors of scikit-learn packages: a patched one that depends on

As is the case with NumPy, conda packages may have different names, but they'd still provide identically named Python packages.
@oleksandr-pavlyk thanks for the detailed review. I understand how being able to monkey-patch the library and use your backend can be useful for users. What I do not understand from your discussion is why it is necessary to have/advertise a different sklearn package instead of telling people how to activate the monkey-patch with two lines of code, which seems to be the consensus and the preferred way among our contributors, as you see in this thread. To be clear, my point is not that you shouldn't have the monkey-patched version of the library in your channel; I'm just saying that from our perspective, it seems better for us and our users to have them explicitly activate the monkey-patch with a two-liner.
The main conda channel serves binaries of scikit-learn compiled from unchanged community-released sources. It takes a user's active decision to install scikit-learn from an alternative channel, such as the Intel channel, which serves packages for the Intel(R) Distribution for Python*. The value proposition of the distribution is "improved out of the box performance with no code changes", which is why Python packages on the Intel channel are patched.
> The value proposition of the distribution is "improved out of the box performance with no code changes", which is why Python packages on the Intel channel are patched.

Distributing patched code without changing the name is not transparent to the user. It is common practice for major software projects to forbid this:

- Distributing a browser called "Firefox" requires a written agreement from the Mozilla foundation: https://www.mozilla.org/en-US/foundation/trademarks/distribution-policy/
- Google Chrome is a trademark owned by Google, which distributes its own build; Chromium browsers are modified builds: https://en.wikipedia.org/wiki/Chromium_(web_browser)#Licensing
It looks like we have a consensus on how we want this to go forward. Do we still have open concerns/questions?
We are in discussion with Intel to see if something can be done on their side so that the difference between scikit-learn and the Intel module is more clear. Let's leave this open. |
A related discussion for the larger context of conda-forge (posting here for reference |
Please be advised that the most recent scikit-learn conda package on the Intel channel no longer applies any patches by default. When installing the conda package:

Importing scikit-learn no longer loads daal4py by default:
NumPy has `_distributor_init.py`: https://github.com/numpy/numpy/blob/master/numpy/_distributor_init.py This PR adds one for scikit-learn. The intent is to create a place for distributor-specific logic, e.g. to implement behavior like in scikit-learn#14247 (comment)
Alright, I have updated this PR to be rebased on top of the current master. As this distribution of scikit-learn no longer enables the daal solvers by default, but requires the user to set an environment variable or make an explicit code change, I think the previously expressed concerns about maintenance have been addressed. I decided not to put too many details on how to enable the daal4py solvers, but rather to link to the daal4py documentation, so that we do not need to change the scikit-learn documentation if the way to enable them changes in the future. Please let me know what you think.
Co-Authored-By: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
I'm happy with this now.
I am okay with the current state of this PR.
Should we merge this? I don't think the current version of the PR is too controversial.
LGTM
@adrinjalali Could you cherry-pick this PR for 0.22?
will do
Update the scikit-learn installation doc to mention the intel conda channel maintained by @oleksandr-pavlyk and his colleagues. This channel ships a scikit-learn version with the daal4py solvers.
https://intelpython.github.io/daal4py/sklearn.html
For information, the daal4py solvers often bring significant speed-ups on Xeon machines with many cores, in particular for K-Means and SVMs:
https://github.com/IntelPython/scikit-learn_bench
The speed difference for K-Means should be reduced once #11950 is merged. The speed difference for SVM might be reduced if we add multi-threading to the libsvm solvers, potentially via OpenMP.