
[MRG+1] Releasing the GIL in the inner loop of coordinate descent #3102

Merged
11 commits merged into scikit-learn:master from MechCoder:gil-enet on Jun 15, 2014

Conversation

MechCoder
Member

  • release the GIL in the dual gap check (need to replace numpy calls by BLAS equivalents)
  • release the GIL in the sparse input variant
  • release the GIL for the precomputed Gram matrix variant
  • release the GIL for multi-task variant.
  • benchmark the scaling, for instance by hacking the cross_val_score function to make it possible to use the threading backend instead of multiprocessing

@MechCoder
Member Author

@ogrisel I tried cherry-picking your commit, but there are nasty merge conflicts in the cd_fast.c file.

@MechCoder
Member Author

Ah well. Fixed them. Sorry for the noise

@larsmans
Member

Merge conflicts in generated C files are a matter of calling Cython again (but I guess you figured that out).

@larsmans
Member

Can we close Olivier's old PR? Which one was it?

@MechCoder
Member Author

@larsmans #2980

# return if we reached desired tolerance
break
with nogil:
    for n_iter in range(max_iter):
Member Author

@larsmans A noob doubt: as far as I've read, you can use nogil only when no Python objects are involved. Then how come this works?

Member

The Cython compiler expands for i in range(n), and a few variant constructs, to a C for loop when i and n are properly typed. Check the generated C code to see this at work (it has the input Cython code in comments), or the HTML output from cython -a (click any line to see the corresponding C code).
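
For illustration, a minimal stand-alone sketch (not taken from cd_fast.pyx) of the kind of typed loop that Cython turns into plain C and can therefore run under nogil:

cdef double sum_of_squares(double* a, int n) nogil:
    """Sum of squares of a C array; no Python objects are touched."""
    cdef int i
    cdef double total = 0.0
    for i in range(n):  # with i and n typed, this compiles to a plain C for loop
        total += a[i] * a[i]
    return total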

@coveralls

Coverage Status

Coverage remained the same when pulling 41a0eff on MechCoder:gil-enet into feb6eb1 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 8d92bf9 on MechCoder:gil-enet into 3a4edd0 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 59b2f50 on MechCoder:gil-enet into 3a4edd0 on scikit-learn:master.

@MechCoder
Member Author

@larsmans I'm having trouble testing the function outside the scikit-learn directory because of the cblas.h dependency. I have a few questions that I have no idea about (this is my first time writing a setup.py file).

  1. When scikit-learn is installed, does CBLAS get installed too? I suppose not, because that is why the cblas folder is there in /src, right?
  2. If I need the cblas files externally, what do I need to do?
    This is my directory structure right now
    CyPractise
    • cblas
    • hello.pyx
    • setup.py

Contents of setup.py - https://gist.github.com/MechCoder/11349640
Contents of hello.pyx - https://gist.github.com/MechCoder/0cf4b7a55e1ac6d6c900
[I am still confused about what should be provided as the "include_dirs" and "libraries" arguments in setup.py.]

I might not require it right now, but I might need it later, and I should not be as clueless as I am right now. Thanks.

@larsmans
Member

larsmans commented May 1, 2014

  1. No. The build system looks for a CBLAS implementation. If it doesn't find one, it builds the parts it needs from sklearn/src/cblas, which contains the reference implementation from ATLAS, and links that into the Cython modules that actually use CBLAS.
  2. What you need is a working CBLAS implementation, e.g. sudo apt-get install libatlas-dev on Debian/Ubuntu (if memory serves me, that package got renamed a few times). That should put cblas.h in /usr/include and libcblas.so in /usr/lib. Note that BLAS alone is not enough: CBLAS is the C version, while libblas.so is the Fortran version with a hacky C API in blas.h.

(TODO for me: document this. I learned it by hacking on scikit-learn for several years.)
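
For reference, a minimal setup.py sketch for linking a stand-alone Cython module against a system CBLAS (this assumes cblas.h is in /usr/include and libcblas.so in /usr/lib, and the module name hello is just an example):

# setup.py: minimal sketch, assuming a system CBLAS is installed
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "hello",
    sources=["hello.pyx"],
    include_dirs=[numpy.get_include(), "/usr/include"],  # directory containing cblas.h
    library_dirs=["/usr/lib"],                            # directory containing libcblas.so
    libraries=["cblas"],                                  # link with -lcblas
)

setup(ext_modules=cythonize([ext]))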

@MechCoder
Member Author

@larsmans
Thanks for the quick reply.
Regarding your points:

  1. By the same logic, won't just copy-pasting the cblas directory anywhere else work, provided the setup.py file contains adequate information?
  2. It seems I do have cblas.h in /usr/include and libcblas.so in /usr/lib! I'm convinced that I did something wrong in the setup.py file (https://gist.github.com/MechCoder/11349640), especially with the libraries and/or include_dirs keywords.



cdef double max(int n, double* a) nogil:
    """np.max(np.abs(a))"""
Member

I suppose you mean np.max only here

@MechCoder
Member Author

I plotted some benchmarks, and it does seem that releasing the GIL offers a considerable advantage in general.

Code - https://gist.github.com/MechCoder/b18c519ce018dc47b8b3

[plot: withgil]
[plot: withnogil]

@agramfort @larsmans @ogrisel WDYT?

EDIT: the first one is with gil and the second one is after releasing it.

@MechCoder
Member Author

Would it be good to add these benchmarks to sklearn/benchmarks?

@agramfort
Member

Sorry, but I don't understand these plots. What are the colors and the y label? Is there a common ylim? Which is with and which is without the GIL?

@MechCoder
Member Author

@agramfort

y label: the time.
with gil: the implementation in master.
without gil: the implementation in this branch, after replacing the numpy calls with CBLAS.
colors: different values of n_alphas (5, 10, 100, 500).

Basically I have tested for different values of n_alphas and n_l1_ratios.

@agramfort
Member

I don't see the figure names on GitHub, so I just have to guess which is which.

You should fix the random_state in make_regression to make the results comparable.

If you do so, then it should always be faster with a higher number of alphas; for now it is still messy. The same is true for l1_ratios: the more you have, the more time it should take.

@MechCoder
Member Author

@agramfort Umm, surprisingly it does speed up for a higher number of alphas. Please run this small bit of code:

import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

n_alphas = [5, 10, 100, 500]
X, y = make_regression(random_state=np.random.RandomState(0))

times = []
for alpha in n_alphas:
    # refit with a different number of alphas each time and record the timing
    clf = ElasticNetCV(n_alphas=alpha)
    t = time.time()
    clf.fit(X, y)
    times.append(time.time() - t)

print(times)

I'm guessing that this might be due to the fact that Parallel has considerable overhead when run for less computationally expensive operations?

@agramfort
Member

Yes, it's the Parallel overhead. You should bench on mid-size problems too, with something like n_samples = 500 and n_features = 2000 or more.

@MechCoder
Member Author

Please ignore the previous comment.

In master:
l1_ratio - 1
n_alphas - 5: 0.7737178802490234
n_alphas - 10: 1.0900750160217285
n_alphas - 15: 3.704710006713867
n_alphas - 100: 6.64526891708374

l1_ratio - 5
n_alphas - 5: 6.46356987953186
n_alphas - 10: 7.89543890953064
n_alphas - 15: 22.287801027297974
n_alphas - 100: 37.26467323303223

In this branch:
l1_ratio - 1
n_alphas - 5: 0.7684640884399414
n_alphas - 10: 1.130321979522705
n_alphas - 15: 3.6571431159973145
n_alphas - 100: 6.659605979919434

l1_ratio - 5
n_alphas - 5: 6.491890907287598
n_alphas - 10: 7.760898113250732
n_alphas - 15: 22.719208002090454
n_alphas - 100: 37.510425090789795

Is the time gain enough? It also seems to slow down in certain cases.

@ogrisel
Member

ogrisel commented May 20, 2014

There is no gain to expect just from releasing the GIL, and your benchmarks seem to show this, although they would need to be repeated several times and displayed as bar plots with error bars (for instance the standard deviation) to confirm it.

Releasing the GIL just makes it possible to use the threading backend of joblib without having concurrent threads lock one another.

To use the threading backend instead of the multiprocessing backend for the ElasticNetCV class, you can add backend='threading' to the Parallel() call in the LinearModelCV class of coordinate_descent.py.
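
For concreteness, a toy sketch of what the backend switch looks like with joblib (illustrative only; the real change would go into the existing Parallel() call inside LinearModelCV):

# toy example: dispatch independent enet_path calls to threads instead of processes
from sklearn.externals.joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path

X, y = make_regression(n_samples=500, n_features=2000, random_state=0)

results = Parallel(n_jobs=4, backend="threading")(
    delayed(enet_path)(X, y, l1_ratio=l1_ratio, n_alphas=50)
    for l1_ratio in [0.1, 0.5, 0.9])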

@MechCoder
Member Author

@ogrisel Is that likely to give a speed gain? The documentation of Parallel says that "threading" is a very low-overhead backend, but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

@ogrisel
Member

ogrisel commented May 20, 2014

@ogrisel Is that likely to give a speed gain? The documentation of Parallel says that "threading" is a very low-overhead backend, but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

Well, this is precisely why we want to release the GIL in the speed-critical part of the Cython code of the ElasticNetCV class: to be able to use the threading backend and get rid of the multiprocessing overhead. How many cores do you have on your box? It would be interesting to check with n_jobs in [1, 2, 4, 8], both for master with multiprocessing and for this branch with the threading backend enabled.

@ogrisel
Member

ogrisel commented May 20, 2014

You also need to perform enough CV iterations to be able to leverage the inner parallelization, e.g. cv=5 or cv=10.
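
Something along these lines should do (a sketch; the dataset size follows the earlier suggestion and the timings are obviously machine-dependent):

import time

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=500, n_features=2000, random_state=0)

for n_jobs in [1, 2, 4, 8]:
    clf = ElasticNetCV(cv=10, n_jobs=n_jobs)
    tic = time.time()
    clf.fit(X, y)
    print("n_jobs=%d: %.2f s" % (n_jobs, time.time() - tic))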

@agramfort
Member

+1 on @ogrisel's comments. You have a chance to see a benefit in the CV case with threading.

@MechCoder
Member Author

@ogrisel, @agramfort I benched with cv=10 for n_cores in [1, 2, 4]. It seems that threading (after releasing the GIL) does have an advantage.

Multiprocessing in this branch vs master (after releasing the GIL):
[plot: benching with multiprocessing]

Threading in this branch vs master (after releasing the GIL):
[plot: benching with threading]

Threading in this branch vs multiprocessing in master:
[plot: threading vs multiprocessing]

Do I get a go ahead to continue with the rest of this PR?

@GaelVaroquaux
Member

Is there a difference in memory consumption? Use a large dataset to see this, and maybe the memory-profiler.

I would expect a difference in memory consumption more than in speed.
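
One way to measure this with memory_profiler (a sketch: the dataset size is arbitrary, and include_children matters when comparing the multiprocessing backend against threading):

from memory_profiler import memory_usage
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=10000, n_features=1000, random_state=0)
clf = ElasticNetCV(cv=10, n_jobs=4)

# sample memory while fitting; include_children accounts for worker processes
mem = memory_usage((clf.fit, (X, y), {}), include_children=True)
print("peak memory: %.1f MiB" % max(mem))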

@MechCoder
Member Author

@ogrisel I pushed a commit that releases the GIL for the multi-task variant. Do you mind having a quick look before I bench?

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 75b48cd on MechCoder:gil-enet into 662fe00 on scikit-learn:master.

@MechCoder
Member Author

@ogrisel For the Gram variant, after the minor changes :)
In this branch

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 9.771090110143026 16.637080272038776 68.49941301345825 130.57437300682068
2 8.062992572784424 11.647780736287435 38.32584532101949 70.34560592969258
4 6.94205363591512 7.835591952006022 24.266483386357624 44.059422969818115
8 7.067240556081136 7.970296223958333 17.524919430414837 31.352068662643433

In master

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 9.817625602086386 16.563613414764404 68.58379038174947 131.2305454413096
2 8.250134706497192 11.784696340560913 38.548853953679405 70.61840438842773
4 7.1523966789245605 7.986500024795532 24.393589973449707 44.598851362864174
8 7.2429977258046465 8.108551740646362 17.582454045613606 31.46389929453532

@MechCoder
Member Author

@GaelVaroquaux @ogrisel @agramfort

Some benchmarks for the multi-task variant: X is 10000 x 1000, y is 10000 x 3.

In this branch

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 12.00 23.53 100.70 191.36
2 7.30 12.64 51.30 97.28
4 5.09 8.17 31.87 59.88
8 4.65 6.23 22.16 41.52

In master

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 11.79 23.13 98.85 191.05
2 7.29 12.98 50.52 96.71
4 5.04 8.27 31.55 59.35
8 4.61 6.36 22.54 40.94

Memory benefits:

In this branch:
[memory profile: multi_thread]

In master:
[memory profile: multi_process]

I've updated the PR from [WIP] to [MRG]

@MechCoder MechCoder changed the title [WIP] Releasing the GIL in the inner loop of coordinate descent [MRG] Releasing the GIL in the inner loop of coordinate descent Jun 12, 2014
# norm_cols_X = (np.asarray(X) ** 2).sum(axis=0)
for ii in range(n_features):
    for jj in range(n_samples):
        norm_cols_X[ii] += X[jj][ii]**2
Member

If you run cython -a sklearn/linear_model/cd_fast.pyx you will see that this line causes some Python wrapping overhead. I suppose this is caused by the double indexing X[jj][ii], which is probably not automatically optimized by Cython.

X[jj, ii] makes it possible to remove that overhead.

The same problem also appears at lines 640, 702 and 705.

Also: PEP8 convention would favor the X[jj, ii] ** 2 notation (with whitespaces around **).
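
In other words, the loop would become something like:

# norm_cols_X = (np.asarray(X) ** 2).sum(axis=0)
for ii in range(n_features):
    for jj in range(n_samples):
        norm_cols_X[ii] += X[jj, ii] ** 2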

@ogrisel
Member

ogrisel commented Jun 13, 2014

Once my last comment has been taken into account, +1 for merging. Thanks for the contrib and the extensive benchmarks @MechCoder!

@MechCoder
Member Author

@ogrisel Can I add my name at the top of the cd_fast.pyx file?

@ogrisel
Member

ogrisel commented Jun 13, 2014

@MechCoder sure.

@ogrisel
Member

ogrisel commented Jun 13, 2014

Can you please re-run the multitask threading benchmark to check the impact of this last optim?

@MechCoder
Member Author

Yes, running it.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 2c83678 on MechCoder:gil-enet into 662fe00 on scikit-learn:master.

@MechCoder
Member Author

@ogrisel It doesn't look like there is much difference. A slight improvement.
In this branch

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 11.6339577039 22.9356119633 98.3196704388 190.012413581
2 7.178495725 12.4504806201 50.3672692776 96.4518989722
4 4.99971763293 8.17943700155 31.2485917409 59.3953860601
8 4.64213999112 6.17599566778 21.8335360686 40.7463400364

In master

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 11.6491670609 22.7019441128 98.3627891541 190.17698129
2 7.12558722496 12.2409980297 50.3131773472 96.3868060112
4 4.95815428098 8.07793005308 31.2579707305 58.8344629606
8 4.52167503039 6.06676959991 21.6907103856 40.7425248623

@MechCoder
Member Author

Ping @agramfort Please have a look :)

@ogrisel ogrisel changed the title [MRG] Releasing the GIL in the inner loop of coordinate descent [MRG+1] Releasing the GIL in the inner loop of coordinate descent Jun 14, 2014
@agramfort
Member

LGTM!

+1 for merge

@agramfort
Member

@ogrisel shall I merge?

ogrisel added a commit that referenced this pull request Jun 15, 2014
[MRG+1] Releasing the GIL in the inner loop of coordinate descent
@ogrisel ogrisel merged commit 1081160 into scikit-learn:master Jun 15, 2014
@ogrisel
Member

ogrisel commented Jun 15, 2014

I just did :)

@MechCoder MechCoder deleted the gil-enet branch June 16, 2014 04:18
@MechCoder
Member Author

@ogrisel @agramfort Thanks a lot for your help. Learnt a lot from this.

@MechCoder
Member Author

@ogrisel @agramfort One of my GSoC goals is to benchmark cyclic vs. random coordinate descent and to test whether it converges faster or not.

For cyclic descent, I need to permute the indices of the features once before every outer iteration, along the lines of this:

for n_iter in range(max_iter):
    rng = np.random.RandomState(n_iter)
    f_shuffle = rng.permutation(n_features)
    for i in f_shuffle:
         .... so on

Sorry to be a noob: since we have released the GIL and a numpy call would cause overhead, is there anything that would help me replace it with a C call? Thanks.

@ogrisel
Member

ogrisel commented Jun 17, 2014

No need to create a new RandomState instance at each iteration; you can just pass it as an argument to the Cython function and call rng = check_random_state(self.random_state) in the body of the fit method of the parent Python classes, if that is not already done this way. This is a very generic pattern throughout the library.

Now, to answer your question about calling rng.permutation in a nogil block, you have two solutions:

  • locally acquire the GIL just around the call to rng.permutation:
        with gil:
             f_shuffle = rng.permutation(n_features)

  • implement the permutation GIL-free, using a pure-Cython RNG as the tree code does.

The first solution may introduce some significant lock contention on very small problems, so ideally we should try to implement the GIL-free solution. That would probably involve refactoring the Cython source of the project to factor the pure-Cython RNG out of the tree code to make it more reusable.

@MechCoder
Member Author

@ogrisel I had a quick look at the Fisher-Yates algorithm and the tree code. Is the algorithm used a special adaptation of Fisher-Yates? I ask because I see comments like:

    # skip the CPU intensive evaluation of the impurity criterion for
    # features that were already detected as constant 

In that case, we might need to change it a bit for the linear models.

@arjoly
Member

arjoly commented Jun 18, 2014

In the tree code, there are specific changes to take constant features into account in a split, but I don't think you need that sort of thing for linear models. I think what @ogrisel suggests is to factorize the RNG (our_rand_r, rand_double and rand_int).

@ogrisel
Member

ogrisel commented Jun 18, 2014

I think what @ogrisel suggests is to factorize the RNG (our_rand_r, rand_double and rand_int)

Yes exactly. And then implement your own Fisher-Yates loop in the stochastic CD loop using that refactored rand_int function.
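
For example, a GIL-free Fisher-Yates shuffle could look roughly like this (a sketch: rand_int and the UINT32_t seed type stand for the refactored tree-code helpers, whose exact signatures would be settled during that refactoring):

cdef void shuffle_features(int* f_shuffle, int n_features,
                           UINT32_t* rand_r_state) nogil:
    """Fisher-Yates shuffle of the feature indices; no Python objects needed."""
    cdef int i, j, tmp
    for i in range(n_features):
        f_shuffle[i] = i
    for i in range(n_features - 1, 0, -1):
        j = rand_int(i + 1, rand_r_state)  # uniform draw from [0, i]
        tmp = f_shuffle[i]
        f_shuffle[i] = f_shuffle[j]
        f_shuffle[j] = tmp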
