
[MRG+1] Releasing the GIL in the inner loop of coordinate descent #3102

Merged
11 commits merged into scikit-learn:master from MechCoder:gil-enet on Jun 15, 2014

Conversation

MechCoder
Member

  • release the GIL in the dual gap check (need to replace numpy calls by BLAS equivalents)
  • release the GIL in the sparse input variant
  • release the GIL for the precomputed Gram matrix variant
  • release the GIL for multi-task variant.
  • benchmark the scaling, for instance by hacking the cross_val_score function to make it possible to use the threading backend instead of multiprocessing

@MechCoder
Member Author

@ogrisel I tried cherry-picking your commit, but there are nasty merge conflicts in the cd_fast.c file.

@MechCoder
Member Author

Ah well. Fixed them. Sorry for the noise

@larsmans
Member

Merge conflicts in generated C files are a matter of calling Cython again (but I guess you figured that out).

@larsmans
Member

Can we close Olivier's old PR? Which one was it?

@MechCoder
Member Author

@larsmans #2980

# return if we reached desired tolerance
break
with nogil:
    for n_iter in range(max_iter):
Member Author

@larsmans A noob doubt: as far as I've read, you can use nogil only when no Python objects are involved. Then how come this works?

Member

The Cython compiler expands for i in range(n), and a few variant constructs, to a C for loop when i and n are properly typed. Check the generated C code to see this at work (it has the input Cython code in comments), or the HTML output from cython -a (click any line to see the corresponding C code).
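
For illustration, a minimal stand-alone sketch (not taken from cd_fast.pyx) of the kind of typed loop that Cython turns into plain C and can therefore run under nogil:

cdef double sum_of_squares(double* a, int n) nogil:
    """Sum of squares of a C array; no Python objects are touched."""
    cdef int i
    cdef double total = 0.0
    for i in range(n):  # with i and n typed, this compiles to a plain C for loop
        total += a[i] * a[i]
    return total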

@coveralls

Coverage Status

Coverage remained the same when pulling 41a0eff on MechCoder:gil-enet into feb6eb1 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 8d92bf9 on MechCoder:gil-enet into 3a4edd0 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 59b2f50 on MechCoder:gil-enet into 3a4edd0 on scikit-learn:master.

@MechCoder
Member Author

@larsmans I'm having trouble testing the function outside the scikit-learn directory because of the cblas.h dependency. I have a few questions that I have no idea about (this is my first time writing a setup.py file).

  1. When scikit-learn is installed, does CBLAS get installed too? I suppose not, because that is why the cblas folder is there in /src, right?
  2. If I need the cblas files externally, what do I need to do?
    This is my directory structure right now
    CyPractise
    • cblas
    • hello.pyx
    • setup.py

Contents of setup.py - https://gist.github.com/MechCoder/11349640
Contents of hello.pyx - https://gist.github.com/MechCoder/0cf4b7a55e1ac6d6c900
[I am still confused about what should be provided as the "include_dirs" and "libraries" arguments in setup.py.]

I might not require it right now, but I might need it later, and I should not be as clueless as I am right now. Thanks.

@larsmans
Member

larsmans commented May 1, 2014

  1. No. The build system looks for a CBLAS implementation. If it doesn't find one, it builds the parts it needs from sklearn/src/cblas, which contains the reference implementation from ATLAS, and links that into the Cython modules that actually use CBLAS.
  2. What you need is a working CBLAS implementation, e.g. sudo apt-get install libatlas-dev on Debian/Ubuntu (if memory serves me, that package got renamed a few times). That should put cblas.h in /usr/include and libcblas.so in /usr/lib. Note that BLAS alone is not enough: CBLAS is the C version, while libblas.so is the Fortran version with a hacky C API in blas.h.

(TODO for me: document this. I learned it by hacking on scikit-learn for several years.)
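
For reference, a minimal setup.py sketch for linking a stand-alone Cython module against a system CBLAS (this assumes cblas.h is in /usr/include and libcblas.so in /usr/lib, and the module name hello is just an example):

# setup.py: minimal sketch, assuming a system CBLAS is installed
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "hello",
    sources=["hello.pyx"],
    include_dirs=[numpy.get_include(), "/usr/include"],  # directory containing cblas.h
    library_dirs=["/usr/lib"],                            # directory containing libcblas.so
    libraries=["cblas"],                                  # link with -lcblas
)

setup(ext_modules=cythonize([ext]))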

@MechCoder
Member Author

@larsmans
Thanks for the quick reply.
Regarding your points:

  1. By the same logic, won't just copy-pasting the cblas directory anywhere else work, provided the setup.py file contains adequate information?
  2. It seems I do have cblas.h in /usr/include and libcblas.so in /usr/lib! I'm convinced that I did something wrong in the setup.py file (https://gist.github.com/MechCoder/11349640), especially with the libraries and/or include_dirs keywords.



cdef double max(int n, double* a) nogil:
    """np.max(np.abs(a))"""
Member

I suppose you mean np.max only here

@MechCoder
Member Author

I plotted some benchmarks, and it does seem that releasing the GIL offers a considerable advantage in general.

Code - https://gist.github.com/MechCoder/b18c519ce018dc47b8b3

[plot: withgil]
[plot: withnogil]

@agramfort @larsmans @ogrisel WDYT?

EDIT: the first one is with gil and the second one is after releasing it.

@MechCoder
Member Author

Would it be good to add these benchmarks to sklearn/benchmarks?

@agramfort
Member

Sorry, but I don't understand these plots. What are the colors and the y label? Is there a common ylim? Which is with and which is without the GIL?

@MechCoder
Member Author

@agramfort

y label: the time.
with gil: the implementation in master.
without gil: the implementation in this branch, after replacing the numpy calls with CBLAS.
colors: different values of n_alphas (5, 10, 100, 500).

Basically I have tested for different values of n_alphas and n_l1_ratios.

@agramfort
Member

I don't see the figure names on GitHub, so I just have to guess which is which.

You should fix the random_state in make_regression to make the results comparable.

If you do so, then it should always be faster with a higher number of alphas; for now it is still messy. The same is true for l1_ratios: the more you have, the more time it should take.

@MechCoder
Member Author

@agramfort Umm, surprisingly it does speed up for a higher number of alphas. Please run this small bit of code:

import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

n_alphas = [5, 10, 100, 500]
X, y = make_regression(random_state=np.random.RandomState(0))

times = []
for alpha in n_alphas:
    # refit with a different number of alphas each time and record the timing
    clf = ElasticNetCV(n_alphas=alpha)
    t = time.time()
    clf.fit(X, y)
    times.append(time.time() - t)

print(times)

I'm guessing that this might be due to the fact that Parallel has considerable overhead when run for less computationally expensive operations?

@agramfort
Member

Yes, it's the Parallel overhead. You should bench on mid-size problems too, with something like n_samples = 500 and n_features = 2000 or more.

@MechCoder
Member Author

Please ignore the previous comment.

In master:
l1_ratio - 1
n_alphas - 5: 0.7737178802490234
n_alphas - 10: 1.0900750160217285
n_alphas - 15: 3.704710006713867
n_alphas - 100: 6.64526891708374

l1_ratio - 5
n_alphas - 5: 6.46356987953186
n_alphas - 10: 7.89543890953064
n_alphas - 15: 22.287801027297974
n_alphas - 100: 37.26467323303223

In this branch:
l1_ratio - 1
n_alphas - 5: 0.7684640884399414
n_alphas - 10: 1.130321979522705
n_alphas - 15: 3.6571431159973145
n_alphas - 100: 6.659605979919434

l1_ratio - 5
n_alphas - 5: 6.491890907287598
n_alphas - 10: 7.760898113250732
n_alphas - 15: 22.719208002090454
n_alphas - 100: 37.510425090789795

Is the time gain enough? It also seems to slow down in certain cases.

@ogrisel
Member

ogrisel commented May 20, 2014

There is no gain to expect just from releasing the GIL, and your benchmarks seem to show this, although they would need to be repeated several times and displayed as bar plots with error bars (for instance the standard deviation) to confirm it.

Releasing the GIL just makes it possible to use the threading backend of joblib without having concurrent threads lock one another.

To use the threading backend instead of the multiprocessing backend for the ElasticNetCV class, you can add backend='threading' to the Parallel() call in the LinearModelCV class of coordinate_descent.py.
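
For concreteness, a toy sketch of what the backend switch looks like with joblib (illustrative only; the real change would go into the existing Parallel() call inside LinearModelCV):

# toy example: dispatch independent enet_path calls to threads instead of processes
from sklearn.externals.joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.linear_model import enet_path

X, y = make_regression(n_samples=500, n_features=2000, random_state=0)

results = Parallel(n_jobs=4, backend="threading")(
    delayed(enet_path)(X, y, l1_ratio=l1_ratio, n_alphas=50)
    for l1_ratio in [0.1, 0.5, 0.9])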

@MechCoder
Member Author

@ogrisel Is that likely to give a speed gain? The documentation of Parallel says that "threading" is a very low-overhead backend, but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

@ogrisel
Member

ogrisel commented May 20, 2014

@ogrisel Is that likely to give a speed gain? The documentation of Parallel says that "threading" is a very low-overhead backend, but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects.

Well, this is precisely why we want to release the GIL in the speed-critical part of the Cython code of the ElasticNetCV class: to be able to use the threading backend and get rid of the multiprocessing overhead. How many cores do you have on your box? It would be interesting to check with n_jobs in [1, 2, 4, 8], both for master with multiprocessing and for this branch with the threading backend enabled.

@ogrisel
Member

ogrisel commented May 20, 2014

You also need to perform enough CV iterations to be able to leverage the inner parallelization, e.g. cv=5 or cv=10.
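
Something along these lines should do (a sketch; the dataset size follows the earlier suggestion and the timings are obviously machine-dependent):

import time

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=500, n_features=2000, random_state=0)

for n_jobs in [1, 2, 4, 8]:
    clf = ElasticNetCV(cv=10, n_jobs=n_jobs)
    tic = time.time()
    clf.fit(X, y)
    print("n_jobs=%d: %.2f s" % (n_jobs, time.time() - tic))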

@agramfort
Member

+1 on @ogrisel's comments. You have a chance to see a benefit in the CV case with threading.

@MechCoder
Member Author

@ogrisel, @agramfort I benched with cv=10 for n_cores in [1, 2, 4]. It seems that threading (after releasing the GIL) does have an advantage.

Multiprocessing in this branch vs master (after releasing the GIL):
[plot: benching with multiprocessing]

Threading in this branch vs master (after releasing the GIL):
[plot: benching with threading]

Threading in this branch vs multiprocessing in master:
[plot: threading vs multiprocessing]

Do I get a go ahead to continue with the rest of this PR?

@GaelVaroquaux
Member

Is there a difference in memory consumption? Use a large dataset to see this, and maybe the memory-profiler.

I would expect a difference in memory consumption more than in speed.
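
One way to measure this with memory_profiler (a sketch: the dataset size is arbitrary, and include_children matters when comparing the multiprocessing backend against threading):

from memory_profiler import memory_usage
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=10000, n_features=1000, random_state=0)
clf = ElasticNetCV(cv=10, n_jobs=4)

# sample memory while fitting; include_children accounts for worker processes
mem = memory_usage((clf.fit, (X, y), {}), include_children=True)
print("peak memory: %.1f MiB" % max(mem))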

@MechCoder
Member Author

@ogrisel I pushed a commit that releases the GIL for the multi-task variant. Do you mind having a quick look before I bench?

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 75b48cd on MechCoder:gil-enet into 662fe00 on scikit-learn:master.

@MechCoder
Member Author

@ogrisel For the Gram variant, after the minor changes :)
In this branch

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 9.771090110143026 16.637080272038776 68.49941301345825 130.57437300682068
2 8.062992572784424 11.647780736287435 38.32584532101949 70.34560592969258
4 6.94205363591512 7.835591952006022 24.266483386357624 44.059422969818115
8 7.067240556081136 7.970296223958333 17.524919430414837 31.352068662643433

In master

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 9.817625602086386 16.563613414764404 68.58379038174947 131.2305454413096
2 8.250134706497192 11.784696340560913 38.548853953679405 70.61840438842773
4 7.1523966789245605 7.986500024795532 24.393589973449707 44.598851362864174
8 7.2429977258046465 8.108551740646362 17.582454045613606 31.46389929453532

@MechCoder
Member Author

@GaelVaroquaux @ogrisel @agramfort

Some benchmarks for the multi-task variant: X is 10000 x 1000, y is 10000 x 3.

In this branch

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 12.00 23.53 100.70 191.36
2 7.30 12.64 51.30 97.28
4 5.09 8.17 31.87 59.88
8 4.65 6.23 22.16 41.52

In master

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 11.79 23.13 98.85 191.05
2 7.29 12.98 50.52 96.71
4 5.04 8.27 31.55 59.35
8 4.61 6.36 22.54 40.94

Memory benefits:

In this branch:
[memory profile: multi_thread]

In master:
[memory profile: multi_process]

I've updated the PR from [WIP] to [MRG]

@MechCoder MechCoder changed the title [WIP] Releasing the GIL in the inner loop of coordinate descent [MRG] Releasing the GIL in the inner loop of coordinate descent Jun 12, 2014
# norm_cols_X = (np.asarray(X) ** 2).sum(axis=0)
for ii in range(n_features):
    for jj in range(n_samples):
        norm_cols_X[ii] += X[jj][ii]**2
Member

If you run cython -a sklearn/linear_model/cd_fast.pyx you will see that this line causes some Python wrapping overhead. I suppose this is caused by the double indexing X[jj][ii], which is probably not automatically optimized by Cython.

X[jj, ii] makes it possible to remove that overhead.

The same problem also appears at lines 640, 702 and 705.

Also: PEP8 convention would favor the X[jj, ii] ** 2 notation (with whitespaces around **).
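
In other words, the loop would become something like:

# norm_cols_X = (np.asarray(X) ** 2).sum(axis=0)
for ii in range(n_features):
    for jj in range(n_samples):
        norm_cols_X[ii] += X[jj, ii] ** 2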

@ogrisel
Member

ogrisel commented Jun 13, 2014

Once my last comment has been taken into account, +1 for merging. Thanks for the contrib and the extensive benchmarks @MechCoder!

@MechCoder
Member Author

@ogrisel Can I add my name at the top of the cd_fast.pyx file?

@ogrisel
Member

ogrisel commented Jun 13, 2014

@MechCoder sure.

@ogrisel
Member

ogrisel commented Jun 13, 2014

Can you please re-run the multitask threading benchmark to check the impact of this last optim?

@MechCoder
Member Author

Yes, running it.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 2c83678 on MechCoder:gil-enet into 662fe00 on scikit-learn:master.

@MechCoder
Member Author

@ogrisel It doesn't look like there is much difference. A slight improvement.
In this branch

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 11.6339577039 22.9356119633 98.3196704388 190.012413581
2 7.178495725 12.4504806201 50.3672692776 96.4518989722
4 4.99971763293 8.17943700155 31.2485917409 59.3953860601
8 4.64213999112 6.17599566778 21.8335360686 40.7463400364

In master

n_jobs n_alpha=5 n_alpha=10 n_alpha=50 n_alpha=100
1 11.6491670609 22.7019441128 98.3627891541 190.17698129
2 7.12558722496 12.2409980297 50.3131773472 96.3868060112
4 4.95815428098 8.07793005308 31.2579707305 58.8344629606
8 4.52167503039 6.06676959991 21.6907103856 40.7425248623

@MechCoder
Member Author

Ping @agramfort Please have a look :)

@ogrisel ogrisel changed the title [MRG] Releasing the GIL in the inner loop of coordinate descent [MRG+1] Releasing the GIL in the inner loop of coordinate descent Jun 14, 2014
@agramfort
Member

LGTM!

+1 for merge

@agramfort
Member

@ogrisel shall I merge?

ogrisel added a commit that referenced this pull request Jun 15, 2014
[MRG+1] Releasing the GIL in the inner loop of coordinate descent
@ogrisel ogrisel merged commit 1081160 into scikit-learn:master Jun 15, 2014
@ogrisel
Member

ogrisel commented Jun 15, 2014

I just did :)

@MechCoder MechCoder deleted the gil-enet branch June 16, 2014 04:18
@MechCoder
Member Author

@ogrisel @agramfort Thanks a lot for your help. Learnt a lot from this.

@MechCoder
Member Author

@ogrisel @agramfort One of my GSoC goals is to benchmark cyclic vs. random coordinate descent and to test whether it converges faster or not.

For cyclic descent, I need to permute the indices of the features once before every outer iteration, along the lines of this:

for n_iter in range(max_iter):
    rng = np.random.RandomState(n_iter)
    f_shuffle = rng.permutation(n_features)
    for i in f_shuffle:
         .... so on

Sorry to be a noob: since we have released the GIL and a numpy call would cause overhead, is there anything that would help me replace it with a C call? Thanks.

@ogrisel
Member

ogrisel commented Jun 17, 2014

No need to create a new RandomState instance at each iteration; you can just pass it as an argument to the Cython function and call rng = check_random_state(self.random_state) in the body of the fit method of the parent Python classes, if that is not already done this way. This is a very generic pattern throughout the library.

Now, to answer your question about calling rng.permutation in a nogil block, you have two solutions:

  • locally acquire the GIL just around the call to rng.permutation:
        with gil:
             f_shuffle = rng.permutation(n_features)

  • implement the permutation GIL-free, using a pure-Cython RNG as the tree code does.

The first solution may introduce some significant lock contention on very small problems, so ideally we should try to implement the GIL-free solution. That would probably involve refactoring the Cython source of the project to factor the pure-Cython RNG out of the tree code to make it more reusable.

@MechCoder
Member Author

@ogrisel I had a quick look at the Fisher-Yates algorithm and the tree code. Is the algorithm used a special adaptation of Fisher-Yates? I ask because I see comments like:

    # skip the CPU intensive evaluation of the impurity criterion for
    # features that were already detected as constant 

In that case, we might need to change it a bit for the linear models.

@arjoly
Member

arjoly commented Jun 18, 2014

In the tree code, there are specific changes to take constant features into account in a split, but I don't think you need that sort of thing for linear models. I think what @ogrisel suggests is to factorize the RNG (our_rand_r, rand_double and rand_int).

@ogrisel
Member

ogrisel commented Jun 18, 2014

I think what @ogrisel suggests is to factorize the RNG (our_rand_r, rand_double and rand_int)

Yes exactly. And then implement your own Fisher-Yates loop in the stochastic CD loop using that refactored rand_int function.
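
For example, a GIL-free Fisher-Yates shuffle could look roughly like this (a sketch: rand_int and the UINT32_t seed type stand for the refactored tree-code helpers, whose exact signatures would be settled during that refactoring):

cdef void shuffle_features(int* f_shuffle, int n_features,
                           UINT32_t* rand_r_state) nogil:
    """Fisher-Yates shuffle of the feature indices; no Python objects needed."""
    cdef int i, j, tmp
    for i in range(n_features):
        f_shuffle[i] = i
    for i in range(n_features - 1, 0, -1):
        j = rand_int(i + 1, rand_r_state)  # uniform draw from [0, i]
        tmp = f_shuffle[i]
        f_shuffle[i] = f_shuffle[j]
        f_shuffle[j] = tmp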
