
[MRG]: Use coordinate_descent_gram when precompute is True | auto #3220

Closed
wants to merge 5 commits into scikit-learn:master from MechCoder:fix_precompute

Conversation

MechCoder
Member

This PR does the following

1] Bench to show that precompute="auto" offers only a very slight advantage.
2] Remove precompute from MultiTaskElasticNet/Lasso CV
3] Use gram variant when precompute="auto" or True

@MechCoder
Member Author

Also, I think there is a slight mistake in the docstring in cd_fast.

    (1/2) * w^T Q w - q^T w + alpha * norm(w, 1) + (beta/2) * norm(w, 2)^2

    which amount to the Elastic-Net problem when:
    Q = X^T X (Gram matrix)
    q = X^T y

Should there be an extra norm(y, 2)^2 term?

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling b8b21a3 on MechCoder:fix_precompute into 92c4308 on scikit-learn:master.

            tol, positive)
    else:
        model = cd_fast.enet_coordinate_descent(
            coef_, l1_reg, l2_reg, X, y, max_iter, tol, positive)
Member


LGTM

do you confirm a speed up with n_samples >> n_features?

@agramfort
Member

Should there be an extra norm(y, 2)^2 term?

this term is constant so we don't care
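
Expanding the squared loss with the definitions above (Q = X^T X, q = X^T y) makes the constant explicit:

\frac{1}{2}\|y - Xw\|_2^2 = \frac{1}{2} w^T Q w - q^T w + \frac{1}{2}\|y\|_2^2

so the Gram form only drops the constant \frac{1}{2}\|y\|_2^2 term, which does not change the minimizer.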

@MechCoder
Member Author

@agramfort It seems to slow down for me.

In [1]: from sklearn.datasets import make_regression

In [2]: X, y = make_regression(n_samples=5000, n_features=100)

In [3]: from sklearn.linear_model import *

In [4]: clf = ElasticNet(precompute=False)

In [5]: %timeit clf.fit(X, y)
10 loops, best of 3: 47.9 ms per loop

In [6]: clf = ElasticNet(precompute="auto")

In [7]: %timeit clf.fit(X, y)
1 loops, best of 3: 236 ms per loop

@MechCoder MechCoder changed the title FIX: Use coordinate_descent_gram when precompute is True | auto [MRG+1]: Use coordinate_descent_gram when precompute is True | auto May 29, 2014
@agramfort
Member

This is weird, although possible. Is the dual gap the same at the end? What if n_samples is even bigger and n_features smaller?

@MechCoder
Member Author

@agramfort My laptop does give weird results sometimes, but I've tested it multiple times. Would you be able to check on your machine? I'll test the remaining cases.

@MechCoder
Member Author

I've changed the default arguments of precompute, based on the benchmarks run on the Rackspace instance that @ogrisel gave me:

In [28]: X, y = make_regression(n_samples=10000, n_features=50)

In [29]: clf = ElasticNet(precompute="auto")

In [30]: %timeit clf.fit(X, y)
10 loops, best of 3: 42.4 ms per loop

In [31]: clf = ElasticNet(precompute=False)

In [32]: %timeit clf.fit(X, y)
100 loops, best of 3: 11.2 ms per loop

In [33]: clf = ElasticNetCV(precompute=False, cv=10, n_jobs=-1)

In [34]: %timeit clf.fit(X, y)
1 loops, best of 3: 1.28 s per loop

In [35]: clf = ElasticNetCV(precompute=" auto", cv=10, n_jobs=-1)

In [36]: %timeit clf.fit(X, y)
1 loops, best of 3: 1.41 s per loop

@agramfort Please merge this, if you have no objections.

@MechCoder MechCoder changed the title [MRG+1]: Use coordinate_descent_gram when precompute is True | auto [MRG]: Use coordinate_descent_gram when precompute is True | auto May 31, 2014
@agramfort
Member

I confirm that 'auto' is not doing the right thing when the model is trained with a single alpha. The overhead of computing the Gram matrix kills the benefit of fitting using the Gram.

Now, is the conclusion still true when y is 2d and many targets are passed?

The what's new page API section will have to be updated if we change the default arguments.

@MechCoder
Member Author

@agramfort I've updated the whats_new page.

is the conclusion still true when y is 2d with many targets are passed?

The Gram coordinate descent variant doesn't work for 2d y.

I've also tested it for Lasso and LassoCV and it does slow down.

@MechCoder
Member Author

By the way, the Error in Travis has nothing to do with this PR. Seems to be a timeout.

@MechCoder
Member Author

@agramfort The only time I think it has a very slight advantage is when cv=3 or 4, and even then only on the order of 0.0x seconds. When cv is larger, since we need to compute the Gram matrix repeatedly for multiple folds, the advantage is lost again.

Do you have any specific case, that you want me to bench?

@ogrisel
Member

ogrisel commented Jun 2, 2014

By the way, the Error in Travis has nothing to do with this PR. Seems to be a timeout.

No, it is caused by a failing doctest that needs to take the change of this PR into account:

https://travis-ci.org/scikit-learn/scikit-learn/jobs/26520998#L5635

@ogrisel
Member

ogrisel commented Jun 2, 2014

In #3220 (comment), the following line should have caused a ValueError:

In [35]: clf = ElasticNetCV(precompute=" auto", cv=10, n_jobs=-1)

@ogrisel
Member

ogrisel commented Jun 2, 2014

Running this branch on my box:

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.linear_model import ElasticNetCV
>>> X, y = make_regression(n_samples=10000, n_features=50)
>>> %timeit ElasticNet(precompute=False).fit(X, y)
100 loops, best of 3: 9.57 ms per loop
>>> %timeit ElasticNet(precompute=True).fit(X, y)
10 loops, best of 3: 27.1 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=8).fit(X, y)
1 loops, best of 3: 1.48 s per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=8).fit(X, y)
1 loops, best of 3: 405 ms per loop

I used %doctest_mode to make it easier to copy and paste the lines while ignoring the >>> prompts.

So Gram pre-computation seems to be benefiting the CV variant while not benefiting the original model with fixed alpha. This is rather confusing to me.

precompute=True is still slower even with lower values of alpha:

>>> %timeit ElasticNet(alpha=0.00001, precompute=False).fit(X, y)
100 loops, best of 3: 12.2 ms per loop
>>> %timeit ElasticNet(alpha=0.00001, precompute=True).fit(X, y)
10 loops, best of 3: 28 ms per loop

For wide problems (n_features >> n_samples), precomputing the Gram matrix is even worse (but I think this is expected):

>>> X, y = make_regression(n_samples=50, n_features=10000)
>>> %timeit ElasticNet(alpha=0.0001, precompute=False).fit(X, y)
100 loops, best of 3: 7.36 ms per loop
>>> %timeit ElasticNet(alpha=0.0001, precompute=True).fit(X, y)
1 loops, best of 3: 2.28 s per loop

I get similar results with the CV variant:

>>> X, y = make_regression(n_samples=50, n_features=1000)
>>> %timeit clf = ElasticNetCV(cv=5, precompute=False, n_jobs=8).fit(X, y)
/volatile/ogrisel/code/scikit-learn/sklearn/linear_model/coordinate_descent.py:487: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations
  ConvergenceWarning)
1 loops, best of 3: 613 ms per loop
>>> %timeit clf = ElasticNetCV(cv=5, precompute=True, n_jobs=8).fit(X, y)
1 loops, best of 3: 5.02 s per loop

Both models find the same optimal value for alpha_.

@MechCoder
Member Author

@ogrisel I get slightly varied results with 4 cores.

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.linear_model import ElasticNetCV
>>> X, y = make_regression(n_samples=10000, n_features=50) 
>>> %timeit ElasticNet(precompute=False).fit(X, y)
10 loops, best of 3: 25.9 ms per loop
>>> %timeit ElasticNet(precompute=True).fit(X, y)
10 loops, best of 3: 84.8 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=-1).fit(X, y)
1 loops, best of 3: 5.74 s per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=-1).fit(X, y)
1 loops, best of 3: 6.1 s per loop

@MechCoder
Member Author

@ogrisel From the Rackspace cloud, my box, and your benches, I think we can be convinced that (please correct me if I'm wrong):

  1. For ElasticNet and Lasso, setting precompute=False seems to be the way to go.
  2. I'm not really sure about ElasticNetCV and LassoCV; for me, neither seems to make much of a difference (w.r.t. timing), unlike your benches. Would you be able to verify those cases, i.e. setting precompute to False and "auto"?

Whatever the default may be, I would like to get this PR merged quickly so that I can continue work on the other PR.

@ogrisel
Member

ogrisel commented Jun 2, 2014

The best strategy is probably data dependent, but what those experiments say is that the current heuristic implemented when precompute="auto" is probably not very good. It would be worth trying on more realistic datasets, but I don't have a good one in mind.

Maybe @agramfort or @mblondel could suggest ideas? In my opinion ElasticNet (with the coordinate descent solver) is typically used with high dimensional, noisy data with a potentially large number of irrelevant features.

Maybe you can try to check whether precompute=False is always the best strategy on noisier data such as: make_regression(n_samples=1000, n_features=50, noise=0.1, n_informative=10) and make_regression(n_samples=50, n_features=1000, noise=0.1, n_informative=10).

You can also try to see the impact of correlated features with data generated with effective_rank=10 for instance.
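
For instance, a rough, throwaway script along those lines (a sanity check rather than a rigorous benchmark; tweak the sizes and alpha as needed):

from time import time

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Compare precompute=True vs precompute=False on noisy and correlated data.
settings = [
    dict(n_samples=1000, n_features=50, noise=0.1, n_informative=10),
    dict(n_samples=50, n_features=1000, noise=0.1, n_informative=10),
    dict(n_samples=1000, n_features=50, noise=0.1, effective_rank=10),
]
for params in settings:
    X, y = make_regression(random_state=0, **params)
    for precompute in (False, True):
        clf = ElasticNet(precompute=precompute)
        tic = time()
        clf.fit(X, y)
        print("%s precompute=%s: %.1f ms"
              % (params, precompute, 1e3 * (time() - tic)))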

@ogrisel
Member

ogrisel commented Jun 2, 2014

Also please try to address: #3220 (comment)

@ogrisel
Member

ogrisel commented Jun 2, 2014

And we can separate the two issues currently addressed in this PR:

  1. enet_coordinate_descent_gram is unused in enet_path even when precompute=True
  2. what is the best behavior for precompute="auto".

Item 1 should not be controversial. Item 2 probably requires more investigation.

@MechCoder
Member Author

@ogrisel On a side note, is it possible that I'm seeing a drastic slowdown compared to your benchmarks because of the way I installed scikit-learn?

I installed the dependencies.

sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev libatlas3-base

and then just did python setup.py. Is there any reason why this would slow it down?

@jaidevd
Contributor

jaidevd commented Jun 2, 2014

@MechCoder @ogrisel

I tried to see if precompute=False is the dominant strategy for noisy data, and this was the result:

>>> X, y = make_regression(n_samples=1000,n_features=50, noise=0.1, n_informative=10)
>>> %timeit ElasticNet(precompute=False).fit(X,y)
1000 loops, best of 3: 915 µs per loop
>>> %timeit ElasticNet(precompute=True).fit(X,y)
1000 loops, best of 3: 1.01 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=8).fit(X,y)
1 loops, best of 3: 572 ms per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=8).fit(X,y)
1 loops, best of 3: 367 ms per loop
>>> X, y = make_regression(n_samples=50,n_features=1000, noise=0.1, n_informative=10)
>>> %timeit ElasticNet(precompute=False).fit(X,y)                               
100 loops, best of 3: 14.8 ms per loop
>>> %timeit ElasticNet(precompute=True).fit(X,y)
10 loops, best of 3: 99.5 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=8).fit(X,y)            
1 loops, best of 3: 1.36 s per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=8).fit(X,y)
1 loops, best of 3: 6.91 s per loop

Looks like that is indeed the case, except in the case of ElasticNetCV when n_samples >> n_features.

@ogrisel Is there some better test than simply %timeit to ascertain that this is the case?

@MechCoder
Member Author

@jaidevd @ogrisel Thanks. I'm getting similar results.

@agramfort There seems to be just a small margin of speed gain in the case when n_samples >> n_features. What more can we do to get this verified?

@ogrisel
Member

ogrisel commented Jun 3, 2014

Is there any reason why this would slow it down?

No. You could rebuild ATLAS to tune it to your architecture (e.g. see: http://danielnouri.org/notes/2012/12/19/libblas-and-liblapack-issues-and-speed,-with-scipy-and-ubuntu/ ), but it's more likely that the absolute speed difference between our setups is explained by hardware (e.g. the size of the CPU caches) rather than software in this case. In any case, you should not focus on absolute perf numbers but rather on the relative performance between methods on the same hardware.

@ogrisel Is there some better test than simply %timeit to ascertain that this is the case?

timeit is fine. We just need to check that the standard deviation across runs is low enough. If not, it's worth benchmarking on larger problems.
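
For example, plain timeit.repeat gives the spread directly (just a sketch, on an arbitrary dataset):

import timeit
import numpy as np

# Repeat the fit several times and look at the spread, not just the best run.
setup = """
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
X, y = make_regression(n_samples=10000, n_features=50, random_state=0)
clf = ElasticNet(precompute=False)
"""
runs = np.array(timeit.repeat("clf.fit(X, y)", setup=setup, repeat=5, number=10)) / 10
print("mean %.1f ms, std %.1f ms" % (1e3 * runs.mean(), 1e3 * runs.std()))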

@ogrisel
Member

ogrisel commented Jun 3, 2014

There seems to be just a small margin of speed gain in the case when n_samples >> n_features. What more can we do to get this verified?

I played a bit more with noisy data generated with make_regression and I could never get the precompute=True version to be significantly faster than the precompute=False version. @agramfort any data in mind?

@MechCoder
Member Author

The build passes now.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 9a31569 on MechCoder:fix_precompute into 0a61119 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Jun 3, 2014

@MechCoder can you please address #3220 (comment)? Unexpected values for the precompute parameter should raise a ValueError instead of silently falling back to the "auto" mode.
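
Something along these lines would do (a sketch only; _check_precompute is a hypothetical helper name, and an explicit Gram matrix passed as an array should of course stay accepted):

import numpy as np

def _check_precompute(precompute):
    # Hypothetical validation helper: an ndarray is an explicit Gram matrix.
    if isinstance(precompute, np.ndarray):
        return precompute
    if precompute not in (True, False, 'auto'):
        raise ValueError("precompute should be True, False, 'auto' or "
                         "array-like, got %r" % (precompute,))
    return precompute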

a] Raise ValueError for invalid precompute
b] Remove precompute for MultiTask ENet/LassoCV
@MechCoder
Member Author

@ogrisel Fixed.

I also removed precompute from MultiTaskElasticNet / Lasso CV since it is not being used.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling a5fe51f on MechCoder:fix_precompute into 0a61119 on scikit-learn:master.

@agramfort
Member

ok. Looks like the deal is done. LGTM. @ogrisel it is difficult to find relevant datasets with n_samples >> n_features. The Gram trick is however useful for dictionary learning, cf. dict_learning.py.

+1 for merge on my side.

@ogrisel feel free to merge tomorrow if you're happy.

@ogrisel
Member

ogrisel commented Jun 4, 2014

@ogrisel feel free to merge tomorrow if you're happy.

I am not that happy: I don't think we should keep an "auto" mode that is never useful and not used by default: I would rather deprecate it explicitly and add tests to check that the deprecation warnings work.

@ogrisel
Member

ogrisel commented Jun 4, 2014

2] Remove precompute from MultiTaskElasticNet/Lasso CV

Why was the precompute option removed for the MultiTaskElasticNet/Lasso CV classes? Is it broken, does setting precompute=True cause a crash on those guys?

In any case we cannot change the public API (removing parameters) without going through a deprecation cycle.

@ogrisel
Member

ogrisel commented Jun 4, 2014

3] Use gram variant when precompute="auto" or True

The initial goal of precompute="auto" was to automatically determine whether or not to precompute the Gram matrix based on the n_samples < n_features test as performed in the sklearn.linear_model.base._pre_fit helper function: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L400

We should respect that contract.

If the n_samples < n_features heuristic test is a bad test, as seems to be demonstrated by the benchmarks in this PR, then we might decide to come up with a better heuristic or decide to deprecate the precompute="auto" option entirely.

In any case we should not silently change the behavior of precompute="auto" to precompute=True without issuing a deprecation warning, updating all the docstrings to remove any reference to the auto mode, and refactoring the _pre_fit helper method.
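
For reference, a minimal sketch of what respecting that contract looks like (_resolve_precompute is a hypothetical name; the real logic lives in _pre_fit and enet_path):

import numpy as np

def _resolve_precompute(precompute, n_samples, n_features):
    # Hypothetical resolution of the 'auto' mode (assuming precompute is
    # one of True, False, 'auto'): precompute the Gram matrix only when
    # it is small, i.e. when n_samples > n_features.
    if precompute == 'auto':
        return n_samples > n_features
    return bool(precompute)

# If this resolves to True, enet_path would hand the solver
# Q = np.dot(X.T, X) and q = np.dot(X.T, y) instead of X and y.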

@ogrisel
Member

ogrisel commented Jun 4, 2014

Also the number of targets might have an impact on whether or not precompute=True is faster than precompute=False. We have not benchmarked that either.

@MechCoder
Member Author

Why was the precompute option removed for the MultiTaskElasticNet/Lasso CV classes? Is it broken, does setting precompute=True cause a crash on those guys?

MultiTaskElasticNet and MultiTaskLasso CV use a different objective function than ElasticNet and Lasso CV. So the existing cd_fast.enet_coordinate_descent_gram is valid only for 1-dimensional y (this is evident from the function signature). While writing this I did not notice that precompute was unused.

In any case we cannot change the public API (removing parameters) without going through a deprecation cycle.

Is this true, even if it has not been a part of a public release? These were added by me recently.

The initial goal of precompute="auto" was to automatically determine whether or not to precompute the Gram matrix based on the n_samples < n_features test as performed in the sklearn.linear_model.base._pre_fit helper function

Err, yes, I had meant that. Sorry for being ambiguous. I have updated my PR to use the Gram variant when precompute="auto" and n_samples > n_features, or when precompute is True.

Also the number of targets might have an impact on whether or not precompute=True is faster than precompute=False

As mentioned before, ElasticNet and Lasso CV raise errors for multiple targets. If we need to fit multiple targets, we either need to do

models = [clf.fit(X, y[:, i]) for i in range(n_targets)]

and then use these individually, or directly use MultiTaskElasticNet or MultiTaskLasso (CV), which do not have a Gram variant.
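
i.e., roughly something like this (just a sketch, assuming a 2d y; clone is used so that each target gets an independently fitted estimator):

from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=20, n_targets=3, random_state=0)
base = ElasticNetCV(precompute=False)
# One independently fitted CV model per target column.
models = [clone(base).fit(X, y[:, i]) for i in range(y.shape[1])]
alphas = [m.alpha_ for m in models]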

By the way, I'm already much behind according to my GSoC timeline. Is that ok?

@GaelVaroquaux
Member

On Wed, Jun 04, 2014 at 07:48:52AM -0700, Manoj Kumar wrote:

In any case we cannot change the public API (removing parameters) without
going through a deprecation cycle.

Is this true, even if it has not been a part of a public release? These were
added by me recently.

If it hasn't been released, it's not a problem.

By the way, I'm already much behind according to my GSoC timeline.

Yes, it's not the end of the world, but I agree that we need to keep in
mind the big picture, and therefore consider prioritization. Let's
discuss this in another thread.

@ogrisel
Member

ogrisel commented Jun 5, 2014

By the way, I'm already much behind according to my GSoC timeline.

We won't hurry any merge to master because of the GSoC timeline. Especially as we are about to cut the 0.15 branch.

I need to find more time to review those changes in deeper details but I don't have the bandwidth to do so now unfortunately.

@ogrisel
Member

ogrisel commented Jun 5, 2014

@MechCoder could you please split this PR into independent PRs for:

  1. the changes around switching the default value of precompute from 'auto' to False
  2. the actual fix to call cd_fast.enet_coordinate_descent_gram in the enet_path function when precompute is True, which AFAIK is the only change required to benchmark releasing the GIL in the Cython file
  3. the changes that remove the precompute param from the multitask models.

It seems to me that those 3 changes are independent of one another. I am not satisfied with the current state of item 1, so it will likely take longer to merge, while the other 2 items should be less controversial as they are seemingly bugfixes.

@MechCoder
Member Author

@ogrisel I've opened 3 PRs: #3247, #3248 and #3249.

Closed in favor of them.

@MechCoder MechCoder closed this Jun 5, 2014
@MechCoder MechCoder deleted the fix_precompute branch June 5, 2014 10:52
@ogrisel
Member

ogrisel commented Jun 5, 2014

Thanks!
