
[MRG]: Use coordinate_descent_gram when precompute is True | auto #3220

Closed
wants to merge 5 commits into scikit-learn:master from MechCoder:fix_precompute

Conversation

MechCoder
Member

This PR does the following

1] Bench to show that precompute="auto" offers only a very slight advantage.
2] Remove precompute from MultiTaskElasticNet/Lasso CV
3] Use gram variant when precompute="auto" or True

@MechCoder
Member Author

Also, I think there is a slight mistake in the docstring in cd_fast.

    (1/2) * w^T Q w - q^T w + alpha * norm(w, 1) + (beta/2) * norm(w, 2)^2

    which amount to the Elastic-Net problem when:
    Q = X^T X (Gram matrix)
    q = X^T y

Should there be an extra norm(y, 2)^2 term?

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling b8b21a3 on MechCoder:fix_precompute into 92c4308 on scikit-learn:master.

            tol, positive)
    else:
        model = cd_fast.enet_coordinate_descent(
            coef_, l1_reg, l2_reg, X, y, max_iter, tol, positive)
Member


LGTM

do you confirm a speed up with n_samples >> n_features?

@agramfort
Member

Should there be an extra norm(y, 2)^2 term?

this term is constant so we don't care
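
Expanding the squared loss with the definitions above (Q = X^T X, q = X^T y) makes the constant explicit:

\frac{1}{2}\|y - Xw\|_2^2 = \frac{1}{2} w^T Q w - q^T w + \frac{1}{2}\|y\|_2^2

so the Gram form only drops the constant \frac{1}{2}\|y\|_2^2 term, which does not change the minimizer.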

@MechCoder
Member Author

@agramfort It seems to slow down for me.

In [1]: from sklearn.datasets import make_regression

In [2]: X, y = make_regression(n_samples=5000, n_features=100)

In [3]: from sklearn.linear_model import *

In [4]: clf = ElasticNet(precompute=False)

In [5]: %timeit clf.fit(X, y)
10 loops, best of 3: 47.9 ms per loop

In [6]: clf = ElasticNet(precompute="auto")

In [7]: %timeit clf.fit(X, y)
1 loops, best of 3: 236 ms per loop

@MechCoder MechCoder changed the title FIX: Use coordinate_descent_gram when precompute is True | auto [MRG+1]: Use coordinate_descent_gram when precompute is True | auto May 29, 2014
@agramfort
Member

This is weird, although possible. Is the dual gap the same at the end? What if n_samples is even bigger and n_features smaller?

@MechCoder
Member Author

@agramfort My laptop does give weird results sometimes, but I've tested it multiple times. Would you be able to check on your machine? I'll test the remaining cases.

@MechCoder
Member Author

I've changed the default arguments of precompute, based on the benchmarks run on the Rackspace instance that @ogrisel gave me:

In [28]: X, y = make_regression(n_samples=10000, n_features=50)

In [29]: clf = ElasticNet(precompute="auto")

In [30]: %timeit clf.fit(X, y)
10 loops, best of 3: 42.4 ms per loop

In [31]: clf = ElasticNet(precompute=False)

In [32]: %timeit clf.fit(X, y)
100 loops, best of 3: 11.2 ms per loop

In [33]: clf = ElasticNetCV(precompute=False, cv=10, n_jobs=-1)

In [34]: %timeit clf.fit(X, y)
1 loops, best of 3: 1.28 s per loop

In [35]: clf = ElasticNetCV(precompute=" auto", cv=10, n_jobs=-1)

In [36]: %timeit clf.fit(X, y)
1 loops, best of 3: 1.41 s per loop

@agramfort Please merge this, if you have no objections.

@MechCoder MechCoder changed the title [MRG+1]: Use coordinate_descent_gram when precompute is True | auto [MRG]: Use coordinate_descent_gram when precompute is True | auto May 31, 2014
@agramfort
Member

I confirm that 'auto' is not doing the right thing when the model is trained with a single alpha. The overhead of computing the Gram matrix kills the benefit of fitting using the Gram.

Now, is the conclusion still true when y is 2d and many targets are passed?

The what's new page API section will have to be updated if we change the default arguments.

@MechCoder
Member Author

@agramfort I've updated the whats_new page.

is the conclusion still true when y is 2d with many targets are passed?

The Gram coordinate descent variant doesn't work for 2d y.

I've also tested it for Lasso and LassoCV and it does slow down.

@MechCoder
Member Author

By the way, the Error in Travis has nothing to do with this PR. Seems to be a timeout.

@MechCoder
Member Author

@agramfort The only time I think it has a very slight advantage is when cv=3 or 4, and even then only on the order of 0.0x seconds. When cv is larger, since we need to compute the Gram matrix repeatedly for multiple folds, the advantage is lost again.

Do you have any specific case, that you want me to bench?

@ogrisel
Member

ogrisel commented Jun 2, 2014

By the way, the Error in Travis has nothing to do with this PR. Seems to be a timeout.

No, it is caused by a failing doctest that needs to take the change of this PR into account:

https://travis-ci.org/scikit-learn/scikit-learn/jobs/26520998#L5635

@ogrisel
Member

ogrisel commented Jun 2, 2014

In #3220 (comment), the following line should have caused a ValueError:

In [35]: clf = ElasticNetCV(precompute=" auto", cv=10, n_jobs=-1)

@ogrisel
Member

ogrisel commented Jun 2, 2014

Running this branch on my box:

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.linear_model import ElasticNetCV
>>> X, y = make_regression(n_samples=10000, n_features=50)
>>> %timeit ElasticNet(precompute=False).fit(X, y)
100 loops, best of 3: 9.57 ms per loop
>>> %timeit ElasticNet(precompute=True).fit(X, y)
10 loops, best of 3: 27.1 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=8).fit(X, y)
1 loops, best of 3: 1.48 s per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=8).fit(X, y)
1 loops, best of 3: 405 ms per loop

I used %doctest_mode to make it easier to copy and paste the lines while ignoring the >>> prompts.

So Gram pre-computation seems to be benefiting the CV variant while not benefiting the original model with fixed alpha. This is rather confusing to me.

precompute=True is still slower even with lower values of alpha:

>>> %timeit ElasticNet(alpha=0.00001, precompute=False).fit(X, y)
100 loops, best of 3: 12.2 ms per loop
>>> %timeit ElasticNet(alpha=0.00001, precompute=True).fit(X, y)
10 loops, best of 3: 28 ms per loop

For wide problems (n_features >> n_samples), precomputing the Gram matrix is even worse (but I think this is expected):

>>> X, y = make_regression(n_samples=50, n_features=10000)
>>> %timeit ElasticNet(alpha=0.0001, precompute=False).fit(X, y)
100 loops, best of 3: 7.36 ms per loop
>>> %timeit ElasticNet(alpha=0.0001, precompute=True).fit(X, y)
1 loops, best of 3: 2.28 s per loop

I get similar results with the CV variant:

>>> X, y = make_regression(n_samples=50, n_features=1000)
>>> %timeit clf = ElasticNetCV(cv=5, precompute=False, n_jobs=8).fit(X, y)
/volatile/ogrisel/code/scikit-learn/sklearn/linear_model/coordinate_descent.py:487: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations
  ConvergenceWarning)
1 loops, best of 3: 613 ms per loop
>>> %timeit clf = ElasticNetCV(cv=5, precompute=True, n_jobs=8).fit(X, y)
1 loops, best of 3: 5.02 s per loop

Both models find the same optimal value for alpha_.

@MechCoder
Member Author

@ogrisel I get slightly varied results with 4 cores.

>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.linear_model import ElasticNetCV
>>> X, y = make_regression(n_samples=10000, n_features=50) 
>>> %timeit ElasticNet(precompute=False).fit(X, y)
10 loops, best of 3: 25.9 ms per loop
>>> %timeit ElasticNet(precompute=True).fit(X, y)
10 loops, best of 3: 84.8 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=-1).fit(X, y)
1 loops, best of 3: 5.74 s per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=-1).fit(X, y)
1 loops, best of 3: 6.1 s per loop

@MechCoder
Member Author

@ogrisel From the Rackspace cloud, my box, and your benches, I think we can be convinced that (please correct me if I'm wrong):

  1. For ElasticNet and Lasso, setting precompute=False seems to be the way to go.
  2. I'm not really sure about ElasticNetCV and LassoCV; for me, neither seems to make much of a difference (w.r.t. timing), unlike your benches. Would you be able to verify those cases, i.e. setting precompute to False and "auto"?

Whatever the default may be, I would like to get this PR merged quickly so that I can continue work on the other PR.

@ogrisel
Member

ogrisel commented Jun 2, 2014

The best strategy is probably data dependent, but what those experiments say is that the current heuristic implemented when precompute="auto" is probably not very good. It would be worth trying on more realistic datasets, but I don't have a good one in mind.

Maybe @agramfort or @mblondel could suggest ideas? In my opinion ElasticNet (with the coordinate descent solver) is typically used with high dimensional, noisy data with a potentially large number of irrelevant features.

Maybe you can try to check whether precompute=False is always the best strategy on noisier data such as: make_regression(n_samples=1000, n_features=50, noise=0.1, n_informative=10) and make_regression(n_samples=50, n_features=1000, noise=0.1, n_informative=10).

You can also try to see the impact of correlated features with data generated with effective_rank=10 for instance.
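
For instance, a rough, throwaway script along those lines (a sanity check rather than a rigorous benchmark; tweak the sizes and alpha as needed):

from time import time

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Compare precompute=True vs precompute=False on noisy and correlated data.
settings = [
    dict(n_samples=1000, n_features=50, noise=0.1, n_informative=10),
    dict(n_samples=50, n_features=1000, noise=0.1, n_informative=10),
    dict(n_samples=1000, n_features=50, noise=0.1, effective_rank=10),
]
for params in settings:
    X, y = make_regression(random_state=0, **params)
    for precompute in (False, True):
        clf = ElasticNet(precompute=precompute)
        tic = time()
        clf.fit(X, y)
        print("%s precompute=%s: %.1f ms"
              % (params, precompute, 1e3 * (time() - tic)))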

@ogrisel
Member

ogrisel commented Jun 2, 2014

Also please try to address: #3220 (comment)

@ogrisel
Member

ogrisel commented Jun 2, 2014

And we can separate the two issues currently addressed in this PR:

  1. enet_coordinate_descent_gram is unused in enet_path even when precompute=True
  2. what is the best behavior for precompute="auto".

Item 1 should not be controversial. Item 2 probably requires more investigation.

@MechCoder
Member Author

@ogrisel On a side note, is it possible that I'm seeing a drastic slowdown compared to your benchmarks because of the way I installed scikit-learn?

I installed the dependencies.

sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev libatlas3-base

and then just did python setup.py. Is there any reason why this would slow it down?

@jaidevd
Contributor

jaidevd commented Jun 2, 2014

@MechCoder @ogrisel

I tried to see if precompute=False is the dominant strategy for noisy data, and this was the result:

>>> X, y = make_regression(n_samples=1000,n_features=50, noise=0.1, n_informative=10)
>>> %timeit ElasticNet(precompute=False).fit(X,y)
1000 loops, best of 3: 915 µs per loop
>>> %timeit ElasticNet(precompute=True).fit(X,y)
1000 loops, best of 3: 1.01 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=8).fit(X,y)
1 loops, best of 3: 572 ms per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=8).fit(X,y)
1 loops, best of 3: 367 ms per loop
>>> X, y = make_regression(n_samples=50,n_features=1000, noise=0.1, n_informative=10)
>>> %timeit ElasticNet(precompute=False).fit(X,y)                               
100 loops, best of 3: 14.8 ms per loop
>>> %timeit ElasticNet(precompute=True).fit(X,y)
10 loops, best of 3: 99.5 ms per loop
>>> %timeit ElasticNetCV(precompute=False, cv=10, n_jobs=8).fit(X,y)            
1 loops, best of 3: 1.36 s per loop
>>> %timeit ElasticNetCV(precompute=True, cv=10, n_jobs=8).fit(X,y)
1 loops, best of 3: 6.91 s per loop

Looks like that is indeed the case, except in the case of ElasticNetCV when n_samples >> n_features.

@ogrisel Is there some better test than simply %timeit to ascertain that this is the case?

@MechCoder
Member Author

@jaidevd @ogrisel Thanks. I'm getting similar results.

@agramfort There seems to be just a small margin of speed gain in the case when n_samples >> n_features. What more can we do to get this verified?

@ogrisel
Member

ogrisel commented Jun 3, 2014

Is there any reason why this would slow it down?

No. You could rebuild ATLAS to tune it to your architecture (e.g. see: http://danielnouri.org/notes/2012/12/19/libblas-and-liblapack-issues-and-speed,-with-scipy-and-ubuntu/ ), but it's more likely that the absolute speed difference between our setups is explained by hardware (e.g. the size of the CPU caches) rather than software in this case. In any case, you should not focus on absolute perf numbers but rather on the relative performance between methods on the same hardware.

@ogrisel Is there some better test than simply %timeit to ascertain that this is the case?

timeit is fine. We just need to check that the standard deviation across runs is low enough. If not, it's worth benchmarking on larger problems.
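
For example, plain timeit.repeat gives the spread directly (just a sketch, on an arbitrary dataset):

import timeit
import numpy as np

# Repeat the fit several times and look at the spread, not just the best run.
setup = """
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
X, y = make_regression(n_samples=10000, n_features=50, random_state=0)
clf = ElasticNet(precompute=False)
"""
runs = np.array(timeit.repeat("clf.fit(X, y)", setup=setup, repeat=5, number=10)) / 10
print("mean %.1f ms, std %.1f ms" % (1e3 * runs.mean(), 1e3 * runs.std()))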

@ogrisel
Member

ogrisel commented Jun 3, 2014

There seems to be just a small margin of speed gain in the case when n_samples >> n_features. What more can we do to get this verified?

I played a bit more with noisy data generated with make_regression and I could never get the precompute=True version to be significantly faster than the precompute=False version. @agramfort any data in mind?

@MechCoder
Member Author

The build passes now.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 9a31569 on MechCoder:fix_precompute into 0a61119 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Jun 3, 2014

@MechCoder can you please address #3220 (comment)? Unexpected values for the precompute parameter should raise a ValueError instead of silently falling back to the "auto" mode.
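
Something along these lines would do (a sketch only; _check_precompute is a hypothetical helper name, and an explicit Gram matrix passed as an array should of course stay accepted):

import numpy as np

def _check_precompute(precompute):
    # Hypothetical validation helper: an ndarray is an explicit Gram matrix.
    if isinstance(precompute, np.ndarray):
        return precompute
    if precompute not in (True, False, 'auto'):
        raise ValueError("precompute should be True, False, 'auto' or "
                         "array-like, got %r" % (precompute,))
    return precompute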

a] Raise ValueError for invalid precompute
b] Remove precompute for MultiTask ENet/LassoCV
@MechCoder
Member Author

@ogrisel Fixed.

I also removed precompute from MultiTaskElasticNet / Lasso CV since it is not being used.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling a5fe51f on MechCoder:fix_precompute into 0a61119 on scikit-learn:master.

@agramfort
Member

ok. Looks like the deal is done. LGTM. @ogrisel it is difficult to find relevant datasets with n_samples >> n_features. The Gram trick is however useful for dictionary learning, cf. dict_learning.py.

+1 for merge on my side.

@ogrisel feel free to merge tomorrow if you're happy.

@ogrisel
Member

ogrisel commented Jun 4, 2014

@ogrisel feel free to merge tomorrow if you're happy.

I am not that happy: I don't think we should keep an "auto" mode that is never useful and not used by default: I would rather deprecate it explicitly and add tests to check that the deprecation warnings work.

@ogrisel
Member

ogrisel commented Jun 4, 2014

2] Remove precompute from MultiTaskElasticNet/Lasso CV

Why was the precompute option removed for the MultiTaskElasticNet/Lasso CV classes? Is it broken, does setting precompute=True cause a crash on those guys?

In any case we cannot change the public API (removing parameters) without going through a deprecation cycle.

@ogrisel
Member

ogrisel commented Jun 4, 2014

3] Use gram variant when precompute="auto" or True

The initial goal of precompute="auto" was to automatically determine whether or not to precompute the Gram matrix based on the n_samples < n_features test as performed in the sklearn.linear_model.base._pre_fit helper function: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L400

We should respect that contract.

If the n_samples < n_features heuristic test is a bad test, as seems to be demonstrated by the benchmarks in this PR, then we might decide to come up with a better heuristic or decide to deprecate the precompute="auto" option entirely.

In any case we should not silently change the behavior of precompute="auto" to precompute=True without issuing a deprecation warning, updating all the docstrings to remove any reference to the auto mode, and refactoring the _pre_fit helper method.
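
For reference, a minimal sketch of what respecting that contract looks like (_resolve_precompute is a hypothetical name; the real logic lives in _pre_fit and enet_path):

import numpy as np

def _resolve_precompute(precompute, n_samples, n_features):
    # Hypothetical resolution of the 'auto' mode (assuming precompute is
    # one of True, False, 'auto'): precompute the Gram matrix only when
    # it is small, i.e. when n_samples > n_features.
    if precompute == 'auto':
        return n_samples > n_features
    return bool(precompute)

# If this resolves to True, enet_path would hand the solver
# Q = np.dot(X.T, X) and q = np.dot(X.T, y) instead of X and y.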

@ogrisel
Member

ogrisel commented Jun 4, 2014

Also the number of targets might have an impact on whether or not precompute=True is faster than precompute=False. We have not benchmarked that either.

@MechCoder
Member Author

Why was the precompute option removed for the MultiTaskElasticNet/Lasso CV classes? Is it broken, does setting precompute=True cause a crash on those guys?

MultiTaskElasticNet and MultiTaskLasso CV use a different objective function than ElasticNet and Lasso CV. So the existing cd_fast.enet_coordinate_descent_gram is valid only for 1-dimensional y (this is evident from the function signature). While writing this I did not notice that precompute was unused.

In any case we cannot change the public API (removing parameters) without going through a deprecation cycle.

Is this true, even if it has not been a part of a public release? These were added by me recently.

The initial goal of precompute="auto" was to automatically determine whether or not to precompute the Gram matrix based on the n_samples < n_features test as performed in the sklearn.linear_model.base._pre_fit helper function

Err, yes, I had meant that. Sorry for being ambiguous. I have updated my PR to use the Gram variant when precompute="auto" and n_samples > n_features, or when precompute is True.

Also the number of targets might have an impact on whether or not precompute=True is faster than precompute=False

As mentioned before, ElasticNet and Lasso CV raise errors for multiple targets. If we need to fit multiple targets, we either need to do

models = [clf.fit(X, y[:, i]) for i in range(n_targets)]

and then use these individually, or directly use MultiTaskElasticNet or MultiTaskLasso (CV), which do not have a Gram variant.
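
i.e., roughly something like this (just a sketch, assuming a 2d y; clone is used so that each target gets an independently fitted estimator):

from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=20, n_targets=3, random_state=0)
base = ElasticNetCV(precompute=False)
# One independently fitted CV model per target column.
models = [clone(base).fit(X, y[:, i]) for i in range(y.shape[1])]
alphas = [m.alpha_ for m in models]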

By the way, I'm already much behind according to my GSoC timeline. Is that ok?

@GaelVaroquaux
Member

On Wed, Jun 04, 2014 at 07:48:52AM -0700, Manoj Kumar wrote:

In any case we cannot change the public API (removing parameters) without
going through a deprecation cycle.

Is this true, even if it has not been a part of a public release? These were
added by me recently.

If it hasn't been released, it's not a problem.

By the way, I'm already much behind according to my GSoC timeline.

Yes, it's not the end of the world, but I agree that we need to keep in
mind the big picture, and therefore consider prioritization. Let's
discuss this in another thread.

@ogrisel
Member

ogrisel commented Jun 5, 2014

By the way, I'm already much behind according to my GSoC timeline.

We won't hurry any merge to master because of the GSoC timeline. Especially as we are about to cut the 0.15 branch.

I need to find more time to review those changes in deeper details but I don't have the bandwidth to do so now unfortunately.

@ogrisel
Member

ogrisel commented Jun 5, 2014

@MechCoder could you please split this PR into independent PRs for:

  1. the changes around switching the default value of precompute from 'auto' to False
  2. the actual fix to call cd_fast.enet_coordinate_descent_gram in the enet_path function when precompute is True, which AFAIK is the only change required to benchmark releasing the GIL in the Cython file
  3. the changes that remove the precompute param from the multitask models.

It seems to me that those 3 changes are independent of one another. I am not satisfied with the current state of item 1, so it will likely take longer to merge, while the other 2 items should be less controversial as they are seemingly bugfixes.

@MechCoder
Member Author

@ogrisel I've opened 3 PRs: #3247, #3248 and #3249.

Closed in favor of them.

@MechCoder MechCoder closed this Jun 5, 2014
@MechCoder MechCoder deleted the fix_precompute branch June 5, 2014 10:52
@ogrisel
Member

ogrisel commented Jun 5, 2014

Thanks!
