
[MRG] Improve stability of SGDClassifier / SGDRegressor with gradient clipping #3883

Merged
merged 2 commits into scikit-learn:master on Nov 26, 2014

Conversation

@ogrisel (Member) commented Nov 25, 2014

The squared_hinge loss of SGDClassifier (and potentially the squared loss of SGDRegressor) tends to trigger numerical overflows, even on normalized data, for some hyperparameter combinations.

This PR fixes that issue by clipping dloss at 1e12. All existing tests still pass.
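
For illustration only, here is a minimal plain-Python sketch of the clipping step; the real change lives in the Cython sgd_fast.pyx code, and the symmetric lower bound is an assumption of this sketch:

# Plain-Python sketch of the gradient clipping step (illustrative only; the
# actual change is in sgd_fast.pyx, and the symmetric bound is assumed here).
MAX_DLOSS = 1e12

def clip_dloss(dloss):
    # Clamp the per-sample loss derivative so a single extreme sample
    # cannot make the weight update overflow.
    if dloss > MAX_DLOSS:
        return MAX_DLOSS
    elif dloss < -MAX_DLOSS:
        return -MAX_DLOSS
    return dloss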

I also had to prevent strong L2 regularization combined with large learning rates from triggering negative scales (which are meaningless and can also cause numerical divergence if lower than -1). Instead, the weights are set to zero in that case. A new non-regression test highlights this case as well.
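
As a rough sketch (plain Python, assuming the usual multiplicative L2 decay factor 1 - eta * alpha; not the actual Cython code):

# Sketch of the guarded L2 weight-decay rescaling described above (assumes the
# decay factor has the form 1 - eta * alpha; illustrative, not the real code).
def l2_rescale_factor(eta, alpha):
    factor = 1.0 - eta * alpha
    # With a large learning rate and strong regularization the factor would go
    # negative, which is meaningless; floor it at zero so the weights are
    # effectively reset to zero instead of diverging.
    return factor if factor > 0.0 else 0.0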

Both non-regression tests were inspired by #3040. They both fail at epochs #2 and #3 on the iris data with the sgd_fast.pyx implementation from master.
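
The actual tests live in the scikit-learn test suite; the sketch below only illustrates the idea, using the current SGDClassifier API (max_iter/tol rather than the n_iter parameter of that time) and made-up hyperparameters:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

def test_squared_hinge_stays_finite_on_iris():
    # Aggressive, hypothetical hyperparameters in the spirit of the overflow
    # scenario; the assertion only checks that the fitted weights stay finite.
    X, y = load_iris(return_X_y=True)
    clf = SGDClassifier(loss="squared_hinge", alpha=10.0, eta0=10.0,
                        learning_rate="constant", max_iter=5, tol=None,
                        random_state=42)
    clf.fit(X, y)
    assert np.isfinite(clf.coef_).all()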

The l2 weight decay rescaling is also kept positive (or null)
in case of strong regularization.

@amueller (Member)

Can you bench this against master please?

@pprett (Member) commented Nov 25, 2014

+1 for a benchmark

otherwise looks good to me

@ogrisel (Member, Author) commented Nov 25, 2014

I am wondering whether there is a better way to compute the clipping in Cython.

@larsmans (Member)

LGTM. As long as you use >, < and not >=, <=, GCC should turn this construct into maxsd/minsd instructions on x86-64. (It can't for >= because of NaN semantics, I think. Always use > if you can!)

@ogrisel (Member, Author) commented Nov 26, 2014

The benchmark seems to show that the change is fine. Here is my script:

import numpy as np
from time import time
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)

n_samples = int(1e6)
data = rng.randn(n_samples, 100)
target = rng.randint(0, 2, n_samples)

durations = []
for i in range(10):
    t0 = time()
    SGDClassifier(n_iter=5, random_state=10).fit(data, target)
    d = time() - t0
    durations.append(d)
    print("%0.3fs" % d)

print("%0.3f+/-%0.3fs" % (np.mean(durations), np.std(durations)))

• On master:
$ python ~/tmp/bench_sgd.py
1.638s
1.619s
1.647s
1.650s
1.623s
1.629s
1.669s
1.669s
1.649s
1.660s
1.645+/-0.017s

• On this branch:
$ python ~/tmp/bench_sgd.py
1.652s
1.636s
1.627s
1.625s
1.676s
1.633s
1.671s
1.644s
1.646s
1.632s
1.644+/-0.017s

@ogrisel (Member, Author) commented Nov 26, 2014

Thanks @larsmans for the tip. Shall I merge?

larsmans added a commit that referenced this pull request Nov 26, 2014
Improve stability of SGDClassifier / SGDRegressor with gradient clipping
@larsmans merged commit f5e0ea0 into scikit-learn:master on Nov 26, 2014

@ogrisel (Member, Author) commented Nov 26, 2014

Thanks! Let me add a whats_new.rst entry.

@GaelVaroquaux (Member)

Great job!
