
[MRG+1] Use more natural class_weight="auto" heuristic #4347

Merged
merged 4 commits into from Jun 1, 2015

Conversation

amueller
Member

@amueller amueller commented Mar 5, 2015

Fix for #4324.
No dedicated test yet. Travis passes, so our existing tests for this aren't very strict ^^ (apart from the manual reimplementation in the test).

# excerpt from the new test; logreg1 and the X_*, y_* datasets are defined elsewhere in the diff
logreg0 = LogisticRegression(class_weight="auto").fit(X_0, y_0)
logreg = LogisticRegression(class_weight="auto").fit(X_, y_)
assert_array_almost_equal(logreg1.coef_, logreg0.coef_)
assert_array_almost_equal(logreg.coef_, logreg0.coef_)
Member Author

This is not true in master.

@amueller amueller changed the title [WIP] Use more natural class_weight="auto" heuristic [MRG] Use more natural class_weight="auto" heuristic Mar 5, 2015
@trevorstephens
Contributor

Hey @amueller & @ogrisel, as I mentioned on Gitter, the change (while I don't dispute that it is much more intuitive) could break reproducibility of results between scikit-learn versions. I ran a couple of quick toy-dataset tests and there are some differences in results between the implementations. I think a comment about the effect in the what's new would be a good idea.

from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later releases

X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           weights=[0.8, 0.2], random_state=415)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=415)

Logit:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=415, class_weight='auto')
clf.fit(X_train, y_train)
print clf.predict_proba(X_test)[:, 1]
print clf.coef_

Old version:

[ 0.28526629  0.99479878  0.77898055  0.01463206  0.02895576  0.47175511
  0.23164939  0.08577271  0.96457237  0.77217252  0.98978379  0.34129472
  0.01877176  0.996313    0.05685967  0.99951695  0.93041836  0.0174167
  0.19766607  0.31876434]
[[-0.15651352 -0.23085976 -0.22061261  0.89981967  0.64370659 -0.50625197
   0.26951804 -0.05344735 -0.20691435  0.40288703 -0.70316265  0.23589797
   0.1325999   0.91270498 -0.53125852  0.56085782 -0.74460298  0.71283277
  -0.78188798 -0.2205095 ]]

New version:

[ 0.27663815  0.99687075  0.80955765  0.00809551  0.02114547  0.42299626
  0.18592452  0.0623878   0.97494142  0.7559168   0.99441047  0.28947663
  0.01098863  0.9981927   0.05008777  0.99979476  0.94687392  0.01078669
  0.17458846  0.2906671 ]
[[-0.16641486 -0.2567103  -0.26308952  0.9869177   0.73400539 -0.58061991
   0.28760975 -0.06580417 -0.24562278  0.45810175 -0.79721965  0.26902468
   0.15327072  1.02790971 -0.61152246  0.63661494 -0.81292435  0.78248383
  -0.88835426 -0.26757662]]

Forests:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=415, class_weight='auto')
clf.fit(X_train, y_train)
print clf.predict_proba(X_test)[:, 1]
print clf.feature_importances_

Old version:

[ 0.2    0.435  0.29   0.255  0.12   0.21   0.415  0.08   0.255  0.265
  0.325  0.43   0.165  0.35   0.095  0.42   0.34   0.1    0.265  0.155]
[ 0.05371131  0.05131622  0.03093554  0.07824664  0.03288011  0.03451192
  0.05496429  0.04458952  0.04265068  0.02835565  0.04921904  0.02570175
  0.03054679  0.03845795  0.02101664  0.0634564   0.08183292  0.13486313
  0.07393151  0.028812  ]

New version:

[ 0.205  0.42   0.315  0.27   0.12   0.2    0.38   0.075  0.25   0.24
  0.39   0.42   0.165  0.415  0.105  0.45   0.36   0.105  0.305  0.14 ]
[ 0.05633612  0.05142512  0.02951519  0.07920261  0.03280725  0.03518552
  0.0541076   0.04417705  0.04117407  0.02718026  0.04751108  0.02731006
  0.03113306  0.03888567  0.02224603  0.06455641  0.08175883  0.13231104
  0.07435735  0.02881968]

@amueller
Member Author

amueller commented Mar 9, 2015

@trevorstephens do you think the whatsnew entry that I added is not sufficient? This will definitely change results, but as I remarked elsewhere, if the user grid-searches C in the linear models, the results will be the same. Not sure about the forests.

@hannawallach
Contributor

Sorry to have been silent (was at a conference). Yes, if the user grid searches C in any of the linear models they can theoretically get the same optimized parameters as before, depending on how many C values they search over. Specifically, if w0, ..., wM are the old class weights and w'0, ..., w'M are the new class weights, such that w'm = lambda * wm where lambda is the constant described above, then the optimized parameters of the model w/ class weights w0, ..., wM and an inverse regularization parameter C will be identical to the optimized parameters of the model w/ class weights w'0, ..., w'M and an inverse regularization parameter C / lambda.

So, without grid searching over C (or setting C to whatever value was used previously divided by lambda), this change to class weights will produce different results. So I think there are two options: leave "auto" as it is and just be much clearer about what it does (even acknowledging that maybe it's not the most obvious thing to do, but differs only in the "effective" number of data points) or change it.

If we go the latter route, we also need to give a much clearer explanation in the documentation of what "auto" does and how it relates to the value of C, but we should also make very clear somewhere (and I don't know where) exactly how the change will affect people's results with a fixed C value and how they should set C to recreate their old results. (I think this is important, not least because changes to default parameters can be very confusing and easy to overlook/miss. I ran into this last week w/ the change to the default min_df value in CountVectorizer() -- ugh.)

Anyway, all that said, I'd like to hold off on changing anything for another 24 hours or so. Two reasons: 1) I have no idea how this change will affect random forests or other (nonlinear) classifiers. Someone should investigate this before changing anything and should really do so in terms of writing down math, in addition to plugging in values and seeing what happens. 2) I'd like to find out what other weighting schemes (if any) other people (e.g., Hal, John, etc.) can think of just to check that there aren't other, even more natural, options that we're overlooking.
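
The C / lambda equivalence described above is easy to check numerically. A minimal sketch (not code from this PR; dataset, weights and scaling constant are arbitrary): scaling every class weight by a constant lambda while dividing C by the same constant leaves the penalized logistic objective, and hence the fitted coefficients, unchanged up to solver tolerance.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

old_weights = {0: 1.0, 1: 4.0}     # any fixed class weights
lam = 25. / 24.                    # an arbitrary uniform scaling constant ("lambda" above)
new_weights = {k: lam * v for k, v in old_weights.items()}

C = 1.0
clf_old = LogisticRegression(class_weight=old_weights, C=C, max_iter=1000).fit(X, y)
clf_new = LogisticRegression(class_weight=new_weights, C=C / lam, max_iter=1000).fit(X, y)

print(np.abs(clf_old.coef_ - clf_new.coef_).max())   # ~0: same optimum, up to solver tolerance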

@amueller
Member Author

amueller commented Mar 9, 2015

Thanks for your feedback. We do document all changes in the API change summary and the bug fixes; unfortunately, that list is rather long, so it is a bit unlikely that users will go through every entry on an upgrade. I don't know of a better solution, though.
This PR does include better documentation of what class_weight="auto" does, though it would still require the user to read the code to find it, so it is not ideal. With the new heuristic, it would be easy to add a line to all class_weight docstrings; the old one takes a bit longer to explain.

We could also make the compute_class_weight function public and refer to it.
For the trees, maybe @glouppe has some input?
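
To make the two heuristics concrete, here is a short sketch (my own numpy reimplementation, not scikit-learn code; the old "auto" formula follows the mean_weight reimplementation quoted from the test further down in this thread):

import numpy as np

y = np.array([1, 1, 1, -1, -1])
classes, counts = np.unique(y, return_counts=True)   # classes [-1, 1], counts [2, 3]

# old "auto": inverse class frequencies, divided by their mean
recip_freq = 1.0 / counts
old_weights = dict(zip(classes, recip_freq / recip_freq.mean()))

# new heuristic: n_samples / (n_classes * count of each class)
new_weights = dict(zip(classes, len(y) / (len(classes) * counts.astype(float))))

print(old_weights)   # roughly {-1: 1.2, 1: 0.8}
print(new_weights)   # roughly {-1: 1.25, 1: 0.8333...}

Note that the two dicts differ only by a constant factor (the lambda discussed above), so a public compute_class_weight would make it straightforward to document either formula.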

@trevorstephens
Contributor

@amueller @hannawallach ... I guess I just meant that if people had hard-coded a 'good' set of parameters, through grid search or heuristics, their models might change a bit. Some comment about this in the what's new is what I was suggesting.

As for forests and trees, class_weight is not yet in a public release, so I wouldn't worry too much about that, as few people are likely using it. Though I would say this change should try to get into the final 0.16 release so as not to mess with it down the road. Any change in the scaling of the sample weights shouldn't do much to the top nodes of the trees, as far as I can tell, but at the bottom of fully grown trees, where tiny gini improvements are split on, floating point representation fun might kick in. I'll try to introspect a couple of trees tonight to see what their structures look like between implementations.
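
A small sketch (my own illustration, not code from this thread) of why a uniform rescaling of the sample weights leaves gini-based splits unchanged in exact arithmetic: the weighted class proportions, and therefore the impurity, are scale-invariant, so only floating-point rounding is left to flip near-ties deep in a fully grown tree.

import numpy as np

def weighted_gini(y, sample_weight):
    # gini impurity computed from weighted class proportions
    total = sample_weight.sum()
    p = np.array([sample_weight[y == c].sum() / total for c in np.unique(y)])
    return 1.0 - np.sum(p ** 2)

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=50)
w = rng.rand(50)

g1 = weighted_gini(y, w)
g2 = weighted_gini(y, 1.3 * w)   # arbitrary uniform rescaling of the sample weights
print(g1 - g2)                   # ~0; any nonzero value is pure floating-point noise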

@amueller
Member Author

amueller commented Mar 9, 2015

I'm slightly surprised that the tree output changes. Have you checked whether the tree itself changes or only the probability estimates?
The incompatibility should not be an issue here, as you say. I am just curious as to why this happens.

@amueller
Member Author

amueller commented Mar 9, 2015

@trevorstephens what was the data you compared the trees on?

@amueller
Member Author

amueller commented Mar 9, 2015

Never mind, I reproduced the tree things here: #4366
Interesting....
For this fix, I think the linear models are the more relevant concern, and we need to decide whether to:

  • support both methods, as "auto" and "auto2"
  • deprecate "auto" and introduce "balance_samples" (or any other name for the new heuristic)
  • change the behavior of "auto" as done here, and document in whatsnew (as done in this PR)

I'm not super happy about behavior changes, as they can surprise users and we can't really assume users read whatsnew.rst.

@trevorstephens
Contributor

@trevorstephens what was the data you compared the trees on?

Was in the first code block of the post, just a rather small unbalanced make_classification:

X, y = make_classification(n_samples=100, n_features=20, n_informative=10, 
                           weights=[0.8, 0.2], random_state=415)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=415)

@hannawallach
Contributor

Personally, I'd prefer option number 2 (deprecate "auto" and introduce "balance" or some similar name) just so that it's very clear to everyone that there's been a change. I'm open to being convinced otherwise though :-)

@amueller
Member Author

amueller commented Mar 9, 2015

If we go with number 2, does anyone have opinions on whether we should have the old tests for auto in addition to new tests for the new version?
Also @ogrisel any opinions on which of the three options you'd prefer? [or anyone else]

@trevorstephens
Contributor

I'd prefer the 2nd, deprecate, option as well. From a Kaggler POV, I want to be able to reproduce results wherever possible for verification, or just for blending runs. Plus 'balance' or something seems more informative than 'auto' IMO.

As requested, looking into the two forests, there's definitely some strange stuff going on. The first tree in the ensemble already diverges, albeit only slightly: not necessarily in gini or sample counts, possibly not even in the outcome, but you can see a different variable being split on at the far-left node. Other trees further down the ensemble diverge even more than this one, with differing gini values down the tree. Using:

from sklearn import tree

tree.export_graphviz(clf.estimators_[0], out_file=out)  # `out` is a path or file handle for the .dot output

Master:

[graphviz rendering of the first tree on master; image omitted]

New version:

[graphviz rendering of the first tree with this branch; image omitted]

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 9, 2015 via email

@amueller
Member Author

amueller commented Mar 9, 2015

@GaelVaroquaux a deprecation cycle, meaning a name change and then removal of the old name (just to confirm).
@trevorstephens see the discussion in #4366. This is not really related to the issue.

@trevorstephens
Contributor

Ok, will head on over there. Sorry to pollute the waters.

@amueller amueller force-pushed the class_weight_auto branch 5 times, most recently from 1920ff0 to e99705c Compare March 10, 2015 20:54
@amueller
Member Author

This has the deprecations and additional parameter now. Any reviews?

@amueller
Member Author

Well, I don't see a different way, so maybe it will be "balanced_subsample".

@amueller
Member Author

Ok should be good now.

@amueller amueller added this to the 0.17 milestone May 19, 2015
-    1: 1. / 3 / mean_weight,
-    -1: 1. / 2 / mean_weight,
+    1: 5. / (2 * 3),
+    -1: 5. / (2 * 2)
     }
Member

To make this test easier to read I would do:

import numpy as np

y = np.array([1, 1, 1, -1, -1])
n_samples = len(y)
n_classes = len(np.unique(y))

# n_samples / (n_classes * class count); assumes true division
# (Python 3 or `from __future__ import division`)
class_weight = {
    1: n_samples / (np.sum(y == 1) * n_classes),
    -1: n_samples / (np.sum(y == -1) * n_classes),
}

@ogrisel
Member

ogrisel commented May 29, 2015

Besides minor comments, LGTM.

@ogrisel ogrisel changed the title [MRG] Use more natural class_weight="auto" heuristic [MRG+1] Use more natural class_weight="auto" heuristic May 29, 2015
@@ -56,6 +56,9 @@ Enhancements
   :class:`linear_model.LogisticRegression`, by avoiding loss computation.
   By `Mathieu Blondel`_ and `Tom Dupre la Tour`_.

+- Improved heuristic for ``class_weight="auto"`` for classifiers supporting
+  ``class_weight`` by Hanna Wallach and `Andreas Müller`_
Member

There should be a bit more detail here: at least explain that class_weight="auto" has been deprecated in favor of class_weight="balanced", which provides invariance to class imbalance introduced by re-sampling with replacement for linear models.

Maybe also mention class_weight='balanced_subsample' for forest models.

@GaelVaroquaux
Member

Aside from the small comments by @ogrisel and myself, 👍 for merge. Thank you

@GaelVaroquaux
Member

Have you seen that the tests are failing?

@amueller
Member Author

amueller commented Jun 1, 2015

fixed.

@GaelVaroquaux
Member

OK. Merging!

GaelVaroquaux added a commit that referenced this pull request Jun 1, 2015
[MRG+1] Use more natural class_weight="auto" heuristic
@GaelVaroquaux GaelVaroquaux merged commit 01aff46 into scikit-learn:master Jun 1, 2015
@amueller
Member Author

amueller commented Jun 1, 2015

thanks :) 🍻

trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jun 4, 2015
trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jun 4, 2015
trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jun 6, 2015
rebase on top of scikit-learn#4347

improve error message

update error msg
@amueller amueller deleted the class_weight_auto branch May 19, 2017 20:44