
[MRG+1] Use more natural class_weight="auto" heuristic #4347

Merged
merged 4 commits into from Jun 1, 2015

Conversation

amueller
Member

@amueller amueller commented Mar 5, 2015

Fix for #4324.
No dedicated test yet. Travis passes, so our existing tests for this aren't very strict ^^ (apart from the manual reimplementation in the test).

# excerpt from the new test; logreg1 and the X_*, y_* datasets are defined elsewhere in the diff
logreg0 = LogisticRegression(class_weight="auto").fit(X_0, y_0)
logreg = LogisticRegression(class_weight="auto").fit(X_, y_)
assert_array_almost_equal(logreg1.coef_, logreg0.coef_)
assert_array_almost_equal(logreg.coef_, logreg0.coef_)
Member Author

This is not true in master.

@amueller amueller changed the title [WIP] Use more natural class_weight="auto" heuristic [MRG] Use more natural class_weight="auto" heuristic Mar 5, 2015
@trevorstephens
Contributor

Hey @amueller & @ogrisel, as I mentioned on Gitter, the change (while I don't dispute that it is much more intuitive) could break reproducibility of results between scikit-learn versions. I ran a couple of quick toy-dataset tests and there are some differences in results between the implementations. I think a comment about the effect in the what's new would be a good idea.

from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in later releases

X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           weights=[0.8, 0.2], random_state=415)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=415)

Logit:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=415, class_weight='auto')
clf.fit(X_train, y_train)
print clf.predict_proba(X_test)[:, 1]
print clf.coef_

Old version:

[ 0.28526629  0.99479878  0.77898055  0.01463206  0.02895576  0.47175511
  0.23164939  0.08577271  0.96457237  0.77217252  0.98978379  0.34129472
  0.01877176  0.996313    0.05685967  0.99951695  0.93041836  0.0174167
  0.19766607  0.31876434]
[[-0.15651352 -0.23085976 -0.22061261  0.89981967  0.64370659 -0.50625197
   0.26951804 -0.05344735 -0.20691435  0.40288703 -0.70316265  0.23589797
   0.1325999   0.91270498 -0.53125852  0.56085782 -0.74460298  0.71283277
  -0.78188798 -0.2205095 ]]

New version:

[ 0.27663815  0.99687075  0.80955765  0.00809551  0.02114547  0.42299626
  0.18592452  0.0623878   0.97494142  0.7559168   0.99441047  0.28947663
  0.01098863  0.9981927   0.05008777  0.99979476  0.94687392  0.01078669
  0.17458846  0.2906671 ]
[[-0.16641486 -0.2567103  -0.26308952  0.9869177   0.73400539 -0.58061991
   0.28760975 -0.06580417 -0.24562278  0.45810175 -0.79721965  0.26902468
   0.15327072  1.02790971 -0.61152246  0.63661494 -0.81292435  0.78248383
  -0.88835426 -0.26757662]]

Forests:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=415, class_weight='auto')
clf.fit(X_train, y_train)
print clf.predict_proba(X_test)[:, 1]
print clf.feature_importances_

Old version:

[ 0.2    0.435  0.29   0.255  0.12   0.21   0.415  0.08   0.255  0.265
  0.325  0.43   0.165  0.35   0.095  0.42   0.34   0.1    0.265  0.155]
[ 0.05371131  0.05131622  0.03093554  0.07824664  0.03288011  0.03451192
  0.05496429  0.04458952  0.04265068  0.02835565  0.04921904  0.02570175
  0.03054679  0.03845795  0.02101664  0.0634564   0.08183292  0.13486313
  0.07393151  0.028812  ]

New version:

[ 0.205  0.42   0.315  0.27   0.12   0.2    0.38   0.075  0.25   0.24
  0.39   0.42   0.165  0.415  0.105  0.45   0.36   0.105  0.305  0.14 ]
[ 0.05633612  0.05142512  0.02951519  0.07920261  0.03280725  0.03518552
  0.0541076   0.04417705  0.04117407  0.02718026  0.04751108  0.02731006
  0.03113306  0.03888567  0.02224603  0.06455641  0.08175883  0.13231104
  0.07435735  0.02881968]

@amueller
Member Author

amueller commented Mar 9, 2015

@trevorstephens do you think the whatsnew entry that I added is not sufficient? This will definitely change results, but as I remarked elsewhere, if the user grid-searches C in the linear models, the results will be the same. Not sure about the forests.

@hannawallach
Contributor

Sorry to have been silent (was at a conference). Yes, if the user grid searches C in any of the linear models they can theoretically get the same optimized parameters as before, depending on how many C values they search over. Specifically, if w0, ..., wM are the old class weights and w'0, ..., w'M are the new class weights, such that w'm = lambda * wm where lambda is the constant described above, then the optimized parameters of the model w/ class weights w0, ..., wM and an inverse regularization parameter C will be identical to the optimized parameters of the model w/ class weights w'0, ..., w'M and an inverse regularization parameter C / lambda.

So, without grid searching over C (or setting C to whatever value was used previously divided by lambda), this change to class weights will produce different results. So I think there are two options: leave "auto" as it is and just be much clearer about what it does (even acknowledging that maybe it's not the most obvious thing to do, but differs only in the "effective" number of data points) or change it.

If we go the latter route, we also need to give a much clearer explanation in the documentation of what "auto" does and how it relates to the value of C, but we should also make very clear somewhere (and I don't know where) exactly how the change will affect people's results with a fixed C value and how they should set C to recreate their old results. (I think this is important, not least because changes to default parameters can be very confusing and easy to overlook/miss. I ran into this last week w/ the change to the default min_df value in CountVectorizer() -- ugh.)

Anyway, all that said, I'd like to hold off on changing anything for another 24 hours or so. Two reasons: 1) I have no idea how this change will affect random forests or other (nonlinear) classifiers. Someone should investigate this before changing anything and should really do so in terms of writing down math, in addition to plugging in values and seeing what happens. 2) I'd like to find out what other weighting schemes (if any) other people (e.g., Hal, John, etc.) can think of just to check that there aren't other, even more natural, options that we're overlooking.
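
The C / lambda equivalence described above is easy to check numerically. A minimal sketch (not code from this PR; dataset, weights and scaling constant are arbitrary): scaling every class weight by a constant lambda while dividing C by the same constant leaves the penalized logistic objective, and hence the fitted coefficients, unchanged up to solver tolerance.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

old_weights = {0: 1.0, 1: 4.0}     # any fixed class weights
lam = 25. / 24.                    # an arbitrary uniform scaling constant ("lambda" above)
new_weights = {k: lam * v for k, v in old_weights.items()}

C = 1.0
clf_old = LogisticRegression(class_weight=old_weights, C=C, max_iter=1000).fit(X, y)
clf_new = LogisticRegression(class_weight=new_weights, C=C / lam, max_iter=1000).fit(X, y)

print(np.abs(clf_old.coef_ - clf_new.coef_).max())   # ~0: same optimum, up to solver tolerance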

@amueller
Member Author

amueller commented Mar 9, 2015

Thanks for your feedback. We do document all changes in the API change summary and the bug fixes; unfortunately, that list is rather long, so it is a bit unlikely that users will go through every entry on an upgrade. I don't know of a better solution, though.
This PR does include better documentation of what class_weight="auto" does, though it would still require the user to read the code to find it, so it is not ideal. With the new heuristic, it would be easy to add a line to all class_weight docstrings; the old one takes a bit longer to explain.

We could also make the compute_class_weight function public and refer to it.
For the trees, maybe @glouppe has some input?
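
To make the two heuristics concrete, here is a short sketch (my own numpy reimplementation, not scikit-learn code; the old "auto" formula follows the mean_weight reimplementation quoted from the test further down in this thread):

import numpy as np

y = np.array([1, 1, 1, -1, -1])
classes, counts = np.unique(y, return_counts=True)   # classes [-1, 1], counts [2, 3]

# old "auto": inverse class frequencies, divided by their mean
recip_freq = 1.0 / counts
old_weights = dict(zip(classes, recip_freq / recip_freq.mean()))

# new heuristic: n_samples / (n_classes * count of each class)
new_weights = dict(zip(classes, len(y) / (len(classes) * counts.astype(float))))

print(old_weights)   # roughly {-1: 1.2, 1: 0.8}
print(new_weights)   # roughly {-1: 1.25, 1: 0.8333...}

Note that the two dicts differ only by a constant factor (the lambda discussed above), so a public compute_class_weight would make it straightforward to document either formula.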

@trevorstephens
Contributor

@amueller @hannawallach ... I guess I just meant that if people had hard-coded a 'good' set of parameters, through grid search or heuristics, their models might change a bit. Some comment about this in the what's new is what I was suggesting.

As for forests and trees, class_weight is not yet in a public release, so I wouldn't worry too much about that, as few people are likely using it. Though I would say this change should try to get into the final 0.16 release so as not to mess with it down the road. Any change in the scaling of the sample weights shouldn't do much to the top nodes of the trees, as far as I can tell, but at the bottom of fully grown trees, where tiny gini improvements are split on, floating point representation fun might kick in. I'll try to introspect a couple of trees tonight to see what their structures look like between implementations.
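
A small sketch (my own illustration, not code from this thread) of why a uniform rescaling of the sample weights leaves gini-based splits unchanged in exact arithmetic: the weighted class proportions, and therefore the impurity, are scale-invariant, so only floating-point rounding is left to flip near-ties deep in a fully grown tree.

import numpy as np

def weighted_gini(y, sample_weight):
    # gini impurity computed from weighted class proportions
    total = sample_weight.sum()
    p = np.array([sample_weight[y == c].sum() / total for c in np.unique(y)])
    return 1.0 - np.sum(p ** 2)

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=50)
w = rng.rand(50)

g1 = weighted_gini(y, w)
g2 = weighted_gini(y, 1.3 * w)   # arbitrary uniform rescaling of the sample weights
print(g1 - g2)                   # ~0; any nonzero value is pure floating-point noise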

@amueller
Member Author

amueller commented Mar 9, 2015

I'm slightly surprised that the tree output changes. Have you checked whether the tree itself changes or only the probability estimates?
The incompatibility should not be an issue here, as you say. I am just curious as to why this happens.

@amueller
Member Author

amueller commented Mar 9, 2015

@trevorstephens what was the data you compared the trees on?

@amueller
Member Author

amueller commented Mar 9, 2015

Never mind, I reproduced the tree things here: #4366
Interesting....
For this fix, I think the linear models are the more relevant concern, and we need to decide whether to:

  • support both methods, as "auto" and "auto2"
  • deprecate "auto" and introduce "balance_samples" (or any other name for the new heuristic)
  • change the behavior of "auto" as done here, and document in whatsnew (as done in this PR)

I'm not super happy about behavior changes, as they can surprise users and we can't really assume users read whatsnew.rst.

@trevorstephens
Contributor

@trevorstephens what was the data you compared the trees on?

Was in the first code block of the post, just a rather small unbalanced make_classification:

X, y = make_classification(n_samples=100, n_features=20, n_informative=10, 
                           weights=[0.8, 0.2], random_state=415)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=415)

@hannawallach
Contributor

Personally, I'd prefer option number 2 (deprecate "auto" and introduce "balance" or some similar name) just so that it's very clear to everyone that there's been a change. I'm open to being convinced otherwise though :-)

@amueller
Member Author

amueller commented Mar 9, 2015

If we go with number 2, does anyone have opinions on whether we should have the old tests for auto in addition to new tests for the new version?
Also @ogrisel any opinions on which of the three options you'd prefer? [or anyone else]

@trevorstephens
Contributor

I'd prefer the 2nd, deprecate, option as well. From a Kaggler POV, I want to be able to reproduce results wherever possible for verification, or just for blending runs. Plus 'balance' or something seems more informative than 'auto' IMO.

As requested, looking into the two forests, there's definitely some strange stuff going on. The first tree in the ensemble already diverges, albeit only slightly: not necessarily in gini or sample counts, possibly not even in the outcome, but you can see a different variable being split on at the far-left node. Other trees further down the ensemble diverge even more than this one, with differing gini values down the tree. Using:

from sklearn import tree

tree.export_graphviz(clf.estimators_[0], out_file=out)  # `out` is a path or file handle for the .dot output

Master:

[graphviz rendering of the first tree on master; image omitted]

New version:

[graphviz rendering of the first tree with this branch; image omitted]

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 9, 2015 via email

@amueller
Member Author

amueller commented Mar 9, 2015

@GaelVaroquaux a deprecation cycle, meaning a name change and then removal of the old name (just to confirm).
@trevorstephens see the discussion in #4366. This is not really related to the issue.

@trevorstephens
Contributor

Ok, will head on over there. Sorry to pollute the waters.

@amueller amueller force-pushed the class_weight_auto branch 5 times, most recently from 1920ff0 to e99705c Compare March 10, 2015 20:54
@amueller
Member Author

This has the deprecations and additional parameter now. Any reviews?

@amueller
Member Author

Well, I don't see a different way, so maybe it will be "balanced_subsample".

@amueller
Member Author

Ok should be good now.

@amueller amueller added this to the 0.17 milestone May 19, 2015
-    1: 1. / 3 / mean_weight,
-    -1: 1. / 2 / mean_weight,
+    1: 5. / (2 * 3),
+    -1: 5. / (2 * 2)
     }
Member

To make this test easier to read I would do:

import numpy as np

y = np.array([1, 1, 1, -1, -1])
n_samples = len(y)
n_classes = len(np.unique(y))

# n_samples / (n_classes * class count); assumes true division
# (Python 3 or `from __future__ import division`)
class_weight = {
    1: n_samples / (np.sum(y == 1) * n_classes),
    -1: n_samples / (np.sum(y == -1) * n_classes),
}

@ogrisel
Member

ogrisel commented May 29, 2015

Besides minor comments, LGTM.

@ogrisel ogrisel changed the title [MRG] Use more natural class_weight="auto" heuristic [MRG+1] Use more natural class_weight="auto" heuristic May 29, 2015
@@ -56,6 +56,9 @@ Enhancements
   :class:`linear_model.LogisticRegression`, by avoiding loss computation.
   By `Mathieu Blondel`_ and `Tom Dupre la Tour`_.

+- Improved heuristic for ``class_weight="auto"`` for classifiers supporting
+  ``class_weight`` by Hanna Wallach and `Andreas Müller`_
Member

There should be a bit more detail here: at least explain that class_weight="auto" has been deprecated in favor of class_weight="balanced", which provides invariance to class imbalance introduced by re-sampling with replacement for linear models.

Maybe also mention class_weight='balanced_subsample' for forest models.

@GaelVaroquaux
Member

Aside from the small comments by @ogrisel and myself, 👍 for merge. Thank you

@GaelVaroquaux
Member

Have you seen that the tests are failing?

@amueller
Member Author

amueller commented Jun 1, 2015

fixed.

@GaelVaroquaux
Member

OK. Merging!

GaelVaroquaux added a commit that referenced this pull request Jun 1, 2015
[MRG+1] Use more natural class_weight="auto" heuristic
@GaelVaroquaux GaelVaroquaux merged commit 01aff46 into scikit-learn:master Jun 1, 2015
@amueller
Member Author

amueller commented Jun 1, 2015

thanks :) 🍻

trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jun 4, 2015
trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jun 4, 2015
trevorstephens added a commit to trevorstephens/scikit-learn that referenced this pull request Jun 6, 2015
rebase on top of scikit-learn#4347

improve error message

update error msg
@amueller amueller deleted the class_weight_auto branch May 19, 2017 20:44