Instabilities in sample weights in trees #4366

Closed
amueller opened this issue Mar 9, 2015 · 6 comments
@amueller (Member) commented Mar 9, 2015

This has come up in #4347.
Changing all sample weights by a constant factor changes the output of the trees.
I thought this should not change the math, and the discrepancy below seems pretty large for a floating point issue:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata("MNIST original")

# Same data, same random_state; only the (constant) sample weights differ.
tree = DecisionTreeClassifier(max_depth=10, random_state=1).fit(
    mnist.data, mnist.target)
tree3 = DecisionTreeClassifier(max_depth=10, random_state=1).fit(
    mnist.data, mnist.target, sample_weight=1.1 * np.ones(len(mnist.target)))

# Number of training points on which the two trees disagree:
np.sum(tree.predict(mnist.data) != tree3.predict(mnist.data))

72

@amueller (Member, Author) commented Mar 9, 2015

ping @glouppe

@glouppe (Contributor) commented Mar 9, 2015

The output is different because in the left branch there is a tie in the Gini score: X[5] and X[16] give identical impurity improvements (gini = 0.48). I see two possible explanations:

  • Either the features are not evaluated in the same order;
  • Or there is a small difference between the two scores which becomes significant after scaling the weights. (Unfortunately, floating point summation is not invariant to scaling, i.e. \sum_i (w * v_i) == w * (\sum_i v_i) does not necessarily hold, due to rounding errors.)
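
A minimal sketch of that second point (an illustration of my own, not from #4347; the 1.1 factor just mirrors the example above):

import numpy as np

rng = np.random.RandomState(0)
v = rng.rand(10000)   # arbitrary values standing in for weighted impurity terms
w = 1.1               # constant scaling factor applied to the sample weights

lhs = np.sum(w * v)   # scale every term, then sum
rhs = w * np.sum(v)   # sum, then scale

print(lhs == rhs)     # typically False
print(lhs - rhs)      # a difference of a few ulps, enough to break an exact tie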

@amueller (Member, Author) commented Mar 9, 2015

Do you think this is worth looking into? It is a bit surprising from a user perspective. I realize that at some point we can't do much "because finite precision".

@trevorstephens (Contributor) commented:
Referring to the toy-example in #4347 ...

# scikit-learn 0.16-era imports; class_weight='auto' was later renamed 'balanced'.
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           weights=[0.8, 0.2], random_state=415)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=415)
clf = RandomForestClassifier(n_estimators=200, random_state=415, class_weight='auto')
clf.fit(X_train, y_train)

The second tree in this little ensemble is even more extreme in its differences between the implementations of the class_weight heuristic in #4347, which essentially just change the scaling of the sample_weight param. The very first split is tied (on the same variable at a different cut-point), and between implementations a very different tree comes out. From the same toy example, looking at clf.estimators_[1]:

Master:

[tree visualization of clf.estimators_[1] on master: "master2"]

amueller:class_weight_auto:

[tree visualization of clf.estimators_[1] on the branch: "amueller2"]

Seems you might be right, @glouppe: the big differences between trees are due to the evaluation of ties, or floating-point almost-ties. That is probably why the feature importances change, and I guess the out-of-sample probas differ because of the variable make-up of those observations.
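
For context, a small sketch of my own (not from #4347) of what the class_weight heuristic boils down to, assuming scikit-learn's compute_sample_weight utility; the 'balanced' rule assigns each sample n_samples / (n_classes * count(its class)), and alternative implementations essentially rescale these values:

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# 80/20 class imbalance, mirroring the toy example above.
y = np.array([0] * 80 + [1] * 20)

sw = compute_sample_weight("balanced", y)   # called 'auto' in the scikit-learn of 2015

print(np.unique(sw))   # [0.625 2.5 ] -- per-class weights from the rule above
print(sw.sum())        # 100.0 -- total weight equals n_samples; other heuristics rescale this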

@trevorstephens (Contributor) commented:
BTW, I believe that the reported samples and values above are properly working off the same bootstrap sample here, just weighted differently due to the class_weight='auto' implementations...

@arjoly (Member) commented Oct 21, 2015

Closing, as I think @glouppe's comment is correct.

@arjoly closed this as completed Oct 21, 2015