Instabilities in sample weights in trees #4366

Closed
amueller opened this issue Mar 9, 2015 · 6 comments
@amueller (Member) commented Mar 9, 2015

This has come up in #4347.
Changing all sample weights by a constant factor changes the output of the trees.
I thought this should not change the math, and the discrepancy below seems pretty large for a floating point issue:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata("MNIST original")

# Same data, same random_state; only the (constant) sample weights differ.
tree = DecisionTreeClassifier(max_depth=10, random_state=1).fit(
    mnist.data, mnist.target)
tree3 = DecisionTreeClassifier(max_depth=10, random_state=1).fit(
    mnist.data, mnist.target, sample_weight=1.1 * np.ones(len(mnist.target)))

# Number of training points on which the two trees disagree:
np.sum(tree.predict(mnist.data) != tree3.predict(mnist.data))

72

@amueller (Member, Author) commented Mar 9, 2015

ping @glouppe

@glouppe (Contributor) commented Mar 9, 2015

The output is different because in the left branch there is a tie in the Gini score: X[5] and X[16] give identical impurity improvements (gini = 0.48). I see two possible explanations:

  • Either the features are not evaluated in the same order;
  • Or there is a small difference between the two scores which becomes significant after scaling the weights. (Unfortunately, floating point summation is not invariant to scaling, i.e. \sum_i (w * v_i) == w * (\sum_i v_i) does not necessarily hold, due to rounding errors.)
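
A minimal sketch of that second point (an illustration of my own, not from #4347; the 1.1 factor just mirrors the example above):

import numpy as np

rng = np.random.RandomState(0)
v = rng.rand(10000)   # arbitrary values standing in for weighted impurity terms
w = 1.1               # constant scaling factor applied to the sample weights

lhs = np.sum(w * v)   # scale every term, then sum
rhs = w * np.sum(v)   # sum, then scale

print(lhs == rhs)     # typically False
print(lhs - rhs)      # a difference of a few ulps, enough to break an exact tie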

@amueller (Member, Author) commented Mar 9, 2015

Do you think this is worth looking into? It is a bit surprising from a user perspective. I realize that at some point we can't do much "because finite precision".

@trevorstephens (Contributor) commented:
Referring to the toy-example in #4347 ...

# scikit-learn 0.16-era imports; class_weight='auto' was later renamed 'balanced'.
from sklearn.datasets import make_classification
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           weights=[0.8, 0.2], random_state=415)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=415)
clf = RandomForestClassifier(n_estimators=200, random_state=415, class_weight='auto')
clf.fit(X_train, y_train)

The second tree in this little ensemble is even more extreme in its differences between the implementations of the class_weight heuristic in #4347, which essentially just change the scaling of the sample_weight param. The very first split is tied (on the same variable at a different cut-point), and between implementations a very different tree comes out. From the same toy example, looking at clf.estimators_[1]:

Master:

[tree visualization of clf.estimators_[1] on master: "master2"]

amueller:class_weight_auto:

[tree visualization of clf.estimators_[1] on the branch: "amueller2"]

Seems you might be right, @glouppe: the big differences between trees are due to the evaluation of ties, or floating-point almost-ties. That is probably why the feature importances change, and I guess the out-of-sample probas differ because of the variable make-up of those observations.
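
For context, a small sketch of my own (not from #4347) of what the class_weight heuristic boils down to, assuming scikit-learn's compute_sample_weight utility; the 'balanced' rule assigns each sample n_samples / (n_classes * count(its class)), and alternative implementations essentially rescale these values:

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# 80/20 class imbalance, mirroring the toy example above.
y = np.array([0] * 80 + [1] * 20)

sw = compute_sample_weight("balanced", y)   # called 'auto' in the scikit-learn of 2015

print(np.unique(sw))   # [0.625 2.5 ] -- per-class weights from the rule above
print(sw.sum())        # 100.0 -- total weight equals n_samples; other heuristics rescale this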

@trevorstephens (Contributor) commented:
BTW, I believe that the reported samples and values above are properly working off the same bootstrap sample here, just weighted differently due to the class_weight='auto' implementations...

@arjoly (Member) commented Oct 21, 2015

Closing, as I think @glouppe's comment is correct.

@arjoly closed this as completed Oct 21, 2015