
[MRG] Tree speedup #946

Merged 47 commits into scikit-learn:master on Jul 18, 2012

Conversation

@glouppe (Contributor) commented Jul 11, 2012

Concerns issue #933

Hi guys!

This is a very, very early pull request to improve the tree module in terms of training time. It is so early that, please, don't look at the code for now; it still needs A LOT of work.

The main contribution is basically to rewrite the Tree class from tree.py into a Cython class in _tree.pyx. In its current state, this already achieves a ~2x speedup w.r.t. master. Not bad, but I expect better in the changes to come.
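
For a rough idea of what the rewrite is about, here is a Python-level sketch of the flat-array node layout such a Cython tree class typically relies on (attribute names are illustrative, not the actual `_tree.pyx` API): nodes live in parallel typed arrays instead of a web of Python objects, so traversal becomes pure index arithmetic.

```python
import numpy as np

# Illustrative sketch only, not the actual _tree.pyx code: the tree is
# stored as parallel arrays indexed by node id, which Cython can walk
# without touching Python objects.
class FlatTree(object):
    def __init__(self, capacity):
        self.children_left = np.zeros(capacity, dtype=np.intp)
        self.children_right = np.zeros(capacity, dtype=np.intp)
        self.feature = np.zeros(capacity, dtype=np.intp)
        self.threshold = np.zeros(capacity, dtype=np.float64)

    def apply(self, x):
        """Return the leaf index reached by a single sample x."""
        node = 0
        # By convention in this sketch, a node is a leaf when both
        # child pointers are 0.
        while self.children_left[node] != 0 or self.children_right[node] != 0:
            if x[self.feature[node]] <= self.threshold[node]:
                node = self.children_left[node]
            else:
                node = self.children_right[node]
        return node
```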

@pprett Could you point me to the benchmark you used in #923? Thanks :)


Update 16/07: All tests now pass. This is ready for merge.

@pprett (Member) commented Jul 11, 2012

Gilles, you can find the benchmark here: https://gist.github.com/3090011

Personally, I think this benchmark is too simple, we should rather use multiple datasets with different characteristics (num_features vs. num_samples, feature types, regression vs. classification).

Maybe we can re-use something from the GBRT benchmark code: https://github.com/pprett/scikit-learn/blob/gradient_boosting-benchmarks/benchmarks/bench_gbrt.py
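
Something along these lines could be a starting point (a minimal sketch with random placeholder data, not the gist's actual setup): time `fit()` across shapes that stress num_samples vs. num_features.

```python
import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Minimal benchmark sketch: random placeholder datasets, varying the
# num_samples vs. num_features trade-off, timing only training.
rng = np.random.RandomState(0)
for n_samples, n_features in [(100000, 10), (10000, 100), (1000, 1000)]:
    X = rng.rand(n_samples, n_features)
    y = (X[:, 0] > 0.5).astype(np.int32)
    t0 = time.time()
    DecisionTreeClassifier().fit(X, y)
    print("%d x %d: %.3fs" % (n_samples, n_features, time.time() - t0))
```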

@glouppe (Contributor, Author) commented Jul 11, 2012

All tests in test_tree.py now pass, except graphviz and pickle. I'll stop here for today. The next major issue is to implement `__reduce__` and `__setstate__` to make Tree serializable. That's for tomorrow :)
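
For reference, the standard pickle protocol for this looks roughly as follows (a plain-Python sketch with hypothetical attributes, not the actual `_tree.pyx` code): `__reduce__` tells pickle how to re-create the object, and `__setstate__` restores the internal state.

```python
import pickle

class Tree(object):
    # Hypothetical attributes, for illustration only.
    def __init__(self, n_features):
        self.n_features = n_features
        self.nodes = []

    def __reduce__(self):
        # (callable, constructor args, state handed to __setstate__)
        return (Tree, (self.n_features,), self.__getstate__())

    def __getstate__(self):
        return {"nodes": self.nodes}

    def __setstate__(self, state):
        self.nodes = state["nodes"]

t = Tree(5)
t.nodes = [(0, 1, 2)]
clone = pickle.loads(pickle.dumps(t))
assert clone.n_features == 5 and clone.nodes == [(0, 1, 2)]
```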

@amueller (Member) commented

Should I run this through the profiler if I find time, or are you still making major changes?

@glouppe (Contributor, Author) commented Jul 11, 2012

I am still making major changes. I'll let you know as soon as you can start reviewing the code in more depth. Thanks :)

@glouppe (Contributor, Author) commented Jul 12, 2012

All tests in test_tree.py and test_forest.py now pass! Yay :)

@pprett Gradient Boosting still needs to be fixed though :) I was thinking maybe we could do that together, since this PR involves many changes to the Tree API. In particular, I think that _gradient_boosting.pyx could now directly make use of _tree.pyx and circumvent all the unwrapping machinery. What do you think?

@pprett (Member) commented Jul 12, 2012

great - _gradient_boosting.pyx should definitely circumvent everything that's not necessary :-)

unfortunately, I'm a little busy at the moment - I don't think I can make it in the coming days...

@glouppe (Contributor, Author) commented Jul 12, 2012

Okay then, don't worry. I will have a look at it myself first.

@bdholt1 (Member) commented Jul 17, 2012

I know we've benchmarked the tree training times; do we have any idea of the difference in tree prediction times between this branch and master?

@glouppe (Contributor, Author) commented Jul 17, 2012

```
RuntimeWarning: divide by zero encountered in log
  proba[k] = np.log(proba[k])
```

What should be the expected behavior? Put NaNs if proba[k] is 0?

@bdholt1 (Member) commented Jul 17, 2012

> RuntimeWarning: divide by zero encountered in log: proba[k] = np.log(proba[k])
>
> What should be the expected behavior? Put NaNs if proba[k] is 0?

I suppose the RuntimeWarning is as good as it gets. Perhaps, since we are expecting this warning, we can catch it so it doesn't show up in the nosetests?

@glouppe (Contributor, Author) commented Jul 17, 2012

I turned off the divide-by-0 warnings in the test suite.
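
One way to do that (an assumption about the approach, not necessarily what the commit does) is numpy's `errstate` context manager around the offending call:

```python
import numpy as np

proba = np.array([0.0, 0.25, 0.75])
# Suppress only the expected divide-by-zero warning; log(0) still
# evaluates to -inf as discussed above.
with np.errstate(divide="ignore"):
    log_proba = np.log(proba)
print(log_proba)  # [-inf -1.38629436 -0.28768207]
```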

Regarding the other issue, it is the same as before. I don't know what's best... @pprett ?

@pprett (Member) commented Jul 17, 2012

hmm... personally, a warning and -inf is fine by me - I'm not a fan of exceptions, and -inf is better than NaN IMHO

@ogrisel (Member) commented Jul 17, 2012

+1 for having a -inf. OK for catching the warning in the tests, as we expect it to happen in this case.

@glouppe (Contributor, Author) commented Jul 17, 2012

Actually np.log(0) already returns -inf and outputs the warning. So problem solved.

I was actually asking about the test failure of test_feature_importances :) What should be done?

@pprett (Member) commented Jul 17, 2012

@glouppe this is again a float32 vs. float64 issue which caused two features to flip rank - please change the ground truth to reflect the new ranking.

In master, init_error is float32; now init_error is float64. Anyway, it's just a minor issue - I'm totally fine with you simply modifying the expected ranking to reflect the new one.
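
A toy illustration of the effect, with made-up numbers: two importances that are distinct in float64 can collapse to the same float32 value, after which their relative order under argsort is essentially arbitrary.

```python
import numpy as np

imp64 = np.array([0.100000002, 0.100000001, 0.5])
imp32 = imp64.astype(np.float32)
print(imp64[0] > imp64[1])   # True -> the float64 ranking is well defined
print(imp32[0] == imp32[1])  # True -> in float32 the two become an exact tie
```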

@glouppe (Contributor, Author) commented Jul 17, 2012

@bdholt1 Can you check whether my last commit solves the issue?

@bdholt1 (Member) commented Jul 17, 2012

You couldn't make this up if you tried!

```
======================================================================
FAIL: sklearn.ensemble.tests.test_gradient_boosting.test_feature_importances
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/case.py", line 183, in runTest
    self.test(*self.arg)
  File "/vol/vssp/signsrc/brian/python/scikit-learn/sklearn/ensemble/tests/test_gradient_boosting.py", line 204, in test_feature_importances
    assert_array_equal(true_ranking, feature_importances.argsort())
  File "/usr/lib/python2.6/dist-packages/numpy/testing/utils.py", line 463, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/usr/lib/python2.6/dist-packages/numpy/testing/utils.py", line 395, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

(mismatch 15.3846153846%)
 x: array([ 3,  1,  8, 10,  2,  9,  4, 11,  0,  6,  7,  5, 12])
 y: array([ 3,  8,  1, 10,  2,  9,  4, 11,  0,  6,  7,  5, 12])
```

@bdholt1 (Member) commented Jul 17, 2012

Could it just be my build that's unstable? Or is it the GBRT feature importances algorithm that's unstable?

@pprett (Member) commented Jul 17, 2012

ok - looks like it's pretty unstable... 1 sec


@ogrisel (Member) commented Jul 17, 2012

Would it be possible to change the dataset used in the tests so that feature importances are strictly monotonic (i.e. no two features have close importances)?

Alternatively, could we implement a deterministic tie-breaking scheme (if those are exact ties and the current scheme is non-deterministic because of the use of a dictionary or similar non-deterministic data structure somewhere)?
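
A sketch of the second option (an illustration, not what the PR implements): a stable sort makes exact ties deterministic by falling back on the feature index.

```python
import numpy as np

importances = np.array([0.2, 0.1, 0.2, 0.5])
# mergesort is stable, so equal importances keep their original
# (feature index) order, making the ranking reproducible.
ranking = np.argsort(importances, kind="mergesort")
print(ranking)  # [1 0 2 3]
```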

@pprett (Member) commented Jul 17, 2012

@glouppe can you add this to the test_feature_importances test case and check if it works fine:

```python
X = np.array(boston.data, dtype=np.float32)
y = np.array(boston.target, dtype=np.float32)
```

I think there is an issue with np.asfortranarray(X, dtype=DTYPE) that I have to investigate separately.

@glouppe (Contributor, Author) commented Jul 17, 2012

This caused the ranking to change again. I added the lines anyway and updated the expected ranking in my last commit.

@glouppe (Contributor, Author) commented Jul 18, 2012

Any clue @pprett ? I tried various things but none seem to yield a stable ranking...

@pprett (Member) commented Jul 18, 2012

Gilles, are you using a 32-bit arch?


@glouppe (Contributor, Author) commented Jul 18, 2012

Yes

@pprett (Member) commented Jul 18, 2012

I noticed reproducibility issues on 32-bit archs - I don't know whether this originates from the tree code or from GBRT; the only thing I know so far is that it occurs during fitting, so the GBRT Cython code is not the source. I need to fix that in a separate bugfix - would you mind skipping the test? I'll open an issue.
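
A sketch of how the skip could look with nose, the test runner in use at the time (the actual commit may differ):

```python
from nose import SkipTest

def test_feature_importances():
    # Skipped pending a separate bugfix; the ranking is not
    # reproducible on 32-bit architectures.
    raise SkipTest("Unstable feature importance ranking on 32-bit archs.")
```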


@glouppe (Contributor, Author) commented Jul 18, 2012

My last commit disables the test. Feel free to merge :)

@bdholt1 (Member) commented Jul 18, 2012

+1

@pprett (Member) commented Jul 18, 2012

+1 for merge too - can't wait to bench gbm on the new master!

@glouppe (Contributor, Author) commented Jul 18, 2012

Okay then, I'm clicking the green button! Thanks to all of you for the reviews :) I'll open a new issue regarding the find_split algorithm.

@glouppe merged commit a2bb8f7 into scikit-learn:master on Jul 18, 2012