
[MRG] Tree speedup #946

Merged 47 commits into scikit-learn:master on Jul 18, 2012

Conversation

@glouppe (Contributor) commented Jul 11, 2012

Concerns issue #933

Hi guys!

This is a very, very early pull request to improve the tree module in terms of training time. It is so early that, please, don't look at the code for now; it still needs A LOT of work.

The main contribution is basically to rewrite the Tree class from tree.py into a Cython class in _tree.pyx. In its current state, this already achieves a ~2x speedup w.r.t. master. Not bad, but I expect better in the changes to come.
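
For a rough idea of what the rewrite is about, here is a Python-level sketch of the flat-array node layout such a Cython tree class typically relies on (attribute names are illustrative, not the actual `_tree.pyx` API): nodes live in parallel typed arrays instead of a web of Python objects, so traversal becomes pure index arithmetic.

```python
import numpy as np

# Illustrative sketch only, not the actual _tree.pyx code: the tree is
# stored as parallel arrays indexed by node id, which Cython can walk
# without touching Python objects.
class FlatTree(object):
    def __init__(self, capacity):
        self.children_left = np.zeros(capacity, dtype=np.intp)
        self.children_right = np.zeros(capacity, dtype=np.intp)
        self.feature = np.zeros(capacity, dtype=np.intp)
        self.threshold = np.zeros(capacity, dtype=np.float64)

    def apply(self, x):
        """Return the leaf index reached by a single sample x."""
        node = 0
        # By convention in this sketch, a node is a leaf when both
        # child pointers are 0.
        while self.children_left[node] != 0 or self.children_right[node] != 0:
            if x[self.feature[node]] <= self.threshold[node]:
                node = self.children_left[node]
            else:
                node = self.children_right[node]
        return node
```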

@pprett Could you point me to the benchmark you used in #923? Thanks :)


Update 16/07: All tests now pass. This is ready for merge.

@pprett (Member) commented Jul 11, 2012

Gilles, you can find the benchmark here: https://gist.github.com/3090011

Personally, I think this benchmark is too simple, we should rather use multiple datasets with different characteristics (num_features vs. num_samples, feature types, regression vs. classification).

Maybe we can re-use something from the GBRT benchmark code: https://github.com/pprett/scikit-learn/blob/gradient_boosting-benchmarks/benchmarks/bench_gbrt.py
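
Something along these lines could be a starting point (a minimal sketch with random placeholder data, not the gist's actual setup): time `fit()` across shapes that stress num_samples vs. num_features.

```python
import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Minimal benchmark sketch: random placeholder datasets, varying the
# num_samples vs. num_features trade-off, timing only training.
rng = np.random.RandomState(0)
for n_samples, n_features in [(100000, 10), (10000, 100), (1000, 1000)]:
    X = rng.rand(n_samples, n_features)
    y = (X[:, 0] > 0.5).astype(np.int32)
    t0 = time.time()
    DecisionTreeClassifier().fit(X, y)
    print("%d x %d: %.3fs" % (n_samples, n_features, time.time() - t0))
```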

@glouppe (Contributor, Author) commented Jul 11, 2012

All tests in test_tree.py now pass, except graphviz and pickle. I'll stop here for today. The next major issue is to implement `__reduce__` and `__setstate__` to make Tree serializable. That's for tomorrow :)
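
For reference, the standard pickle protocol for this looks roughly as follows (a plain-Python sketch with hypothetical attributes, not the actual `_tree.pyx` code): `__reduce__` tells pickle how to re-create the object, and `__setstate__` restores the internal state.

```python
import pickle

class Tree(object):
    # Hypothetical attributes, for illustration only.
    def __init__(self, n_features):
        self.n_features = n_features
        self.nodes = []

    def __reduce__(self):
        # (callable, constructor args, state handed to __setstate__)
        return (Tree, (self.n_features,), self.__getstate__())

    def __getstate__(self):
        return {"nodes": self.nodes}

    def __setstate__(self, state):
        self.nodes = state["nodes"]

t = Tree(5)
t.nodes = [(0, 1, 2)]
clone = pickle.loads(pickle.dumps(t))
assert clone.n_features == 5 and clone.nodes == [(0, 1, 2)]
```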

@amueller (Member) commented

Should I run this through the profiler if I find time, or are you still making major changes?

@glouppe (Contributor, Author) commented Jul 11, 2012

I am still making major changes. I'll let you know as soon as you can start reviewing the code in more depth. Thanks :)

@glouppe (Contributor, Author) commented Jul 12, 2012

All tests in test_tree.py and test_forest.py now pass! Yay :)

@pprett Gradient Boosting still needs to be fixed though :) I was thinking maybe we could do that together, since this PR involves many changes to the Tree API. In particular, I think that _gradient_boosting.pyx could now directly make use of _tree.pyx and circumvent all the unwrapping machinery. What do you think?

@pprett (Member) commented Jul 12, 2012

great - _gradient_boosting.pyx should definitely circumvent everything that's not necessary :-)

unfortunately, I'm a little busy at the moment - I don't think I can make it in the coming days...

@glouppe (Contributor, Author) commented Jul 12, 2012

Okay then, don't worry. I will have a look at it myself first.

@bdholt1 (Member) commented Jul 17, 2012

I know we've benchmarked the tree training times; do we have any idea of the difference in tree prediction times between this branch and master?

@glouppe (Contributor, Author) commented Jul 17, 2012

```
RuntimeWarning: divide by zero encountered in log
  proba[k] = np.log(proba[k])
```

What should be the expected behavior? Put NaNs if proba[k] is 0?

@bdholt1 (Member) commented Jul 17, 2012

> RuntimeWarning: divide by zero encountered in log: proba[k] = np.log(proba[k])
>
> What should be the expected behavior? Put NaNs if proba[k] is 0?

I suppose the RuntimeWarning is as good as it gets. Perhaps, since we are expecting this warning, we can catch it so it doesn't show up in the nosetests?

@glouppe (Contributor, Author) commented Jul 17, 2012

I turned off the divide-by-0 warnings in the test suite.
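
One way to do that (an assumption about the approach, not necessarily what the commit does) is numpy's `errstate` context manager around the offending call:

```python
import numpy as np

proba = np.array([0.0, 0.25, 0.75])
# Suppress only the expected divide-by-zero warning; log(0) still
# evaluates to -inf as discussed above.
with np.errstate(divide="ignore"):
    log_proba = np.log(proba)
print(log_proba)  # [-inf -1.38629436 -0.28768207]
```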

Regarding the other issue, it is the same as before. I don't know what's best... @pprett ?

@pprett (Member) commented Jul 17, 2012

hmm... personally, a warning and -inf is fine by me - I'm not a fan of exceptions, and -inf is better than NaN IMHO

@ogrisel (Member) commented Jul 17, 2012

+1 for having a -inf. OK for catching the warning in the tests, as we expect it to happen in this case.

@glouppe (Contributor, Author) commented Jul 17, 2012

Actually np.log(0) already returns -inf and outputs the warning. So problem solved.

I was actually asking about the test failure of test_feature_importances :) What should be done?

@pprett (Member) commented Jul 17, 2012

@glouppe this is again a float32 vs. float64 issue which caused two features to flip rank - please change the ground truth to reflect the new ranking.

In master, init_error is float32; now init_error is float64. Anyway, it's just a minor issue - I'm totally fine with you simply modifying the expected ranking to reflect the new one.
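
A toy illustration of the effect, with made-up numbers: two importances that are distinct in float64 can collapse to the same float32 value, after which their relative order under argsort is essentially arbitrary.

```python
import numpy as np

imp64 = np.array([0.100000002, 0.100000001, 0.5])
imp32 = imp64.astype(np.float32)
print(imp64[0] > imp64[1])   # True -> the float64 ranking is well defined
print(imp32[0] == imp32[1])  # True -> in float32 the two become an exact tie
```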

@glouppe (Contributor, Author) commented Jul 17, 2012

@bdholt1 Can you check whether my last commit solves the issue?

@bdholt1 (Member) commented Jul 17, 2012

You couldn't make this up if you tried!

```
======================================================================
FAIL: sklearn.ensemble.tests.test_gradient_boosting.test_feature_importances
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.6/nose/case.py", line 183, in runTest
    self.test(*self.arg)
  File "/vol/vssp/signsrc/brian/python/scikit-learn/sklearn/ensemble/tests/test_gradient_boosting.py", line 204, in test_feature_importances
    assert_array_equal(true_ranking, feature_importances.argsort())
  File "/usr/lib/python2.6/dist-packages/numpy/testing/utils.py", line 463, in assert_array_equal
    verbose=verbose, header='Arrays are not equal')
  File "/usr/lib/python2.6/dist-packages/numpy/testing/utils.py", line 395, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Arrays are not equal

(mismatch 15.3846153846%)
 x: array([ 3,  1,  8, 10,  2,  9,  4, 11,  0,  6,  7,  5, 12])
 y: array([ 3,  8,  1, 10,  2,  9,  4, 11,  0,  6,  7,  5, 12])
```

@bdholt1 (Member) commented Jul 17, 2012

Could it just be my build that's unstable? Or is it the GBRT feature importances algorithm that's unstable?

@pprett (Member) commented Jul 17, 2012

ok - looks like it's pretty unstable... 1 sec


@ogrisel (Member) commented Jul 17, 2012

Would it be possible to change the dataset used in the tests so that feature importances are strictly monotonic (i.e. no two features have close importances)?

Alternatively, could we implement a deterministic tie-breaking scheme (if those are exact ties and the current scheme is non-deterministic because of the use of a dictionary or similar non-deterministic data structure somewhere)?
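
A sketch of the second option (an illustration, not what the PR implements): a stable sort makes exact ties deterministic by falling back on the feature index.

```python
import numpy as np

importances = np.array([0.2, 0.1, 0.2, 0.5])
# mergesort is stable, so equal importances keep their original
# (feature index) order, making the ranking reproducible.
ranking = np.argsort(importances, kind="mergesort")
print(ranking)  # [1 0 2 3]
```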

@pprett (Member) commented Jul 17, 2012

@glouppe can you add this to the test_feature_importances test case and check if it works fine:

```python
X = np.array(boston.data, dtype=np.float32)
y = np.array(boston.target, dtype=np.float32)
```

I think there is an issue with np.asfortranarray(X, dtype=DTYPE) that I have to investigate separately.

@glouppe (Contributor, Author) commented Jul 17, 2012

This caused the ranking to change again. I added the lines anyway and updated the expected ranking in my last commit.

@glouppe (Contributor, Author) commented Jul 18, 2012

Any clue @pprett ? I tried various things but none seem to yield a stable ranking...

@pprett (Member) commented Jul 18, 2012

Gilles, are you using a 32-bit arch?


@glouppe (Contributor, Author) commented Jul 18, 2012

Yes

@pprett (Member) commented Jul 18, 2012

I noticed reproducibility issues on 32-bit archs - I don't know whether this originates from the tree code or from GBRT; the only thing I know so far is that it occurs during fitting, so the GBRT Cython code is not the source. I need to fix that in a separate bugfix - would you mind skipping the test? I'll open an issue.
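
A sketch of how the skip could look with nose, the test runner in use at the time (the actual commit may differ):

```python
from nose import SkipTest

def test_feature_importances():
    # Skipped pending a separate bugfix; the ranking is not
    # reproducible on 32-bit architectures.
    raise SkipTest("Unstable feature importance ranking on 32-bit archs.")
```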


@glouppe (Contributor, Author) commented Jul 18, 2012

My last commit disables the test. Feel free to merge :)

@bdholt1 (Member) commented Jul 18, 2012

+1

@pprett (Member) commented Jul 18, 2012

+1 for merge too - can't wait to bench gbm on the new master!

@glouppe (Contributor, Author) commented Jul 18, 2012

Okay then, I'm clicking the green button! Thanks to all of you for the reviews :) I'll open a new issue regarding the find_split algorithm.

@glouppe merged commit a2bb8f7 into scikit-learn:master on Jul 18, 2012