[MRG] Add partial_fit function to DecisionTreeClassifier #18889

Open · wants to merge 58 commits into main

Conversation

@PSSF23 (Contributor) commented Nov 20, 2020

Reference Issues/PRs

First step for #18888

What does this implement/fix? Explain your changes.

  • Add the partial_fit function to DecisionTreeClassifier

Any other comments?

A collaboration with @neurodata.

Thank you for your feedback!

@amueller (Member)

Thanks for the PR. Can you show that this is faster than building the tree from scratch?

@PSSF23 (Contributor, Author) commented Nov 20, 2020

@amueller Speed is not the main goal I have in mind. The VFDT name might cause some confusion, but this PR is more of a preliminary step that lets future algorithms focus on streaming data. In those cases, data samples arrive continuously, and saving all of them to wait for a batch fit would be quite expensive.

I will check the time differences in benchmarks though. Thanks for the advice!
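For reference, a rough sketch of such a benchmark (hypothetical: it assumes this PR's partial_fit on DecisionTreeClassifier, with classes passed on the first call as in other scikit-learn partial_fit estimators):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
batches = np.array_split(np.arange(len(X)), 10)

# Streaming: grow one tree incrementally, batch by batch (this PR).
tree = DecisionTreeClassifier(random_state=0)
start = time.perf_counter()
for idx in batches:
    tree.partial_fit(X[idx], y[idx], classes=np.unique(y))
print("partial_fit total:", time.perf_counter() - start)

# Baseline: refit from scratch on all data seen so far after each batch.
start = time.perf_counter()
seen = np.empty(0, dtype=int)
for idx in batches:
    seen = np.concatenate([seen, idx])
    DecisionTreeClassifier(random_state=0).fit(X[seen], y[seen])
print("refit-from-scratch total:", time.perf_counter() - start)
```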

@jnothman (Member) commented Dec 1, 2020 via email

@PSSF23 (Contributor, Author) commented Dec 2, 2020

@jnothman Thank you for the advice. So a separate partial_fit method would be better than a parameter in fit? I hope I understand you correctly.

I am working on benchmarking with CIFAR-10 and will get back to you when I have satisfactory results!

@glemaitre (Member)

> @jnothman Thank you for the advice. So a separate partial_fit method would be better than a parameter in fit? I hope I understand you correctly.

Yes, partial_fit is our current API for online learning.
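As an illustration of that convention, here is how partial_fit already works on an estimator that supports it, SGDClassifier (the first call must declare every class that can appear):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)

# The first call must declare all classes that will ever appear.
clf.partial_fit(np.array([[0.0], [1.0]]), np.array([0, 1]), classes=np.array([0, 1]))

# Later calls update the model with each new batch only.
clf.partial_fit(np.array([[2.0], [3.0]]), np.array([1, 0]))
print(clf.predict(np.array([[0.5]])))
```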

@PSSF23 PSSF23 changed the title Add an option to update existing trees in fit function Add partial_fit function to DecisionTreeClassifier Jan 21, 2021

Base automatically changed from master to main January 22, 2021 10:53
@PSSF23 PSSF23 changed the title Add partial_fit function to DecisionTreeClassifier Add partial_fit function to decision trees Sep 15, 2021
@PSSF23 PSSF23 changed the title Add partial_fit function to decision trees Add partial_fit function to DecisionTreeClassifier Sep 16, 2021
@PSSF23 PSSF23 changed the title Add partial_fit function to DecisionTreeClassifier [MRG] Add partial_fit function to DecisionTreeClassifier Sep 16, 2021
@thomasjpfan (Member)

> Hi @thomasjpfan, I saw that you added feature_names_in_ to tree attributes, but there are currently no implementations for it in the tree module, right?

The estimators in the tree module set feature_names_in_ when _validate_data is called:

```python
X, y = self._validate_data(
    X, y, validate_separately=(check_X_params, check_y_params)
)
```
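A small check of that behavior (assuming scikit-learn >= 1.0, where feature_names_in_ was introduced, and pandas for named columns):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame(
    {"sepal_len": [5.1, 6.2, 4.9], "petal_len": [1.4, 4.5, 1.3]}
)
y = [0, 1, 0]

tree = DecisionTreeClassifier().fit(X, y)
# _validate_data recorded the DataFrame's column names during fit:
print(tree.feature_names_in_)  # ['sepal_len' 'petal_len']
```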

@PSSF23 (Contributor, Author) left a comment

Have some tests been disabled? The code should be covered earlier...

@PSSF23 (Contributor, Author) left a comment

The test error in test_different_endianness_joblib_pickle doesn't seem to be related:

```
ValueError: Big-endian buffer not supported on little-endian compiler
```

@lesteve (Member) commented Nov 30, 2021

> The test error in test_different_endianness_joblib_pickle doesn't seem to be related:

This is actually related, although not in a trivial way. One likely fix would be to change ClassificationCriterion.__cinit__ similarly to https://github.com/scikit-learn/scikit-learn/pull/21552/files#diff-ecb2f5fb06ba7e14c6d06a8fbc811d684eaa534640d2e2f8f0102a1c4d4afca2R588 (see the sketch after the list below).

More details:

  • The test was added recently in main to cover deployment cases where the machine on which you train and pickle your model does not have the same endianness as the machine on which you unpickle it to do predictions; see Pickle portability little 🡒 big endian #21237 for more details. There is a variation of this for cross-bitness, where you train on a 64-bit machine and deploy on a 32-bit one, e.g. Support cross 32bit/64bit pickles for decision tree #21552.
  • The change in your PR that triggered this error is that you added self.builder_ to the DecisionTreeClassifier object, so the builder object (and thus the criterion object as well) needs to be pickled when pickling the DecisionTreeClassifier. ClassificationCriterion.__cinit__ is too specific about the dtype of the n_classes array it takes.
  • It is quite likely that your PR would also fail in cross-bitness deployment use cases. The catch is that the cross-bitness test is done in a more manual way (basically tweaking the output of __reduce__ to make it look like it was generated with a different bitness), so that test passes anyway.
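A minimal sketch of the shape of that fix, not the actual diff (the real ClassificationCriterion.__cinit__ lives in Cython in sklearn/tree/_criterion.pyx; ToyCriterion below is a hypothetical Python stand-in):

```python
import numpy as np

class ToyCriterion:
    """Hypothetical stand-in for ClassificationCriterion (Cython in reality)."""

    def __init__(self, n_outputs, n_classes):
        # Cast instead of requiring a platform-specific dtype, so an
        # n_classes array unpickled from a machine with a different
        # endianness or bitness is converted rather than rejected.
        self.n_outputs = n_outputs
        self.n_classes = np.asarray(n_classes, dtype=np.intp)

# An array with non-native byte order, as a cross-endian pickle would carry:
crit = ToyCriterion(n_outputs=2, n_classes=np.array([3, 2], dtype=">i8"))
print(crit.n_classes.dtype)  # native intp, regardless of input byte order
```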

@PSSF23 (Contributor, Author) commented Nov 30, 2021

@lesteve Thanks! I'll look into it. I think the problem is fixed now.

@PSSF23 (Contributor, Author) left a comment

This line:

```python
X, y = fetch_california_housing(return_X_y=True)
```

causes the following error, which is definitely unrelated this time:

```
urllib.error.HTTPError: HTTP Error 403: Forbidden
```
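One common way to keep such a test independent of the network (a suggestion, not something from this thread) is to use a dataset bundled with scikit-learn instead of a fetch_* helper:

```python
# load_* datasets ship inside scikit-learn, so no HTTP request is made,
# unlike fetch_* helpers such as fetch_california_housing.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)  # another small regression dataset
```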

@glemaitre (Member) commented Dec 1, 2021 via email
