[MRG] Add partial_fit function to DecisionTreeClassifier #18889

Open · wants to merge 58 commits into main

Conversation

@PSSF23 (Contributor) commented Nov 20, 2020

Reference Issues/PRs

First step for #18888

What does this implement/fix? Explain your changes.

  • Add the partial_fit function to DecisionTreeClassifier

Any other comments?

A collaboration with @neurodata.

Thank you for your feedback!

@amueller (Member)

Thanks for the PR. Can you show that this is faster than building the tree from scratch?

@PSSF23 (Contributor, Author) commented Nov 20, 2020

@amueller Speed is not the main goal I have in mind. The VFDT name might cause some confusion, but this PR is more of a preliminary step that lets future algorithms focus on streaming data. In those cases, data samples arrive continuously, and saving all of them to wait for a batch fit would be quite expensive.

I will check the time differences in benchmarks though. Thanks for the advice!
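For reference, a rough sketch of such a benchmark (hypothetical: it assumes this PR's partial_fit on DecisionTreeClassifier, with classes passed on the first call as in other scikit-learn partial_fit estimators):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
batches = np.array_split(np.arange(len(X)), 10)

# Streaming: grow one tree incrementally, batch by batch (this PR).
tree = DecisionTreeClassifier(random_state=0)
start = time.perf_counter()
for idx in batches:
    tree.partial_fit(X[idx], y[idx], classes=np.unique(y))
print("partial_fit total:", time.perf_counter() - start)

# Baseline: refit from scratch on all data seen so far after each batch.
start = time.perf_counter()
seen = np.empty(0, dtype=int)
for idx in batches:
    seen = np.concatenate([seen, idx])
    DecisionTreeClassifier(random_state=0).fit(X[seen], y[seen])
print("refit-from-scratch total:", time.perf_counter() - start)
```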

@jnothman (Member) commented Dec 1, 2020 via email

@PSSF23 (Contributor, Author) commented Dec 2, 2020

@jnothman Thank you for the advice. So a separate partial_fit method would be better than a parameter in fit? I hope I understand you correctly.

I am working on benchmarking with CIFAR-10 and will get back to you when I have satisfactory results!

@glemaitre (Member)

> @jnothman Thank you for the advice. So a separate partial_fit method would be better than a parameter in fit? I hope I understand you correctly.

Yes, partial_fit is our current API for online learning.
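As an illustration of that convention, here is how partial_fit already works on an estimator that supports it, SGDClassifier (the first call must declare every class that can appear):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)

# The first call must declare all classes that will ever appear.
clf.partial_fit(np.array([[0.0], [1.0]]), np.array([0, 1]), classes=np.array([0, 1]))

# Later calls update the model with each new batch only.
clf.partial_fit(np.array([[2.0], [3.0]]), np.array([1, 0]))
print(clf.predict(np.array([[0.5]])))
```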

@PSSF23 PSSF23 changed the title Add an option to update existing trees in fit function Add partial_fit function to DecisionTreeClassifier Jan 21, 2021

Base automatically changed from master to main January 22, 2021 10:53
@PSSF23 PSSF23 changed the title Add partial_fit function to DecisionTreeClassifier Add partial_fit function to decision trees Sep 15, 2021
@PSSF23 PSSF23 changed the title Add partial_fit function to decision trees Add partial_fit function to DecisionTreeClassifier Sep 16, 2021
@PSSF23 PSSF23 changed the title Add partial_fit function to DecisionTreeClassifier [MRG] Add partial_fit function to DecisionTreeClassifier Sep 16, 2021
@thomasjpfan (Member)

> Hi @thomasjpfan, I saw that you added feature_names_in_ to tree attributes, but there are currently no implementations for it in the tree module, right?

The estimators in the tree module set feature_names_in_ when _validate_data is called:

```python
X, y = self._validate_data(
    X, y, validate_separately=(check_X_params, check_y_params)
)
```
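A small check of that behavior (assuming scikit-learn >= 1.0, where feature_names_in_ was introduced, and pandas for named columns):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame(
    {"sepal_len": [5.1, 6.2, 4.9], "petal_len": [1.4, 4.5, 1.3]}
)
y = [0, 1, 0]

tree = DecisionTreeClassifier().fit(X, y)
# _validate_data recorded the DataFrame's column names during fit:
print(tree.feature_names_in_)  # ['sepal_len' 'petal_len']
```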

@PSSF23 (Contributor, Author) left a comment

Have some tests been disabled? The code should be covered earlier...

@PSSF23 (Contributor, Author) left a comment

The test error in test_different_endianness_joblib_pickle doesn't seem to be related:

```
ValueError: Big-endian buffer not supported on little-endian compiler
```

@lesteve (Member) commented Nov 30, 2021

> The test error in test_different_endianness_joblib_pickle doesn't seem to be related:

This is actually related, although not in a trivial way. One likely fix would be to change ClassificationCriterion.__cinit__ similarly to https://github.com/scikit-learn/scikit-learn/pull/21552/files#diff-ecb2f5fb06ba7e14c6d06a8fbc811d684eaa534640d2e2f8f0102a1c4d4afca2R588 (see the sketch after the list below).

More details:

  • The test was added recently in main to cover deployment cases where the machine on which you train and pickle your model does not have the same endianness as the machine on which you unpickle it to do predictions; see Pickle portability little 🡒 big endian #21237 for more details. There is a variation of this for cross-bitness, where you train on a 64-bit machine and deploy on a 32-bit one, e.g. Support cross 32bit/64bit pickles for decision tree #21552.
  • The change in your PR that triggered this error is that you added self.builder_ to the DecisionTreeClassifier object, so the builder object (and thus the criterion object as well) needs to be pickled when pickling the DecisionTreeClassifier. ClassificationCriterion.__cinit__ is too specific about the dtype of the n_classes array it takes.
  • It is quite likely that your PR would also fail in cross-bitness deployment use cases. The catch is that the cross-bitness test is done in a more manual way (basically tweaking the output of __reduce__ to make it look like it was generated with a different bitness), so that test passes anyway.
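A minimal sketch of the shape of that fix, not the actual diff (the real ClassificationCriterion.__cinit__ lives in Cython in sklearn/tree/_criterion.pyx; ToyCriterion below is a hypothetical Python stand-in):

```python
import numpy as np

class ToyCriterion:
    """Hypothetical stand-in for ClassificationCriterion (Cython in reality)."""

    def __init__(self, n_outputs, n_classes):
        # Cast instead of requiring a platform-specific dtype, so an
        # n_classes array unpickled from a machine with a different
        # endianness or bitness is converted rather than rejected.
        self.n_outputs = n_outputs
        self.n_classes = np.asarray(n_classes, dtype=np.intp)

# An array with non-native byte order, as a cross-endian pickle would carry:
crit = ToyCriterion(n_outputs=2, n_classes=np.array([3, 2], dtype=">i8"))
print(crit.n_classes.dtype)  # native intp, regardless of input byte order
```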

@PSSF23 (Contributor, Author) commented Nov 30, 2021

@lesteve Thanks! I'll look into it. I think the problem is fixed now.

@PSSF23 (Contributor, Author) left a comment

This line:

```python
X, y = fetch_california_housing(return_X_y=True)
```

causes the following error, which is definitely unrelated this time:

```
urllib.error.HTTPError: HTTP Error 403: Forbidden
```
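One common way to keep such a test independent of the network (a suggestion, not something from this thread) is to use a dataset bundled with scikit-learn instead of a fetch_* helper:

```python
# load_* datasets ship inside scikit-learn, so no HTTP request is made,
# unlike fetch_* helpers such as fetch_california_housing.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)  # another small regression dataset
```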

@glemaitre (Member) commented Dec 1, 2021 via email
