
Add a facility that allows random forest classifiers to be combined after training #26326

Open
davedice opened this issue May 4, 2023 · 2 comments
Labels
module:ensemble · Needs Decision - Include Feature · New Feature

Comments

davedice commented May 4, 2023

Describe the workflow you want to enable

In a federated environment, I have federation elements that build private random forest classifiers, which I would like to combine after the fact into a single random forest.

Describe your proposed solution

See the "alternatives" section.

Describe alternatives you've considered, if relevant

Stacking might suffice as a work-around, although I'd like to avoid that.

As a throw-away experiment, simply concatenating all the constituent decision-tree estimators into a common estimators_ array (and adjusting n_estimators to match) seems to work superficially, but it clearly isn't good practice.
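For concreteness, here is a minimal sketch of that throw-away experiment (toy stand-in data; it assumes both forests were fitted on exactly the same set of labels, so their classes_ already match):

```python
from copy import deepcopy

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for two federation members training on disjoint private data.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=5, random_state=0)
rf_a = RandomForestClassifier(n_estimators=50, random_state=1).fit(X[:200], y[:200])
rf_b = RandomForestClassifier(n_estimators=50, random_state=2).fit(X[200:], y[200:])

# Naive concatenation: glue the fitted trees together and adjust the count.
combined = deepcopy(rf_a)
combined.estimators_ += rf_b.estimators_
combined.n_estimators = len(combined.estimators_)

print(combined.score(X, y))
```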

In addition, this concatenation approach can fail if, say, we try to combine random forest #1, which has classes_ of [dog, cat], with forest #2, which has classes_ of [cow, dog, cat]. To address that concern, I looked at forcing the union of all possible classes (over all the forests) into the resulting combined forest and its underlying trees. This appears to work at some level, but it doesn't handle the resulting mismatch in oob_decision_function_, which is shaped according to n_classes_.
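A related workaround that sidesteps the in-place surgery entirely (not what I tried above, just a sketch): leave each federation forest untouched and combine only at prediction time, remapping every forest's predict_proba columns into the union of all classes_ before averaging:

```python
import numpy as np

def soft_vote(forests, X):
    """Average class probabilities over forests whose classes_ may differ."""
    # Union of labels across all forests; np.unique returns them sorted, which
    # matches the sorted order scikit-learn itself uses for classes_.
    union = np.unique(np.concatenate([f.classes_ for f in forests]))
    proba = np.zeros((X.shape[0], union.shape[0]))
    for f in forests:
        cols = np.searchsorted(union, f.classes_)  # column positions in the union
        proba[:, cols] += f.predict_proba(X)
    proba /= len(forests)
    return union[np.argmax(proba, axis=1)], proba
```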

Another approach to dealing with classes_ heterogeneity is to make sure each federation forest is exposed to the full gamut of potential classes during training. (Even then, one worries about the order of the elements found in classes_: [dog, cat] vs. [cat, dog].) It appears that classes_ is constructed before any bootstrap sampling, so, assuming we can rely on that implementation detail and expose each federation member to a consistently ordered, specially constructed "gamut" prepended to its X, we can (hopefully) expect all forest instances to have identical classes_ with the same elements in the same order. That, in turn, would make combining the forests easier. Ensuring complete exposure via the "gamut" might also impact accuracy. (The training "gamut" is a minimal set of X records that produces all possible y categorical values.)
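A small sketch of the "gamut" idea (hypothetical labels and helper name): prepend one representative row per possible class to every member's local training data. Since scikit-learn derives classes_ from np.unique(y), every member then ends up with the same labels in the same sorted order:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Agreed-upon label universe shared by every federation member (hypothetical).
ALL_CLASSES = np.array(["cat", "cow", "dog"])

def fit_with_gamut(X_local, y_local, gamut_X, **rf_kwargs):
    """Fit a forest after prepending the gamut rows.

    gamut_X[i] is a hand-picked feature row whose label is ALL_CLASSES[i],
    so the fitted forest is guaranteed to see every possible class.
    """
    X_aug = np.vstack([gamut_X, X_local])
    y_aug = np.concatenate([ALL_CLASSES, y_local])
    # classes_ is computed from np.unique(y_aug), giving every member's forest
    # the identical, sorted ALL_CLASSES ordering.
    return RandomForestClassifier(**rf_kwargs).fit(X_aug, y_aug)
```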

Additional context

R's randomForest package provides a combine() function.

davedice added the Needs Triage and New Feature labels on May 4, 2023
betatim (Member) commented May 5, 2023

For features like this, it would be good to find one or several papers or other examples "out in the wild" to see how people are solving this problem. Scikit-learn doesn't really want to be in the position of inventing solutions; instead, we like to "follow the consensus".

Do you know of papers (or other sources) describing how to solve this problem? In addition, there are fairly stringent requirements for adding something new: https://scikit-learn.org/dev/faq.html#selectiveness

davedice (Author) commented May 5, 2023

Thanks, and I appreciate the prudent stewardship of scikit-learn.

I know you were looking for solutions, but for the moment, to help motivate the idea, I've pasted below a quick list of GitHub projects that try to glue forests together. I didn't try to chase down how often combine() is used in R. And I understand if you need to close out the feature request, as it's likely to be useful in only a very small number of cases.

https://github.com/Alexsandruss/federated-learning-experiments/blob/f173d0f8ba15f976b8cbc8d607987174901fd5e8/trees-aggregation/run_simulation.py#L116

https://github.com/AnnikaLarissa/MIMIC-IV/blob/b0336bd99fcfa655ff114b0bef9d998ebfde9a82/federated_forest.ipynb [Cell #6]

https://github.com/f4b1an92/Master-Thesis/blob/3bb46e8c08a757af994c8e7e1106008d2e81ce05/code/modules/hfed_models.py#L680

https://github.com/Lalezish/FL_RF_DP2/blob/1397af9737f01c6b8e8a15f4227d634f678999f7/myFL.py#L24

https://github.com/zenas91/FedForest/blob/db8ee899fddfb450d0929935a711f6ab66f20a6a/fedforest/strategy.py#L88

thomasjpfan added the module:ensemble and Needs Decision - Include Feature labels and removed the Needs Triage label on May 31, 2023