Implement secure boost scheme - secure evaluation and validation (during training) without local feature leakage #10079
base: vertical-federated-learning
Conversation
…ute under secure scenario
…valent to broadcast
…lobal best split, but need to further apply split correctly
Add alternate vertical splits
…x for training phase
Hi @trivialfis, the method implementation for secure inference is ready. I added detailed information to our RFC under the section "Design for Secure Inference - avoid leakage of feature cut value". @YuanTingHsieh will add / make modifications to the unit tests. Thanks!
```cpp
@@ -445,29 +449,27 @@ void SketchContainerImpl<WQSketch>::MakeCuts(Context const *ctx, MetaInfo const
      max_cat = std::max(max_cat, AddCategories(categories_.at(fid), p_cuts));
```
Based on my understanding, categorical features are not yet supported, right?
Right, we will need to find a proper use case / test dataset with categorical features in order to add that support. It seems categorical features were "experimental" according to some of last year's release notes; is that still the case? Maybe we can add the support later when we find it really necessary.
```cpp
      if (!is_secure_) {
        split_pt = cut_val[i];  // not used for partition based
        best.Update(loss_chg, fidx, split_pt, d_step == -1, false, left_sum, right_sum);
      } else {
        // secure mode: record the best split point, rather than the actual value
        // since it is not accessible at this point (active party finding best-split)
        best.Update(loss_chg, fidx, i, d_step == -1, false, left_sum, right_sum);
      }
    } else {
```
Do you think a policy class might help here? Or maybe there are other efficient ways to handle these conditions? I'm losing track of these conditions, considering that we have three enumeration functions:
- numeric
- partition
- one hot
Then we have three split modes:
- column
- row
- column + secure
So, in combination, 9 potential cases, and we haven't counted vector leaf yet. Need to find a better way to manage these conditions.
It can be tricky to consolidate, since the 9 cases overlap heavily (e.g., the same enumeration logic applies to all split modes except secure + passive party), and some further processing applies only to col_split (with or without secure) but is irrelevant to enumeration.
Another thing regarding these mode combinations: with the upcoming processor interface we will potentially be able to enable encrypted horizontal training. Shall we then add a row + secure mode, a fourth value for

```cpp
enum class DataSplitMode : int { kRow = 0, kCol = 1, kColSecure = 2 };
```

? (Or maybe there are better options?)
My preference would be to put it in the CommunicatorContext: a configuration flag for whether the channel is encrypted.
Hi, could you please share how to run some high-level tests?
Sure, this is what I am using for testing:
Another general challenge for any vertical pipeline: at inference time, all parties need to be online, and since our model records the "global feature index", the "order" of the clients needs to remain the same. We may need some mechanism to ensure this order.
I will leave that to nvflare.
The code looks good to me overall. We can merge it once we have some basic unittests.
As for integration tests in Python with nvflare (in future PRs), we can assert that
- models are different for different workers.
- predictions are the same
- evaluation results are the same
  - only works if the 0th worker has the label.
I highly recommend using the hypothesis test framework (see the Python tests in xgboost and search for the term hypothesis).
Thanks! @YuanTingHsieh, could you add the unit tests according to @trivialfis's suggestions?
Those points are all for integration tests, not for small unit tests. I think the integration tests in Python with nvflare will take more effort; we don't need to rush them in this PR.
Hi, is there any update?
Thanks for asking! :) @YuanTingHsieh has been busy with a related NVFlare release over the past two weeks; now that the release is close to finished, he will have time to work on this soon.
Add secure inf unit tests
@trivialfis Yuanting just added some unit tests. There seems to be a failed R test, but we are not sure whether it is related to our modifications; the error message being
That should be unrelated; I will look into this PR today.
Hi @trivialfis, thanks for the updates, just merged it.
Triggered the rest of the CI.
Hi @trivialfis, there are 3 failed checks, but I think they align with the rebase merge. Shall we just merge this? Thanks!
For implementing Vertical Federated Learning with Secure Features, as discussed in
#9987
This part is independent of the encryption and the alternative vertical pipeline. The purpose is to avoid leaking the real cut-value information from participants; hence it is added as a separate PR.
This PR is based on #10037, which should be reviewed and merged first.