ENH Add support for feature names in monotonic_cst #24855
Conversation
@@ -28,7 +28,7 @@
 rng = np.random.RandomState(0)

-n_samples = 5000
+n_samples = 1000
I reduced the number of samples to make the plot less crowded while conveying the same intuitions, and to make the example run faster.
Let's keep this PR focused for now. Follow-up PR(s) should probably:
- 1: monotonic increase
- 0: no constraint
- -1: monotonic decrease
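For readers skimming the diff, here is a minimal sketch of the positional (array) form these three values are used in; the data is synthetic and invented purely for illustration:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
# The target increases with feature 0 and decreases with feature 2.
y = X[:, 0] - X[:, 2] + 0.1 * rng.randn(1000)

# One entry per feature: 1 = increase, 0 = no constraint, -1 = decrease.
model = HistGradientBoostingRegressor(monotonic_cst=[1, 0, -1])
model.fit(X, y)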
I had to remove the indentation of the bullet list to avoid a warning with the old version of Sphinx...
LGTM modulo a few unitary negative and positive review comments, ahem.
f"monotonic_cst has shape {monotonic_cst.shape} but the input data " | ||
f"X has {estimator.n_features_in_} features." | ||
) | ||
unexpected_cst = np.setdiff1d(monotonic_cst, [-1, 0, 1]) |
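For anyone else discovering numpy.setdiff1d here: it returns the sorted unique values of its first argument that are not in the second, which makes it a one-liner for spotting invalid constraint codes. A quick illustration with made-up values:

import numpy as np

monotonic_cst = np.asarray([1, 0, -1, 2, 5])
# Everything in monotonic_cst that is not one of the allowed codes.
unexpected_cst = np.setdiff1d(monotonic_cst, [-1, 0, 1])
print(unexpected_cst)  # [2 5] -> these would trigger the error message above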
TIL that numpy.setdiff1d is a thing!
I knew about it, but I must admit that GitHub Copilot suggested it to me :) Using explicit variable names such as unexpected_cst makes it very smart.
(I knew it, Olivier is a robot! One powered by Copilot ;))
scikit-learn 2.0.0: Human Learning in Python?
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
If a dict with str keys, map features to monotonic constraints by name.
If an array, the features are mapped to constraints by position.
What do you think about linking to the subsection in the example from this PR that uses a dictionary for monotonic constraints?
How would you do so? With a reStructuredText reference anchor in the "markdown" cell just before the final code snippet?
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
If a dict with str keys, map features to monotonic constraints by name.
If an array, the features are mapped to constraints by position. See
:ref:`monotonic_cst_features_names` for a usage example.
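A minimal sketch of the dict form this docstring describes, assuming the input is a pandas DataFrame so the estimator can pick up feature names; the column names and data are invented for illustration:

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = pd.DataFrame(
    {
        "age": rng.uniform(18, 90, size=1000),
        "income": rng.uniform(1e4, 1e5, size=1000),
    }
)
y = 0.05 * X["age"] + 1e-5 * X["income"] + 0.1 * rng.randn(1000)

# Features not listed in the dict ("income") are left unconstrained.
model = HistGradientBoostingRegressor(monotonic_cst={"age": 1})
model.fit(X, y)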
Otherwise LGTM
sklearn/ensemble/_hist_gradient_boosting/tests/test_monotonic_contraints.py
…contraints.py
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
LGTM, modulo typo resolution
One thing related but also not: I don't know much about when you should or shouldn't specify this constraint, or what traps you might fall into. After reading the example, I now think that it makes sense to specify the constraint when you know that a feature value will increase/decrease with the target value. It is a way to add information to the model instead of the model having to discover this relationship itself.

A trap might be that if there is underlying structure to the general trend, then you might or might not want to specify the constraint. If the structure is noise, specify it. If the structure is real, don't specify it. The tricky thing, of course, is knowing which case it is (for real-world data). If you get it right, the performance of the model should improve.

In addition, there is a use case that is driven by "business decisions". Not sure I can cook up a realistic example on the spot. Maybe something like "houses with a bigger area should not be cheaper than ones with less land". Here you might decrease the performance, but you can natively include a constraint from the business side in your model.

Not sure if it is worth linking to a good guide about this from the docs. (New PR either way)
Co-authored-by: Tim Head <betatim@gmail.com>
I think this is the main use case for this feature: enforce some a priori defined business rules in the machine learning model's decisions. They might decrease (or not) the predictive accuracy a bit, but they might make the model compliant with regulations, for instance. Adding constraints can also act as a regularizer when labeled data is scarce: it could improve the test set accuracy if the training set is "noisy" and make the model more "robust" in a way.
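To make the housing example above concrete, here is a sketch (all data invented) of how such a business rule could be expressed with the dict form added in this PR:

import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = pd.DataFrame({"area": rng.uniform(20, 200, size=500)})
# Noisy prices: the underlying trend is increasing, but the noise is large.
y = 1_000 * X["area"] + 50_000 * rng.randn(500)

model = HistGradientBoostingRegressor(monotonic_cst={"area": 1}).fit(X, y)

# Predictions are now non-decreasing in "area", whatever the noise suggested.
grid = pd.DataFrame({"area": np.linspace(20, 200, num=50)})
preds = model.predict(grid)
assert np.all(np.diff(preds) >= 0)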
Merged! Thanks for the reviews.
Towards #24852.
TODO
I did not bother moving the MonotonicConstraint enum to the sklearn.utils.validation module. Not sure if I should do it or not. Maybe.