Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Add support for feature names in monotonic_cst #24855

Merged
merged 26 commits into from Nov 15, 2022

Conversation

ogrisel
Copy link
Member

@ogrisel ogrisel commented Nov 7, 2022

Towards #24852.

TODO

  • update all the docstrings of the public API
  • add tests
  • update an example
  • changelog

I did not bother moving the MonotonicConstraint enum to the sklearn.utils.validation module. Not sure if I should do it or not. Maybe.

@@ -28,7 +28,7 @@

rng = np.random.RandomState(0)

n_samples = 5000
n_samples = 1000
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reduce the number of samples to make the plot less crowded while conveying the same intuitions and furthermore making the example run faster.

@ogrisel
Copy link
Member Author

ogrisel commented Nov 10, 2022

I did not bother moving the MonotonicConstraint enum to the sklearn.utils.validation module. Not sure if I should do it or not. Maybe.

Let's keep this PR focused for now.

Follow-up PR(s) should probably:

  • move some tests to an estimator agnostic checks for (both for input data errors and monotonicity on random data);
  • move the user guide do an estimator agnostic section on monotonic_cst with cross-linking between estimators that support this parameter;
  • use the MonotonicConstraint enum in those estimators.

@ogrisel ogrisel marked this pull request as ready for review November 10, 2022 15:21

- 1: monotonic increase
- 0: no constraint
- -1: monotonic decrease
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to remove the indentation of the bullet list to avoid a warning for the old version of sphinx...

@ogrisel ogrisel added this to the 1.2 milestone Nov 10, 2022
@ogrisel ogrisel added the Quick Review For PRs that are quick to review label Nov 10, 2022
Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo a few unitary negative and positive reviews comments, ahem.

doc/whats_new/v1.2.rst Outdated Show resolved Hide resolved
examples/ensemble/plot_monotonic_constraints.py Outdated Show resolved Hide resolved
examples/ensemble/plot_monotonic_constraints.py Outdated Show resolved Hide resolved
examples/ensemble/plot_monotonic_constraints.py Outdated Show resolved Hide resolved
examples/ensemble/plot_monotonic_constraints.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
f"monotonic_cst has shape {monotonic_cst.shape} but the input data "
f"X has {estimator.n_features_in_} features."
)
unexpected_cst = np.setdiff1d(monotonic_cst, [-1, 0, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL that numpy.setdiff1d is a thing!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew about it but I must admit that github copilot suggested it to me :) Using explicit variable names such as unexpected_cst makes it very smart.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I knew it, Olivier is a robot! One powered by copilot ;))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scikit-learn 2.0.0: Human Learning in Python?

sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/validation.py Show resolved Hide resolved
sklearn/ensemble/_hist_gradient_boosting/grower.py Outdated Show resolved Hide resolved
Comment on lines 1240 to 1241
If a dict with str keys, map feature to monotonic constraints by name.
If an array, the feature are mapped to constraints by position.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about linking to the subsection in the example from this PR that uses a dictionary for monotonic constraints?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you do so? with a restructured text reference anchor in the "markdown" cell just before the final code snippet?

Copy link
Member Author

@ogrisel ogrisel Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1925f0b and fa814d0.


If a dict with str keys, map feature to monotonic constraints by name.
If an array, the feature are mapped to constraints by position. See
:ref:`monotonic_cst_features_names` for a usage example.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

ogrisel and others added 2 commits November 14, 2022 09:12
@jjerphan jjerphan changed the title Add support for feature names in monotonic_cst ENH Add support for feature names in monotonic_cst Nov 14, 2022
Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you, @ogrisel. To me, this is a notable UX enhancement.

Since @betatim did a review, I leave him re-review and approve this PR before merging it.

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo typo resolution

@betatim
Copy link
Member

betatim commented Nov 14, 2022

One thing related but also not: I don't know much about when you should or shouldn't specify this constraint or what the traps are you might fall in to. After reading the example I now think that it makes sense to specify the constraint when you know that a feature value will increase/decrease with the target value. It is a way to add information to the model instead of the model having to discover this relationship itself.

A trap might be that if there is underlying structure to the general trend then you might or might not want to specify the constraint. If the structure is noise, specify it. If the structure is real, don't specify it. The tricky thing is of course knowing which case it is (for real world data). If you get it right the performance of the model should improve

In addition, there is a use-case that is driven by "business decisions". Not sure I can cook up a realistic example on the spot. Maybe something like "houses with bigger area should not be cheaper than ones with less land". Here you might decrease the performance, but you can natively include a constraint from the business side in your model.

Not sure if it is worth linking to a good guide about this from the docs. (New PR either way)

@jjerphan jjerphan removed the Quick Review For PRs that are quick to review label Nov 14, 2022
Co-authored-by: Tim Head <betatim@gmail.com>
@ogrisel
Copy link
Member Author

ogrisel commented Nov 15, 2022

In addition, there is a use-case that is driven by "business decisions". Not sure I can cook up a realistic example on the spot. Maybe something like "houses with bigger area should not be cheaper than ones with less land". Here you might decrease the performance, but you can natively include a constraint from the business side in your model.

I think this is the main use case for this feature: enforce some a-priori defined business rules into the machine learning model decisions. They might decrease (or not) the predictive accuracy a bit but they might make the model compliant with regulations for instance.

Adding constraints can also act as a regularizer when labeled data is scarce and could improve the test set accuracy if the training set is "noisy" and make the model more "robust" in a way.

@ogrisel ogrisel merged commit 74ddf01 into scikit-learn:main Nov 15, 2022
@ogrisel ogrisel deleted the monotonic_cst-feature-names branch November 15, 2022 15:32
@ogrisel
Copy link
Member Author

ogrisel commented Nov 15, 2022

Merged! Thanks for the reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants