
HistGradientBoostingClassifier raises an error with monotonic constraints and categorical features #28898

Closed
cespeleta opened this issue Apr 26, 2024 · 3 comments · Fixed by #28925


cespeleta commented Apr 26, 2024

Describe the bug

Creating a HistGradientBoostingClassifier with both monotonic_cst and categorical_features is not possible because it throws an error, even though the constrained feature ("age") is numeric and is not one of the categorical features.

Steps/Code to Reproduce

```python
from sklearn.datasets import fetch_openml
from sklearn.ensemble import HistGradientBoostingClassifier

X_adult, y_adult = fetch_openml("adult", version=2, return_X_y=True)
X_adult = X_adult[["age", "workclass", "education"]]
print(X_adult.dtypes)
# age             int64
# workclass    category
# education    category
# dtype: object

hist = HistGradientBoostingClassifier(
    monotonic_cst={"age": 1}, categorical_features="from_dtype"
)
hist.fit(X_adult, y_adult)
# ValueError: Categorical features cannot have monotonic constraints.

hist = HistGradientBoostingClassifier(
    monotonic_cst={"age": 1}, categorical_features=["workclass", "education"]
)
hist.fit(X_adult, y_adult)
# ValueError: Categorical features cannot have monotonic constraints.
```


### Expected Results

The expected result is a fitted model.

### Actual Results

```python
{
    "name": "ValueError",
    "message": "Categorical features cannot have monotonic constraints.",
    "stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[13], line 10
      6 print(X_adult.dtypes)
      7 hist = HistGradientBoostingClassifier(
      8     monotonic_cst={\"age\": 1}, categorical_features=\"from_dtype\"
      9 )
---> 10 hist.fit(X_adult, y_adult)
     12 hist = HistGradientBoostingClassifier(
     13     monotonic_cst={\"age\": 1}, categorical_features=[\"workclass\", \"education\"]
     14 )
     15 hist.fit(X_adult, y_adult)

File ~/Projects/your_project/.venv/lib/python3.10/site-packages/sklearn/base.py:1474, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1467     estimator._validate_params()
   1469 with config_context(
   1470     skip_parameter_validation=(
   1471         prefer_skip_nested_validation or global_skip_validation
   1472     )
   1473 ):
-> 1474     return fit_method(estimator, *args, **kwargs)

File ~/Projects/your_project/.venv/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:889, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
    887 # Build `n_trees_per_iteration` trees.
    888 for k in range(self.n_trees_per_iteration_):
--> 889     grower = TreeGrower(
    890         X_binned=X_binned_train,
    891         gradients=g_view[:, k],
    892         hessians=h_view[:, k],
    893         n_bins=n_bins,
    894         n_bins_non_missing=self._bin_mapper.n_bins_non_missing_,
    895         has_missing_values=has_missing_values,
    896         is_categorical=self._is_categorical_remapped,
    897         monotonic_cst=monotonic_cst,
    898         interaction_cst=interaction_cst,
    899         max_leaf_nodes=self.max_leaf_nodes,
    900         max_depth=self.max_depth,
    901         min_samples_leaf=self.min_samples_leaf,
    902         l2_regularization=self.l2_regularization,
    903         feature_fraction_per_split=self.max_features,
    904         rng=self._feature_subsample_rng,
    905         shrinkage=self.learning_rate,
    906         n_threads=n_threads,
    907     )
    908     grower.grow()
    910     acc_apply_split_time += grower.total_apply_split_time

File ~/Projects/your_project/.venv/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/grower.py:300, in TreeGrower.__init__(self, X_binned, gradients, hessians, max_leaf_nodes, max_depth, min_samples_leaf, min_gain_to_split, min_hessian_to_split, n_bins, n_bins_non_missing, has_missing_values, is_categorical, monotonic_cst, interaction_cst, l2_regularization, feature_fraction_per_split, rng, shrinkage, n_threads)
    293     is_categorical = np.asarray(is_categorical, dtype=np.uint8)
    295 if np.any(
    296     np.logical_and(
    297         is_categorical == 1, monotonic_cst != MonotonicConstraint.NO_CST
    298     )
    299 ):
--> 300     raise ValueError(\"Categorical features cannot have monotonic constraints.\")
    302 hessians_are_constant = hessians.shape[0] == 1
    303 self.histogram_builder = HistogramBuilder(
    304     X_binned, n_bins, gradients, hessians, hessians_are_constant, n_threads
    305 )

ValueError: Categorical features cannot have monotonic constraints."
}
```

Versions

```
System:
    python: 3.10.13 (main, Apr  9 2024, 09:36:37) [Clang 15.0.0 (clang-1500.3.9.4)]
executable: /Users/user/Projects/your-project/.venv/bin/python
   machine: macOS-14.4.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.4.2
          pip: 24.0
   setuptools: 69.2.0
        numpy: 1.26.4
        scipy: 1.13.0
       Cython: None
       pandas: 2.2.1
   matplotlib: 3.8.4
       joblib: 1.4.0
threadpoolctl: 3.4.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 10
         prefix: libomp
       filepath: /Users/user/Projects/your-project/.venv/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/user/Projects/your-project/.venv/lib/python3.10/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: blas
   internal_api: openblas
    num_threads: 10
         prefix: libopenblas
       filepath: /Users/user/Projects/your-project/.venv/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.26.dev
threading_layer: pthreads
   architecture: neoversen1
```
@cespeleta cespeleta added Bug Needs Triage Issue requires triage labels Apr 26, 2024
@ogrisel ogrisel removed the Needs Triage Issue requires triage label Apr 29, 2024
ogrisel (Member) commented Apr 29, 2024

Thanks for the report. I confirm I can reproduce locally on the dev branch using the provided reproducer.

@cespeleta would you be interested in submitting a PR to fix this (along with a non-regression test that does not require a network connection)?

@cespeleta (Author)

Hi @ogrisel, although it would be interesting for me to dig into the problem, I don't have enough time right now to look into it.

yuanx749 (Contributor) commented May 1, 2024

I dug into it a bit. The problem is that during fitting, the internal ColumnTransformer places the categorical features at the beginning of X, but monotonic_cst still follows the original order of the features. I think we can remap monotonic_cst to fix the issue, similar to self._is_categorical_remapped.
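The remapping idea can be sketched in plain NumPy. The function name and the `permutation` argument below are hypothetical illustrations of the concept, not scikit-learn's actual internal API:

```python
import numpy as np

def remap_monotonic_cst(monotonic_cst, permutation):
    """Reorder per-feature monotonic constraints to match the column
    order produced by preprocessing, which moves categorical features
    to the front of X.

    permutation[i] is the original index of the feature that ends up
    at position i after preprocessing.
    """
    monotonic_cst = np.asarray(monotonic_cst)
    return monotonic_cst[np.asarray(permutation)]

# Features originally ordered [age, workclass, education]; the
# preprocessor moves the two categorical columns first.
original_cst = [1, 0, 0]   # +1 constraint on "age" only
permutation = [1, 2, 0]    # workclass, education, age
remapped = remap_monotonic_cst(original_cst, permutation)
print(remapped)  # [0 0 1]
```

With the remapped array, the constraint stays attached to the numeric "age" column in its new position, so the categorical-vs-monotonic check in TreeGrower no longer fires spuriously.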
