
Unexpected behavior of sklearn.feature_selection.mutual_info_regression if copy=False #28793

Open
aivarsoo opened this issue Apr 9, 2024 · 8 comments


aivarsoo commented Apr 9, 2024

Describe the bug

The copy parameter of the function mutual_info_regression is described as follows:

copy : bool, default=True
Whether to make a copy of the given data. If set to False, the initial
data will be overwritten.

I read it as: both X and y should be modified if copy=False and X has continuous features. However, y is always copied. I think the lines

    if not discrete_target:
        y = scale(y, with_mean=False)

should be

    if not discrete_target:
        y = y.astype(np.float64, copy=copy)
        y = scale(y, with_mean=False, copy=False)

similarly to the treatment of X:

    if np.any(continuous_mask):
        X = X.astype(np.float64, copy=copy)
        X[:, continuous_mask] = scale(
            X[:, continuous_mask], with_mean=False, copy=False
        )
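
For reference, scale itself honors copy=False and can operate in place, which is what the proposed change relies on. A minimal illustration (assuming a float64 input, so the astype conversion does not force a copy):

    import numpy as np
    from sklearn.preprocessing import scale

    a = np.arange(1.0, 5.0)                # float64, so no dtype-conversion copy is needed
    scale(a, with_mean=False, copy=False)  # may standardize `a` in place
    print(a)                               # `a` has been overwritten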

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_selection import mutual_info_regression

n_samples, n_feats = 30, 2
X = np.random.randn(n_samples, n_feats)
y = np.random.randn(n_samples)
y_copy = y.copy()
X_copy = X.copy()
mutual_info_regression(X, y, copy=False)
# Check whether y and X were left unchanged
print(np.allclose(y, y_copy), np.allclose(X, X_copy))

Expected Results

The result should be

False False

since both X and y should be modified in place by the function mutual_info_regression.

Actual Results

True False

Versions

System:
    python: 3.11.7 (main, Dec  5 2023, 19:13:35) [GCC 10.2.1 20210110]
executable: /usr/local/bin/python
   machine: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.3.0
          pip: 23.3.1
   setuptools: 69.0.2
        numpy: 1.23.2
        scipy: 1.12.0
       Cython: 3.0.9
       pandas: 2.1.4
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /home/user/.local/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
        version: 0.3.20
threading_layer: pthreads
   architecture: Prescott

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /home/user/.local/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Prescott

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libgomp
       filepath: /home/user/.local/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
@aivarsoo aivarsoo added Bug Needs Triage Issue requires triage labels Apr 9, 2024
lesteve (Member) commented Apr 10, 2024

It kind of reminds me of #27307 and #27691. In general, copy=False means "it may avoid a copy", but avoiding the copy is not always possible. In other words, copy=False may reduce memory usage, but the fact that the input array is modified should not be relied upon.
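
A minimal sketch of the safe pattern this implies (variable names are illustrative): copy the input yourself when you still need it after the call, instead of relying on what copy=False does to it.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

X = np.random.randn(30, 2)
y = np.random.randn(30)

# Work on an explicit copy: X_work may be overwritten, the original X stays intact.
X_work = X.copy()
mi = mutual_info_regression(X_work, y, copy=False)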

I think there are two complementary ways to improve the situation:

@lesteve lesteve added help wanted and removed Needs Triage Issue requires triage labels Apr 10, 2024
aivarsoo (Author) commented Apr 10, 2024

Regarding the first way: in this case, I guess the docstring should clarify that only X is affected by copy and that y is always copied, in addition to the issues with sparse matrices.

I am a bit biased here, as I want y to have the same behavior as X for my application, but I am trying to understand whether there is a reason to always copy y.

lesteve (Member) commented Apr 10, 2024

It would help a lot if you could explain your use case a bit 🙏. In particular, how does it affect you that y is copied but X is not when using copy=False?

aivarsoo (Author) commented

This issue may now go in a bit of a different direction :) Let me know if I should split it into a new one.

I am updating a third-party sklearn plug-in for feature selection, where we want to add support for feature selection using conditional mutual information, based on the paper. Note that unlike the paper itself, our solution will use a proprietary solver, but the plug-in itself is open-source.

Mathematically we will aim to solve

$$ \max\limits_{\delta_i} \sum\limits_{i=1}^n I(x_i; y) \delta_i + \sum\limits_{i, j=1}^n I(x_i; y | x_j) \delta_i \delta_j $$

where $x_i$ are the features, $y$ is the target variable, and $\delta_i \in \{0, 1\}$ are binary variables with the associated constraint $\sum_i \delta_i = K$ for a pre-defined value $K$. If $\delta_i = 1$, then feature $i$ is picked.

The conditional mutual information algorithm will mostly follow this implementation, which is in turn based on the sklearn implementation of mutual information estimation.

Essentially, I want to compute the values $I(x_i; y)$, which I can do using the public functions mutual_info_classif and mutual_info_regression from sklearn.feature_selection, or the private function _compute_mi from sklearn.feature_selection._mutual_info.

  1. If I use the public method mutual_info_regression, X and y will be rescaled if they correspond to data from a non-discrete distribution. I will then need to rescale y again for computing $I(x_i; y | x_j)$. If X and y had the same behavior, I could avoid rescaling y twice, and thus avoid essentially using different data y for computing $I(x_i; y)$ and $I(x_i; y | x_j)$.
  2. If we can make the method _compute_mi public, then I can use my custom function to compute the values $I(x_i; y | x_j)$ and $I(x_i; y)$ simultaneously.

The second option would also allow using the problem formulation described in the mRMR paper, similar to the existing issue #8889 but with our solver. I was actually considering filing the second option (making _compute_mi public) as an issue too, but decided to start with a smaller one.
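
For illustration only, here is a minimal greedy sketch of the selection scheme (a greedy approximation, not the exact binary program we would solve with the proprietary solver; estimate_conditional_mi is a hypothetical helper standing in for a conditional MI estimator and is not part of scikit-learn):

from sklearn.feature_selection import mutual_info_regression

def greedy_select(X, y, K, estimate_conditional_mi):
    # relevance[i] estimates I(x_i; y) for every feature i
    relevance = mutual_info_regression(X, y)
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) < K:
        # Marginal gain of adding feature i given the features picked so far;
        # estimate_conditional_mi(X, y, i, j) estimates I(x_i; y | x_j).
        def gain(i):
            return relevance[i] + sum(
                estimate_conditional_mi(X, y, i, j) for j in selected
            )
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected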

I am happy to submit a PR on some of these issues.

lesteve (Member) commented Apr 11, 2024

Thanks for the details!

So my understanding is that you do something like this:

mutual_info_1 = mutual_info_regression(X, y, copy=False)
# make assumptions on how X and y have been modified and call `mutual_info_regression` again
mutual_info_2 = mutual_info_regression(..., copy=False)

How much do you care about using copy=False? In other words, does using copy=True have a big effect in your use case? One suggestion would be to use copy=True in the first mutual_info_regression call and then copy=False in the second one (if that really makes a difference), as sketched below.
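
A sketch of that suggestion (assuming the original X and y are not needed after the second call):

# First call: copy=True leaves X and y untouched for later reuse.
mutual_info_1 = mutual_info_regression(X, y, copy=True)
# Second call: copy=False may avoid a copy; X and y must not be reused afterwards.
mutual_info_2 = mutual_info_regression(X, y, copy=False)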

My current understanding, which may not be fully accurate: when using copy=False there is no guarantee on how your input values may change; basically, you may avoid a copy at the cost of not being able to use your input array after the function call.

Even if copy=False makes a big difference in your use case, I would rather rely on a private function than on how X and y are modified when using copy=False. I don't think there is any guarantee on this, and it may change without warning; at least if we rename or remove _compute_mi, you will get an error.

Then there is the question of making _compute_mi public. I am not very familiar with this part of scikit-learn, so I don't have an informed opinion. One thing to keep in mind is that an additional public function could confuse users and/or lead them away from the correct public function.

aivarsoo (Author) commented

Thanks for the response!

An example would be something like

# We rescale X and y in the sklearn method computing I(x_i; y) for all i:
mutual_info = mutual_info_regression(X, y, copy=False)
# We reuse the rescaled X and y in our method computing I(x_i; y | x_j)
# for all i and j; we do not modify X or y here
cond_mutual_info = conditional_mutual_info(X, y)

Note that the mutual_info methods rescale and add noise to X and y, which I wouldn't want to do several times. I also want to use the same X and y for all the computations/estimations.

But if I understood you correctly, using copy=False doesn't give me many guarantees on what will happen to X after the call to this method, and I shouldn't rely on this behavior? If so, I may need to look for a different solution anyway.

On another note, I think importing the private sklearn method wouldn't work in the long run for our plug-in.

lesteve (Member) commented Apr 11, 2024

But if I understood you correctly, using copy=False doesn't give me many guarantees on what will happen to X after the call to this method, and I shouldn't rely on this behavior?

Yes. I asked other maintainers for input, and basically there seems to be agreement on this: if you use sklearn_func(X, copy=False), don't reuse X, because there is no guarantee on how its values have changed after the sklearn_func call.

I am going to label this issue as Documentation, because I think the most reasonable thing to do would be to document this in the glossary. I guess adding a link to the glossary entry in all the functions/classes that use copy may be a good idea too, but that is quite some work.

@lesteve lesteve added Documentation and removed Bug labels Apr 11, 2024
aivarsoo (Author) commented

Many thanks!
