FEA Add Information Gain and Information Gain Ratio feature selection functions #28905
Conversation
Thanks for the PR @StefanieSenger. Would it make sense to add a test which compares the transformed X between the information gain and information gain ratio, since they should generally be the same?
I have added such a test @OmarManzoor, maybe it helps if one day someone works on the …
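For reference, a minimal sketch of what such a test could look like, assuming the new score functions are importable from sklearn.feature_selection as proposed in this PR; the synthetic data and the strict equality assertion are illustrative only and may need relaxing:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio

def test_info_gain_and_info_gain_ratio_select_same_features():
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X = (X > 0).astype(int)  # the scores operate on boolean/count features
    # Both score functions should generally rank features the same way,
    # so the transformed outputs are expected to match.
    X_ig = SelectKBest(score_func=info_gain, k=3).fit_transform(X, y)
    X_igr = SelectKBest(score_func=info_gain_ratio, k=3).fit_transform(X, y)
    np.testing.assert_array_equal(X_ig, X_igr)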
A few minor suggestions; otherwise this looks good. Thanks @StefanieSenger
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Nice, thank you @OmarManzoor
Just a couple of first comments to use scipy instead of our own implementation of the entropy or the KL divergence.
- |Feature| :func:`~feature_selection.info_gain` and
  :func:`~feature_selection.info_gain_ratio` can now be used for
  univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Suggested change:
-    univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
+    univariate feature selection.
+    :pr:`28905` by :user:`Viktor Pekar <vpekar>` and
+    :user:`Stefanie Senger <StefanieSenger>`.
def _get_entropy(prob):
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)
Nowadays, I think this is implemented in scipy.stats.entropy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html. The base here is set to 2 (I have to check if it makes sense or not).
I have substituted this function with one of the scipy entropy ones (scipy.special.entr()), though I need to admit I don't understand it and have just chosen the one that would not raise when running the tests.
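For what it's worth, a small sketch of the equivalence with a made-up probability vector: scipy.special.entr computes -p * ln(p) elementwise (and 0 where p == 0), so dividing by ln(2) recovers the base-2 values of the original helper.

import numpy as np
from scipy.special import entr

def _get_entropy(prob):
    # original helper: elementwise -p * log2(p), with the 0 * log2(0) case set to 0
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)

prob = np.array([0.5, 0.25, 0.25, 0.0])
with np.errstate(divide="ignore"):
    expected = _get_entropy(prob)
# entr uses the natural log, so divide by ln(2) to get the base-2 entropy terms
np.testing.assert_allclose(expected, entr(prob) / np.log(2))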
def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)
Supposedly this could be replaced by rel_entr from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.rel_entr.html#scipy.special.rel_entr. The difference is that we use log2 instead of the natural log (base e) used in the scipy definition. I have to check.
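A quick sanity check of that claim, with made-up inputs: rel_entr(a, b) is a * ln(a / b) elementwise (and 0 where a == 0), so dividing by ln(2) matches the base-2 helper above. Note the two only differ where b == 0, which rel_entr maps to inf while the helper silently maps it to 0.

import numpy as np
from scipy.special import rel_entr

def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)

a = np.array([0.2, 0.0, 0.5])
b = np.array([0.4, 0.3, 0.5])
np.testing.assert_allclose(_a_log_a_div_b(a, b), rel_entr(a, b) / np.log(2))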
So I assume that we could use the natural logarithm everywhere, because the result would only differ by a constant multiplier, and since we are only comparing the information everywhere, it should not matter.
Okay, we had talked about this. This time I found that scipy.special.rel_entr() was the one that had the same results as before.
c_prob = c_count / c_count.sum()
fc_prob = fc_count / total

c_f = _a_log_a_div_b(fc_prob, c_prob * f_prob)
To give an example regarding the base, here it would be equivalent to:
c_f = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)
Yes, that worked.
@@ -0,0 +1,115 @@
""" |
We will probably avoid adding a new example and should instead edit an existing one.
"""Count feature, class, joint and total frequencies | ||
|
||
Returns | ||
------- | ||
f_count : array, shape = (n_features,) | ||
c_count : array, shape = (n_classes,) | ||
fc_count : array, shape = (n_features, n_classes) | ||
total: int | ||
""" |
We will need a proper docstring following our new standards.
Even for private functions? I wonder, because many other private functions don't have anything similar to the numpy docstring style, which I think is what you are referring to(?).
I tried to make some improvements. Is there some test I can run to find out if it is enough? The CI didn't raise because of that.
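Just to make the request concrete, a hedged sketch of what a numpydoc-style docstring for the counting helper might look like; the parameter descriptions are my guesses based on the Returns section above, not the actual PR code:

def _get_fc_counts(X, y):
    """Count feature, class, joint and total frequencies.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Sample vectors with feature occurrence counts.

    y : array-like of shape (n_samples,)
        Target vector (class labels).

    Returns
    -------
    f_count : ndarray of shape (n_features,)
        Number of samples in which each feature occurs.

    c_count : ndarray of shape (n_classes,)
        Number of samples in each class.

    fc_count : ndarray of shape (n_features, n_classes)
        Joint feature/class occurrence counts.

    total : int
        Total number of samples.
    """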
    return np.asarray(scores).reshape(-1)


def _get_fc_counts(X, y):
Since this is called a single time, we should not need to have a function.
I actually like it, because having this function name gives meaning to this part of the code and structures _info_gain(). Maybe we should even rename it to avoid the unclear fc in its name. I will make a suggestion with my push.
Could you also imagine keeping this function?
with np.errstate(invalid="ignore", divide="ignore"):
    scores = scores / (_get_entropy(c_prob) + _get_entropy(1 - c_prob))

# the feature score is averaged over classes
I think the comment only applies to the first case.
True, I will delete it entirely. I think it's not really necessary to have it at all.
c_nf = _a_log_a_div_b((c_count - fc_count) / total, c_prob * (1 - f_prob))
nc_f = _a_log_a_div_b((f_count - fc_count) / total, (1 - c_prob) * f_prob)

scores = c_f + nc_nf + c_nf + nc_f
I think I would prefer _info_gain to return this score, have the ratio below done in info_gain_ratio, and finally have a function that could be called twice to just make the reduction.
def _info_gain(X, y):
    # probably the name of the function should be better.
    ...
    return scores, c_prob


def info_gain(X, y, aggregate=np.maximum):
    return aggregate.reduce(_info_gain(X, y)[0], axis=0)


def info_gain_ratio(X, y, aggregate=np.maximum):
    scores, c_prob = _info_gain(X, y)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores /= (entropy(c_prob) + entropy(1 - c_prob))
    return aggregate.reduce(scores, axis=0)
Thank you, @glemaitre, for your review and your explanations in the call. I have tried to address what we talked about. I will push the recent changes and try to continue understanding the rest.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
So I had a look at the original paper from the original PR (Machine Learning in Automatic Text …). It appears that the definition of Information Gain is also known as the expected mutual information. Thus, this corresponds to an implementation that we already have.

In the paper, they also refer to a Mutual Information that is really confusing: this definition is actually the pointwise mutual information. It differs from the expected mutual information because it only considers the event t = t_k.

So the bottom line is that we can close this PR and the stalled one, because it implements something that we already have. Something that I really struggled with, though, was the type of matrix …
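To illustrate the "expected mutual information" reading with a tiny, made-up example: for a single boolean feature, summing rel_entr over the joint versus product-of-marginals probabilities of the contingency table gives exactly the mutual information that sklearn.metrics.mutual_info_score computes (in nats; divide by ln(2) for bits). This is only a sketch of the equivalence, not the PR's code.

import numpy as np
from scipy.special import rel_entr
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
feature = rng.randint(0, 2, size=200)  # boolean feature occurrences
target = rng.randint(0, 2, size=200)   # class labels

# joint and product-of-marginals probabilities from the 2x2 contingency table
joint = np.histogram2d(feature, target, bins=2)[0] / len(feature)
outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)

emi_nats = rel_entr(joint, outer).sum()
np.testing.assert_allclose(emi_nats, mutual_info_score(feature, target))
print(emi_nats / np.log(2))  # the same quantity in bits (base 2)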
Hi all, sorry, I've just noticed the conversation on this PR. I suggest the proposed implementation of Information Gain be merged, for the following reasons: …
Reference Issues/PRs
closes #6534
What does this implement/fix? Explain your changes.
This 2016 PR intended to add info_gain and info_gain_ratio functions for univariate feature selection. Here, I update and finish it up. For further information, please refer to the discussion on the old PR.