FEA Add Information Gain and Information Gain Ratio feature selection functions #28905
Conversation
Thanks for the PR @StefanieSenger. Would it make sense to add a test which compares the transformed X between the information gain and information gain ratio, since they should generally be the same?
I have added such a test @OmarManzoor, maybe it helps if one day someone works on the …
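For reference, a minimal sketch of what such a test could look like, assuming the new score functions are importable from sklearn.feature_selection as proposed in this PR; the synthetic data and the strict equality assertion are illustrative only and may need relaxing:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, info_gain, info_gain_ratio

def test_info_gain_and_info_gain_ratio_select_same_features():
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X = (X > 0).astype(int)  # the scores operate on boolean/count features
    # Both score functions should generally rank features the same way,
    # so the transformed outputs are expected to match.
    X_ig = SelectKBest(score_func=info_gain, k=3).fit_transform(X, y)
    X_igr = SelectKBest(score_func=info_gain_ratio, k=3).fit_transform(X, y)
    np.testing.assert_array_equal(X_ig, X_igr)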
A few minor suggestions; otherwise this looks good. Thanks @StefanieSenger
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Nice, thank you @OmarManzoor
Just a couple of first comments to use scipy instead of our own implementation of the entropy or the KL divergence.
- |Feature| :func:`~feature_selection.info_gain` and
  :func:`~feature_selection.info_gain_ratio` can now be used for
  univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
Suggested change:
-    univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.
+    univariate feature selection.
+    :pr:`28905` by :user:`Viktor Pekar <vpekar>` and
+    :user:`Stefanie Senger <StefanieSenger>`.
def _get_entropy(prob):
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)
Nowadays, I think this is implemented in scipy.stats.entropy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html. The base here is set to 2 (I have to check if it makes sense or not).
I have substituted this function with one of the scipy entropy ones (scipy.special.entr()), though I need to admit I don't understand it and have just chosen the one that would not raise when running the tests.
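For what it's worth, a small sketch of the equivalence with a made-up probability vector: scipy.special.entr computes -p * ln(p) elementwise (and 0 where p == 0), so dividing by ln(2) recovers the base-2 values of the original helper.

import numpy as np
from scipy.special import entr

def _get_entropy(prob):
    # original helper: elementwise -p * log2(p), with the 0 * log2(0) case set to 0
    t = np.log2(prob)
    t[~np.isfinite(t)] = 0
    return np.multiply(-prob, t)

prob = np.array([0.5, 0.25, 0.25, 0.0])
with np.errstate(divide="ignore"):
    expected = _get_entropy(prob)
# entr uses the natural log, so divide by ln(2) to get the base-2 entropy terms
np.testing.assert_allclose(expected, entr(prob) / np.log(2))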
def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)
Supposedly this could be replaced by rel_entr from scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.rel_entr.html#scipy.special.rel_entr. The difference is that we use log2 instead of the natural log (base e) used in the scipy definition. I have to check.
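A quick sanity check of that claim, with made-up inputs: rel_entr(a, b) is a * ln(a / b) elementwise (and 0 where a == 0), so dividing by ln(2) matches the base-2 helper above. Note the two only differ where b == 0, which rel_entr maps to inf while the helper silently maps it to 0.

import numpy as np
from scipy.special import rel_entr

def _a_log_a_div_b(a, b):
    with np.errstate(invalid="ignore", divide="ignore"):
        t = np.log2(a / b)
        t[~np.isfinite(t)] = 0
        return np.multiply(a, t)

a = np.array([0.2, 0.0, 0.5])
b = np.array([0.4, 0.3, 0.5])
np.testing.assert_allclose(_a_log_a_div_b(a, b), rel_entr(a, b) / np.log(2))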
So I assume that we could use the natural logarithm everywhere, because the result would only differ by a constant multiplier, and since we are only comparing the information everywhere, it should not matter.
Okay, we had talked about this. This time I found that scipy.special.rel_entr() was the one that had the same results as before.
c_prob = c_count / c_count.sum()
fc_prob = fc_count / total

c_f = _a_log_a_div_b(fc_prob, c_prob * f_prob)
To give an example regarding the base, here it would be equivalent to:
c_f = rel_entr(fc_prob, c_prob * f_prob) / np.log(2)
Yes, that worked.
@@ -0,0 +1,115 @@
""" |
We will probably avoid adding a new example and should instead edit an existing one.
"""Count feature, class, joint and total frequencies | ||
|
||
Returns | ||
------- | ||
f_count : array, shape = (n_features,) | ||
c_count : array, shape = (n_classes,) | ||
fc_count : array, shape = (n_features, n_classes) | ||
total: int | ||
""" |
We will need a proper docstring following our new standards.
Even for private functions? I wonder, because many other private functions don't have anything similar to the numpy docstring style, which I think is what you are referring to(?).
I tried to make some improvements. Is there some test I can run to find out if it is enough? The CI didn't raise because of that.
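Just to make the request concrete, a hedged sketch of what a numpydoc-style docstring for the counting helper might look like; the parameter descriptions are my guesses based on the Returns section above, not the actual PR code:

def _get_fc_counts(X, y):
    """Count feature, class, joint and total frequencies.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Sample vectors with feature occurrence counts.

    y : array-like of shape (n_samples,)
        Target vector (class labels).

    Returns
    -------
    f_count : ndarray of shape (n_features,)
        Number of samples in which each feature occurs.

    c_count : ndarray of shape (n_classes,)
        Number of samples in each class.

    fc_count : ndarray of shape (n_features, n_classes)
        Joint feature/class occurrence counts.

    total : int
        Total number of samples.
    """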
    return np.asarray(scores).reshape(-1)


def _get_fc_counts(X, y):
Since this is called a single time, we should not need to have a function.
I actually like it, because having this function name gives meaning to this part of the code and structures _info_gain(). Maybe we should even rename it to avoid the unclear fc in its name. I will make a suggestion with my push.
Could you also imagine keeping this function?
with np.errstate(invalid="ignore", divide="ignore"):
    scores = scores / (_get_entropy(c_prob) + _get_entropy(1 - c_prob))

# the feature score is averaged over classes
I think the comment only applies to the first case.
True, I will delete it entirely. I think it's not really necessary to have it at all.
c_nf = _a_log_a_div_b((c_count - fc_count) / total, c_prob * (1 - f_prob))
nc_f = _a_log_a_div_b((f_count - fc_count) / total, (1 - c_prob) * f_prob)

scores = c_f + nc_nf + c_nf + nc_f
I think I would prefer _info_gain to return this score, have the ratio below done in info_gain_ratio, and finally have a function that could be called twice to just make the reduction.
def _info_gain(X, y):
    # probably the name of the function should be better.
    ...
    return scores, c_prob


def info_gain(X, y, aggregate=np.maximum):
    return aggregate.reduce(_info_gain(X, y)[0], axis=0)


def info_gain_ratio(X, y, aggregate=np.maximum):
    scores, c_prob = _info_gain(X, y)
    with np.errstate(invalid="ignore", divide="ignore"):
        scores /= (entropy(c_prob) + entropy(1 - c_prob))
    return aggregate.reduce(scores, axis=0)
Thank you, @glemaitre, for your review and your explanations in the call. I have tried to address what we talked about. I will push the recent changes and try to continue understanding the rest.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
So I had a look at the original paper from the original PR (Machine Learning in Automatic Text …). It appears that the definition of Information Gain is also known as the expected mutual information. Thus, this corresponds to an implementation that we already have.

In the paper, they also refer to a Mutual Information that is really confusing: this definition is actually the pointwise mutual information. It differs from the expected mutual information because it only considers the event t = t_k.

So the bottom line is that we can close this PR and the stalled one, because it implements something that we already have. Something that I really struggled with, though, was the type of matrix …
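To illustrate the "expected mutual information" reading with a tiny, made-up example: for a single boolean feature, summing rel_entr over the joint versus product-of-marginals probabilities of the contingency table gives exactly the mutual information that sklearn.metrics.mutual_info_score computes (in nats; divide by ln(2) for bits). This is only a sketch of the equivalence, not the PR's code.

import numpy as np
from scipy.special import rel_entr
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
feature = rng.randint(0, 2, size=200)  # boolean feature occurrences
target = rng.randint(0, 2, size=200)   # class labels

# joint and product-of-marginals probabilities from the 2x2 contingency table
joint = np.histogram2d(feature, target, bins=2)[0] / len(feature)
outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)

emi_nats = rel_entr(joint, outer).sum()
np.testing.assert_allclose(emi_nats, mutual_info_score(feature, target))
print(emi_nats / np.log(2))  # the same quantity in bits (base 2)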
Hi all, sorry, I've just noticed the conversation on this PR. I suggest the proposed implementation of Information Gain be merged, for the following reasons: …
Reference Issues/PRs
closes #6534
What does this implement/fix? Explain your changes.
This 2016 PR intended to add info_gain and info_gain_ratio functions for univariate feature selection. Here, I update and finish it up. For further information, please refer to the discussion on the old PR.