
[MRG] Regression metrics return arrays for multi-output cases #2493

Closed
wants to merge 14 commits

Conversation

MechCoder
Member

Hi, I tried to solve issue #2200. It now returns arrays for multi-output cases.

  • Consensus on what the default argument to output_weights should be: micro or macro averaging.
  • Whether the keyword should be output_weights or multi_output, as @GaelVaroquaux has suggested.

Once there is a consensus on these two things, this can be merged.

@MechCoder
Member Author

Hi @mblondel, @arjoly, would you be able to review this in your free time?

@arjoly
Member

arjoly commented Oct 6, 2013

You need to write tests for your new functionality; see sklearn/metrics/tests/test_metrics.py. For an explanation of how invariance testing works in test_metrics, see the doc added in #2460.

You also have some Travis failures.

@MechCoder
Member Author

The Travis failures seem to be due to the doctests that I've added.

For example:

>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred, average=False)
array([ 0.5,  1. ])

However, in the doctest I wrote array([0.5, 1]) (without the spaces in between). I'm a beginner and I'm curious as to why this causes an error.
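
Editorial note, not part of the original thread: doctest compares the captured output against the expected text in the docstring character by character, so the expected line has to reproduce numpy's repr exactly, including its spacing. A minimal sketch, assuming the legacy numpy print style of that era:

import numpy as np

a = np.array([0.5, 1.0])
print(repr(a))
# With the numpy print style of that era this prints "array([ 0.5,  1. ])".
# doctest compares that string literally against the docstring, so writing
# "array([0.5, 1.])" as the expected output makes the doctest fail.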

@@ -1894,10 +1894,14 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : True or False
Default value is True. If False, returns an array (multi-output)
Member

Nitpick: I would write the sentence "If True, ..., if False, ... (default: True)"

@MechCoder
Member Author

@mblondel, @arjoly: do you have any more comments on this PR? Any further comments are welcome :)

@MechCoder
Member Author

@mblondel: I'm sorry, but I'm a bit new to machine learning. I have two queries.

1. Could you give me an example of what macro actually means? I understand that micro implies you flatten the resulting array across one dimension.

2. Also, does this mean that for now, average can have two values, 'micro' and False?

@mblondel
Member

mblondel commented Oct 7, 2013

Could you give me an example of what macro actually means? I understand that micro implies you flatten the resulting array across one dimension.

macro is the average of the array obtained with average=False.

Also, does this mean that for now, average can have two values, 'micro' and False?

Yep, otherwise your PR will take too long to merge. Well-focused PRs have shorter review cycles :)
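
Editorial aside, not from the thread: a minimal numpy sketch of the micro/macro distinction for mean_absolute_error, reusing the values from the doctest earlier in the thread:

import numpy as np

y_true = np.array([[0.5, 1], [-1, 1], [7, -6]], dtype=float)
y_pred = np.array([[0, 2], [-1, 2], [8, -5]], dtype=float)

per_output = np.abs(y_true - y_pred).mean(axis=0)  # what average=False returns: array([0.5, 1.])
micro = np.abs(y_true - y_pred).mean()             # flatten outputs and samples together: 0.75
macro = per_output.mean()                          # mean of the per-output scores: 0.75

For MAE the micro and macro averages coincide whenever every output has the same number of samples, which is part of the argument made further down that the average keyword adds little for these metrics.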

@MechCoder
Member Author

I suppose it's good to go now?

@@ -1894,10 +1894,16 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : 'micro' or False
Member

To be consistent with the other metrics, it should be None instead of False; see precision_score for instance.

@MechCoder
Member Author

I have made the changes you asked for. Thanks :)

@MechCoder
Member Author

@arjoly, @mblondel: is it good to go now?

if average:
    return 1.0
else:
    return np.ones(y_true.shape[1], dtype=np.float64)
Member

I think there is a little mistake here, since you can have a defined r2_score for some outputs, but not all.

A test would be needed for that case.

Member Author

Would something like this do?

y_true = [[1, 1], [1, 1]]
y_pred = [[1, 1], [1, 1]]
assert_array_equal(r2_score(y_true, y_pred, average=None), np.array([1., 1.]))

Member

Consider for instance,

In [2]: from sklearn.metrics import r2_score

In [3]: r2_score([0, 0], [2, 1])
Out[3]: 0.0

In [4]: r2_score([-1, 1], [2, 1])
Out[4]: -3.5

I expect r2_score([[0, -1],[0, 1]], [[2, 2],[1, 1]], average=None) to be equal to np.array([0, -3.5]).

Member Author

Ah okay.

@arjoly
Member

arjoly commented Oct 8, 2013

You need to handle the case where average is not None or 'micro'.

@@ -1894,10 +1894,16 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : 'micro' or None
Member

This would be prettier:
average : string, ['micro' (default), None]

@MechCoder
Member Author

You need to handle the case where average is not None or 'micro'

By raising an error?

@arjoly
Member

arjoly commented Oct 8, 2013

By raising an error?

Yes

Looking more into mean_absolute_error and mean_squared_error, I think we want something like weight_output and not something like averaging.

@MechCoder
Member Author

@arjoly, @mblondel: I did it using enumerate; I couldn't think of a more efficient numpy way of doing it, either with np.where or otherwise.

@arjoly
Member

arjoly commented Oct 9, 2013

@arjoly, @mblondel: I did it using enumerate; I couldn't think of a more efficient numpy way of doing it, either with np.where or otherwise.

I don't have time to look at this in more detail, but I think it could be done with numpy operations.
See how a 0 denominator is handled in the precision_recall_fscore function.
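
For reference, an editorial sketch (not code from the PR) of the vectorized approach being hinted at; r2_per_output is a hypothetical helper, and the zero-denominator convention follows the expected values in the review example above:

import numpy as np

def r2_per_output(y_true, y_pred):
    # Per-output R^2 without Python-level loops over the outputs.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    scores = np.ones(y_true.shape[1], dtype=np.float64)  # perfect fit where both sums are zero
    nonzero = ss_tot != 0
    scores[nonzero] = 1.0 - ss_res[nonzero] / ss_tot[nonzero]
    scores[~nonzero & (ss_res != 0)] = 0.0               # undefined R^2 (constant y_true) mapped to 0.0
    return scores

r2_per_output([[0, -1], [0, 1]], [[2, 2], [1, 1]])        # -> array([ 0. , -3.5]), matching the example above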

@coveralls

Coverage Status

Changes Unknown when pulling 097b62272805db5e0b93e3b2caa8690c7ba94c5a on Manoj-Kumar-S:metrics into scikit-learn:master.

@MechCoder
Member Author

@@ -1894,10 +1894,13 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : string, ['micro' (default), None]
Member

I'm still not convinced this is the way to go, since mean_absolute_error(y_true, y_pred, average="micro") == mean_absolute_error(y_true, y_pred, average="macro") == mean_absolute_error(y_true, y_pred, average="samples"). What is your opinion, @mblondel?

@MechCoder
Member Author

@arjoly: Sorry for disturbing you, but are there any updates? What values do you think should be supplied to average?

@arjoly
Member

arjoly commented Oct 14, 2013

@arjoly: Sorry for disturbing you, but are there any updates? What values do you think should be supplied to average?

In order to add your new feature, I would add an output_weights argument with the options 'uniform' (current behaviour) or None (no averaging; return the statistic for each output). Thus, the computation of mean_absolute_error, mean_squared_error and r2_score would be based on the weighted norm

||y_true - y_pred||_W = sum_j W_j * ||y_true[:, j] - y_pred[:, j]||

where W would be the output_weights vector. So the 'uniform' argument would mean that output_weights = np.ones(...).

I wouldn't use the average keyword since it's more adapted to the extension of binary measures. Later, or in this PR, a custom output_weights could be provided by the user. What is your opinion, @manoj-kumar-s?
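
To make the proposal concrete, here is an editorial sketch of the suggested API for mean_absolute_error; weighted_mae is a hypothetical function, not the PR's implementation, and the error for unknown string values follows the earlier review request:

import numpy as np

def weighted_mae(y_true, y_pred, output_weights='uniform'):
    # Per-output MAE combined through the output_weights vector W.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_output = np.abs(y_true - y_pred).mean(axis=0)
    if output_weights is None:                  # no averaging: one score per output
        return per_output
    if isinstance(output_weights, str):
        if output_weights != 'uniform':
            raise ValueError("output_weights must be 'uniform', None or an array")
        output_weights = np.ones(per_output.shape[0])   # 'uniform' == current behaviour
    return np.average(per_output, weights=output_weights)

weighted_mae([[0.5, 1], [-1, 1], [7, -6]], [[0, 2], [-1, 2], [8, -5]], output_weights=None)
# -> array([0.5, 1. ]); with the default 'uniform' it returns 0.75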

@MechCoder
Member Author

Ah okay, I suppose the predicted values that you get out of your model would be automatically scaled.

@eickenberg
Contributor

Indeed, your estimator/predictor should adapt to the scaled input data and the outputs will automatically scale as well.

If you write a scoring metric that normalizes y_true and y_pred and sticks them into r**2, then, if my arithmetic is correct, you should get something like 2 * corr(y_true, y_pred) - 1, give or take.

Now take ridge regression as an example. If you penalize too hard, r ** 2 will tend to 0, because your coefficients will become too small and you will find yourself on the wrong scale. Not so with correlation, or with r ** 2 on normalized entries: if you use r ** 2 on normalized predictions, you will get a much better score, since you will be able to compensate partially for the squeezing done by the penalty. This is not in the spirit of the r ** 2 score.

My main point, though, was that your example, even though you scale predictions, corroborates the fact that after normalization of the targets, the different scoring schemes become equivalent.
(And this will still be the case if you use an actual estimator and do not normalize predictions!)

The R ** 2 from the Kolar & Xing 2010 paper is only useful in the case where all target variances are the same. But in that case the scoring comes down to doing macro r ** 2 using a mean. So macro r ** 2 using an arithmetic mean is a more general way of scoring, and since scikit-learn tries to be as general as possible in the functionality it offers, I would definitely go with the macro r ** 2 mean.
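
A quick numerical check of this argument (editorial, not from the thread; the 0.1 factor is an arbitrary stand-in for the shrinkage of an over-penalized ridge fit):

import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randn(1000)
y_pred = 0.1 * y_true                             # heavily shrunken, "over-penalized" predictions

def r2(t, p):
    return 1.0 - ((t - p) ** 2).sum() / ((t - t.mean()) ** 2).sum()

def zscore(x):
    return (x - x.mean()) / x.std()

corr = np.corrcoef(y_true, y_pred)[0, 1]          # 1.0: shrinkage leaves the correlation intact
plain_r2 = r2(y_true, y_pred)                     # ~0.19: r**2 is punished for being on the wrong scale
normed_r2 = r2(zscore(y_true), zscore(y_pred))    # 1.0, i.e. 2 * corr - 1 as claimed above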

@MechCoder
Member Author

Thanks for the clarification, and do we assume the user inputs normalized data as well?

@eickenberg
Contributor

Taking the (possibly weighted) mean of the individual separate r**2 scores on all targets (i.e. "macro-averaging") would allow us to be agnostic as to how the individual targets are scaled at the input level.

@MechCoder
Member Author

@eickenberg: Sorry for being naive again, but according to the example that I gave, the macro-average is 0.72; however, for the normalized input I get 0.75 as the result.

@eickenberg
Contributor

@manoj-kumar-s this is one of the problems with normalizing the predictions.

To bring your normalization closer to the real-world situation I described earlier, try normalizing your predictions using the mean and stdev of y_true instead of those of y_pred. In this case you will find equality (the normalized-input r_2 will go down to .72 as well). (To be fully convincing one would need to work with an actual estimator, but the scaling by y_true is what a real linear estimator would undergo as well.)

@MechCoder
Member Author

@eickenberg: Thanks again :). @ogrisel: could you merge this if you have no other comments?

@ogrisel
Member

ogrisel commented Oct 25, 2013

I won't merge if there are not at least two +1s. Furthermore, we should not merge API evolutions if we are not 100% confident we have the right design, as users will be very annoyed if we change our mind again later and break their programs when upgrading scikit-learn. I am still not sure myself what default averaging strategy we should implement, and whether we should try to make it possible to implement non-macro-averaging-based strategies without breaking backward compat again in the future.

@manoj-kumar-s if you are tired of this discussion then please just say so and someone else will take over from there (you can mute the notifications from this specific PR by clicking on the button at the end of the page).

@MechCoder
Member Author

@ogrisel: Could you please have a look at the example that I provided, and at @eickenberg's comments in #2493 (comment) as well? It does seem that the macro-averaged r2_score is the way to go.

No, I am not tired of this discussion. I've learnt a lot already and I'm looking forward to learning more from other issues as well. :)

@MechCoder
Member Author

I think I shall just wait till we come to an agreement on this.

@MechCoder
Member Author

@arjoly, @mblondel, @eickenberg, @jaquesgrobler: Thanks all for your help on this PR, and I'm sorry to be bugging you again, but could you please give a final +1 or -1 on the averaging done in this PR (based on the example that I've given and #2493 (comment)), so that I can leave this PR for now and start looking at other issues. :)

@mblondel
Member

My opinion hasn't changed: I vote for macro-average in all metrics.

Now regarding MSE, I see several options:

  1. in the docstring, recommend to normalize y_true prior to fitting an estimator

  2. add a normalize option which normalizes y_true and apply the same normalization to y_test (but ideally the user should do it prior to fitting an estimator, hence the doc)

  3. detect whether y_true has been normalized and issue a warning if not

1) should be done for sure; I would like to hear opinions about 2) and 3).

@GaelVaroquaux
Member

If we don't plan to implement any other weighting schemes in the future (arguably only the 'uniform' / macro-averaged weighting seems standard for regression) I would rather just replace that option by a simple boolean flag, for instance named average_outputs=True

I made the exact same comment a while ago, and this is still how I feel.
This evokes a YAGNI feeling.

@GaelVaroquaux
Member

I think I will go ahead with making output_weights accept an array of user-provided weights, if it is OK with you :)

Sounds good!

@GaelVaroquaux
Member

but could you give a final +1 or -1 on the averaging done in this PR

I do not think that I can give a qualified answer, as I don't have these use cases, so I won't vote, sorry :$

@eickenberg
Contributor

As far as I understand, the motivation for returning one single value out of the metric function is to have a quantity that is totally orderable, such that a parameter grid search may choose a best parameter based on this score.

Speaking from a setting where the number of targets is potentially vastly higher than the number of samples or features (or both taken together, if that quantity is in any way useful), the hope of being able to tune one, two or three parameters in a grid search and find a satisfactory optimum for all the targets seems a rather naive one to me. Probably more often than not, the number of parameters will scale (hopefully only) linearly with the number of targets involved. And with a little luck, the parameter grid search can be performed almost independently per target.

So if ensuring total orderability is the only reason for condensing a vast number of scores into one, it may be worth discussing accepting arrays of scores by default (and deleting the isinstance numbers.Number clause in cross_val_score).

Of course my use case is specific, and the number of targets can be a small constant with respect to the other dimensions. In that case, condensing several r^2 scores into one may not lead to too grave a loss of information, and the arithmetic mean is maybe the simplest useful option.

@ogrisel
Member

ogrisel commented Oct 30, 2013

My opinion hasn't changed: I vote for macro-average in all metrics.
Now regarding MSE, I see several options:

  1. in the docstring, recommend to normalize y_true prior to fitting an estimator

+1. Same comment to be added in the docstring for MAE.

@ogrisel
Member

ogrisel commented Oct 30, 2013

Trying to detect and warn about un-normalized y_true sounds unstable, as the normalization will be tuned on the training set and y_true will most likely stem from a validation or testing set where the normalization will not be perfect. So -1 on option 3).

Option 2) (normalizing the output of the estimator) sounds like a non-principled hack so -1 as well.

@mblondel
Member

Speaking from a setting where the number of targets is potentially vastly higher than the number of samples or features (or both taken together, if that quantity is in any way useful), the hope of being able to tune one, two or three parameters in a grid search and find a satisfactory optimum for all the targets seems a rather naive one to me. Probably more often than not, the number of parameters will scale (hopefully only) linearly with the number of targets involved. And with a little luck, the parameter grid search can be performed almost independently per target.

In my experience, the more tasks you have, the more you would benefit from learning shared parameters because you don't have enough data to estimate many parameters. Moreover, multi-task algorithms usually have shared hyper-parameters.

Also technically if you care about tuning hyper-parameters on a per-task basis, then you don't even need multi-output metrics: you just need to fit one estimator per task/output and use regular metrics in grid search.

@MechCoder
Member Author

I've added a line in the docstring to advise the user to normalize y_true.

@coveralls

Coverage Status

Coverage remained the same when pulling b56ef15 on Manoj-Kumar-S:metrics into d82cf06 on scikit-learn:master.

@arjoly
Member

arjoly commented Jul 19, 2014

During the sprint, we (me, @eickenberg and @MechCoder) discussed the blocking points of this pull request. It turns out the difference between macro-averaging and the current implementation could be resolved by using output_weights properly.

The macro r2 / macro explained variance correspond to a uniform output_weight (= 1 / n_outputs), while the current version uses an output_weight proportional to the fraction of variance explained by each output.

Thus we decided to keep both versions. I am also fine with changing the default to macro.
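
An editorial sketch of the two weightings being kept, reading "proportional to the fraction of variance explained by each output" as weighting each output by its total sum of squares (that reading is my assumption):

import numpy as np

y_true = np.array([[0.5, 1], [-1, 1], [7, -6]], dtype=float)
y_pred = np.array([[0, 2], [-1, 2], [8, -5]], dtype=float)

ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
per_output_r2 = 1.0 - ss_res / ss_tot

macro_r2 = np.average(per_output_r2, weights=np.ones(2))  # uniform output_weight (1 / n_outputs each)
weighted_r2 = np.average(per_output_r2, weights=ss_tot)   # variance-proportional weights; equals
                                                          # 1 - ss_res.sum() / ss_tot.sum(), the flattened score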

@eickenberg
Contributor

I am incorporating the changes made on a fresh branch (rebasing was too difficult), setting the defaults the way Arnaud describes.


@MechCoder
Member Author

OK, so closing this.
