
[MRG] Regression metrics return arrays for multi-output cases #2493

Closed
wants to merge 14 commits

Conversation

MechCoder
Member

Hi, I tried to solve issue #2200. It now returns arrays for multi-output cases.

  • Consensus on what the default argument to output_weights should be: micro or macro averaging.
  • Whether the keyword should be output_weights or multi_output, as @GaelVaroquaux has suggested.

Once there is a consensus on these two things, this can be merged.

@MechCoder
Member Author

Hi @mblondel, @arjoly, would you be able to review this in your free time?

@arjoly
Member

arjoly commented Oct 6, 2013

You need to write tests for your new functionality; see sklearn/metrics/tests/test_metrics.py. For an explanation of how invariance testing works in test_metrics, see the doc added in #2460.

You also have some Travis failures.

@MechCoder
Member Author

The Travis failures seem to be due to the doctests that I've added.

For example:

>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> mean_absolute_error(y_true, y_pred, average=False)
array([ 0.5,  1. ])

However, in the doctest I wrote array([0.5, 1]) (without the spaces in between). I'm a beginner and I'm curious as to why this causes an error.
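
Editorial note, not part of the original thread: doctest compares the captured output against the expected text in the docstring character by character, so the expected line has to reproduce numpy's repr exactly, including its spacing. A minimal sketch, assuming the legacy numpy print style of that era:

import numpy as np

a = np.array([0.5, 1.0])
print(repr(a))
# With the numpy print style of that era this prints "array([ 0.5,  1. ])".
# doctest compares that string literally against the docstring, so writing
# "array([0.5, 1.])" as the expected output makes the doctest fail.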

@@ -1894,10 +1894,14 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : True or False
Default value is True. If False, returns an array (multi-output)
Member

Nitpick: I would write the sentence "If True, ..., if False, ... (default: True)"

@MechCoder
Member Author

@mblondel, @arjoly: do you have any more comments on this PR? Any further comments are welcome :)

@MechCoder
Member Author

@mblondel: I'm sorry, but I'm a bit new to machine learning. I have two queries.

1. Could you give me an example of what macro actually means? I understand that micro implies you flatten the resulting array across one dimension.

2. Also, does this mean that for now, average can have two values, 'micro' and False?

@mblondel
Member

mblondel commented Oct 7, 2013

Could you give me an example of what macro actually means? I understand that micro implies you flatten the resulting array across one dimension.

macro is the average of the array obtained with average=False.

Also, does this mean that for now, average can have two values, 'micro' and False?

Yep, otherwise your PR will take too long to merge. Well-focused PRs have shorter review cycles :)
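
Editorial aside, not from the thread: a minimal numpy sketch of the micro/macro distinction for mean_absolute_error, reusing the values from the doctest earlier in the thread:

import numpy as np

y_true = np.array([[0.5, 1], [-1, 1], [7, -6]], dtype=float)
y_pred = np.array([[0, 2], [-1, 2], [8, -5]], dtype=float)

per_output = np.abs(y_true - y_pred).mean(axis=0)  # what average=False returns: array([0.5, 1.])
micro = np.abs(y_true - y_pred).mean()             # flatten outputs and samples together: 0.75
macro = per_output.mean()                          # mean of the per-output scores: 0.75

For MAE the micro and macro averages coincide whenever every output has the same number of samples, which is part of the argument made further down that the average keyword adds little for these metrics.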

@MechCoder
Member Author

I suppose it's good to go now?

@@ -1894,10 +1894,16 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : 'micro' or False
Member

To be consistent with the other metrics, it should be None instead of False; see precision_score for instance.

@MechCoder
Member Author

I have made the changes you asked for. Thanks :)

@MechCoder
Member Author

@arjoly, @mblondel: is it good to go now?

if average:
    return 1.0
else:
    return np.ones(y_true.shape[1], dtype=np.float64)
Member

I think there is a little mistake here, since you can have a defined r2_score for some outputs, but not all.

A test would be needed for that case.

Member Author

Would something like this do?

y_true = [[1, 1], [1, 1]]
y_pred = [[1, 1], [1, 1]]
assert_array_equal(r2_score(y_true, y_pred, average=None), np.array([1., 1.]))

Member

Consider for instance,

In [2]: from sklearn.metrics import r2_score

In [3]: r2_score([0, 0], [2, 1])
Out[3]: 0.0

In [4]: r2_score([-1, 1], [2, 1])
Out[4]: -3.5

I expect r2_score([[0, -1],[0, 1]], [[2, 2],[1, 1]], average=None) to be equal to np.array([0, -3.5]).

Member Author

Ah okay.

@arjoly
Member

arjoly commented Oct 8, 2013

You need to handle the case where average is not None or 'micro'.

@@ -1894,10 +1894,16 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : 'micro' or None
Member

This would be prettier:
average : string, ['micro' (default), None]

@MechCoder
Member Author

You need to handle the case where average is not None or 'micro'

By raising an error?

@arjoly
Member

arjoly commented Oct 8, 2013

By raising an error?

Yes

Looking more into mean_absolute_error and mean_squared_error, I think we want something like weight_output and not something like averaging.

@MechCoder
Member Author

@arjoly, @mblondel: I did it using enumerate; I couldn't think of a more efficient numpy way of doing it, either with np.where or otherwise.

@arjoly
Member

arjoly commented Oct 9, 2013

@arjoly, @mblondel: I did it using enumerate; I couldn't think of a more efficient numpy way of doing it, either with np.where or otherwise.

I don't have time to look at this in more detail, but I think it could be done with numpy operations.
See how a 0 denominator is handled in the precision_recall_fscore function.
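
For reference, an editorial sketch (not code from the PR) of the vectorized approach being hinted at; r2_per_output is a hypothetical helper, and the zero-denominator convention follows the expected values in the review example above:

import numpy as np

def r2_per_output(y_true, y_pred):
    # Per-output R^2 without Python-level loops over the outputs.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    scores = np.ones(y_true.shape[1], dtype=np.float64)  # perfect fit where both sums are zero
    nonzero = ss_tot != 0
    scores[nonzero] = 1.0 - ss_res[nonzero] / ss_tot[nonzero]
    scores[~nonzero & (ss_res != 0)] = 0.0               # undefined R^2 (constant y_true) mapped to 0.0
    return scores

r2_per_output([[0, -1], [0, 1]], [[2, 2], [1, 1]])        # -> array([ 0. , -3.5]), matching the example above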

@coveralls

Coverage Status

Changes Unknown when pulling 097b62272805db5e0b93e3b2caa8690c7ba94c5a on Manoj-Kumar-S:metrics into scikit-learn:master.

@MechCoder
Member Author

@@ -1894,10 +1894,13 @@ def mean_absolute_error(y_true, y_pred):
y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
Estimated target values.

average : string, ['micro' (default), None]
Member

I'm still not convinced this is the way to go, since mean_absolute_error(y_true, y_pred, average="micro") == mean_absolute_error(y_true, y_pred, average="macro") == mean_absolute_error(y_true, y_pred, average="samples"). What is your opinion, @mblondel?

@MechCoder
Member Author

@arjoly: Sorry for disturbing you, but are there any updates? What values do you think should be supplied to average?

@arjoly
Member

arjoly commented Oct 14, 2013

@arjoly: Sorry for disturbing you, but are there any updates? What values do you think should be supplied to average?

In order to add your new feature, I would add an output_weights argument with the options 'uniform' (current behaviour) or None (no averaging; return the statistic for each output). Thus, the computation of mean_absolute_error, mean_squared_error and r2_score would be based on the weighted norm

||y_true - y_pred||_W = sum_j W_j * ||y_true[:, j] - y_pred[:, j]||

where W would be the output_weights vector. So the 'uniform' argument would mean that output_weights = np.ones(...).

I wouldn't use the average keyword since it's more adapted to the extension of binary measures. Later, or in this PR, a custom output_weights could be provided by the user. What is your opinion, @manoj-kumar-s?
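
To make the proposal concrete, here is an editorial sketch of the suggested API for mean_absolute_error; weighted_mae is a hypothetical function, not the PR's implementation, and the error for unknown string values follows the earlier review request:

import numpy as np

def weighted_mae(y_true, y_pred, output_weights='uniform'):
    # Per-output MAE combined through the output_weights vector W.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_output = np.abs(y_true - y_pred).mean(axis=0)
    if output_weights is None:                  # no averaging: one score per output
        return per_output
    if isinstance(output_weights, str):
        if output_weights != 'uniform':
            raise ValueError("output_weights must be 'uniform', None or an array")
        output_weights = np.ones(per_output.shape[0])   # 'uniform' == current behaviour
    return np.average(per_output, weights=output_weights)

weighted_mae([[0.5, 1], [-1, 1], [7, -6]], [[0, 2], [-1, 2], [8, -5]], output_weights=None)
# -> array([0.5, 1. ]); with the default 'uniform' it returns 0.75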

@MechCoder
Member Author

Ah okay, I suppose the predicted values that you get out of your model would be automatically scaled.

@eickenberg
Contributor

Indeed, your estimator/predictor should adapt to the scaled input data and the outputs will automatically scale as well.

If you write a scoring metric that normalizes y_true and y_pred and sticks them into r**2, then, if my arithmetic is correct, you should get something like 2 * corr(y_true, y_pred) - 1, give or take.

Now take ridge regression as an example. If you penalize too hard, r ** 2 will tend to 0, because your coefficients will become too small and you will find yourself on the wrong scale. Not so with correlation, or with r ** 2 on normalized entries: if you use r ** 2 on normalized predictions, you will get a much better score, since you will be able to compensate partially for the squeezing done by the penalty. This is not in the spirit of the r ** 2 score.

My main point, though, was that your example, even though you scale predictions, corroborates the fact that after normalization of the targets, the different scoring schemes become equivalent.
(And this will still be the case if you use an actual estimator and do not normalize predictions!)

The R ** 2 from the Kolar & Xing 2010 paper is only useful in the case where all target variances are the same. But in that case the scoring comes down to doing macro r ** 2 using a mean. So macro r ** 2 using an arithmetic mean is a more general way of scoring, and since scikit-learn tries to be as general as possible in the functionality it offers, I would definitely go with the macro r ** 2 mean.
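
A quick numerical check of this argument (editorial, not from the thread; the 0.1 factor is an arbitrary stand-in for the shrinkage of an over-penalized ridge fit):

import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randn(1000)
y_pred = 0.1 * y_true                             # heavily shrunken, "over-penalized" predictions

def r2(t, p):
    return 1.0 - ((t - p) ** 2).sum() / ((t - t.mean()) ** 2).sum()

def zscore(x):
    return (x - x.mean()) / x.std()

corr = np.corrcoef(y_true, y_pred)[0, 1]          # 1.0: shrinkage leaves the correlation intact
plain_r2 = r2(y_true, y_pred)                     # ~0.19: r**2 is punished for being on the wrong scale
normed_r2 = r2(zscore(y_true), zscore(y_pred))    # 1.0, i.e. 2 * corr - 1 as claimed above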

@MechCoder
Member Author

Thanks for the clarification, and do we assume the user inputs normalized data as well?

@eickenberg
Contributor

Taking the (possibly weighted) mean of the individual separate r**2 scores on all targets (i.e. "macro-averaging") would allow us to be agnostic as to how the individual targets are scaled at the input level.

@MechCoder
Member Author

@eickenberg: Sorry for being naive again, but according to the example that I gave, the macro-average is 0.72; however, for the normalized input I get 0.75 as the result.

@eickenberg
Contributor

@manoj-kumar-s this is one of the problems with normalizing the predictions.

To bring your normalization closer to the real-world situation I described earlier, try normalizing your predictions using the mean and stdev of y_true instead of those of y_pred. In this case you will find equality (the normalized-input r_2 will go down to .72 as well). (To be fully convincing one would need to work with an actual estimator, but the scaling by y_true is what a real linear estimator would undergo as well.)

@MechCoder
Member Author

@eickenberg: Thanks again :). @ogrisel: could you merge this if you have no other comments?

@ogrisel
Member

ogrisel commented Oct 25, 2013

I won't merge if there are not at least two +1s. Furthermore, we should not merge API evolutions if we are not 100% confident we have the right design, as users will be very annoyed if we change our mind again later and break their programs when upgrading scikit-learn. I am still not sure myself what default averaging strategy we should implement, and whether we should try to make it possible to implement non-macro-averaging-based strategies without breaking backward compat again in the future.

@manoj-kumar-s if you are tired of this discussion then please just say so and someone else will take over from there (you can mute the notifications from this specific PR by clicking on the button at the end of the page).

@MechCoder
Member Author

@ogrisel: Could you please have a look at the example that I provided, and at @eickenberg's comments in #2493 (comment) as well? It does seem that the macro-averaged r2_score is the way to go.

No, I am not tired of this discussion. I've learnt a lot already and I'm looking forward to learning more from other issues as well. :)

@MechCoder
Member Author

I think I shall just wait till we come to an agreement on this.

@MechCoder
Member Author

@arjoly, @mblondel, @eickenberg, @jaquesgrobler: Thanks all for your help on this PR, and I'm sorry to be bugging you again, but could you please give a final +1 or -1 on the averaging done in this PR (based on the example that I've given and #2493 (comment)), so that I can leave this PR for now and start looking at other issues. :)

@mblondel
Member

My opinion hasn't changed: I vote for macro-average in all metrics.

Now regarding MSE, I see several options:

  1. in the docstring, recommend to normalize y_true prior to fitting an estimator

  2. add a normalize option which normalizes y_true and apply the same normalization to y_test (but ideally the user should do it prior to fitting an estimator, hence the doc)

  3. detect whether y_true has been normalized and issue a warning if not

1) should be done for sure; I would like to hear opinions about 2) and 3).

@GaelVaroquaux
Member

If we don't plan to implement any other weighting schemes in the future (arguably only the 'uniform' / macro-averaged weighting seems standard for regression) I would rather just replace that option by a simple boolean flag, for instance named average_outputs=True

I made the exact same comment a while ago, and this is still how I feel.
This evokes a YAGNI feeling.

@GaelVaroquaux
Member

I think I will go ahead with making output_weights accept an array of user-provided weights, if it is OK with you :)

Sounds good!

@GaelVaroquaux
Member

but could you give a final +1 or -1 on the averaging done in this PR

I do not think that I can give a qualified answer, as I don't have these use cases, so I won't vote, sorry :$

@eickenberg
Contributor

As far as I understand, the motivation for returning one single value out of the metric function is to have a quantity that is totally orderable, such that a parameter grid search may choose a best parameter based on this score.

Speaking from a setting where the number of targets is potentially vastly higher than the number of samples or features (or both taken together, if that quantity is in any way useful), the hope of being able to tune one, two or three parameters in a grid search and find a satisfactory optimum for all the targets seems a rather naive one to me. Probably more often than not, the number of parameters will scale (hopefully only) linearly with the number of targets involved. And with a little luck, the parameter grid search can be performed almost independently per target.

So if ensuring total orderability is the only reason for condensing a vast number of scores into one, it may be worth discussing accepting arrays of scores by default (and deleting the isinstance numbers.Number clause in cross_val_score).

Of course my use case is specific, and the number of targets can be a small constant with respect to the other dimensions. In that case, condensing several r^2 scores into one may not lead to too grave a loss of information, and the arithmetic mean is maybe the simplest useful option.

@ogrisel
Member

ogrisel commented Oct 30, 2013

My opinion hasn't changed: I vote for macro-average in all metrics.
Now regarding MSE, I see several options:

  1. in the docstring, recommend to normalize y_true prior to fitting an estimator

+1. Same comment to be added in the docstring for MAE.

@ogrisel
Member

ogrisel commented Oct 30, 2013

Trying to detect and warn about un-normalized y_true sounds unstable, as the normalization will be tuned on the training set and y_true will most likely stem from a validation or testing set where the normalization will not be perfect. So -1 on option 3).

Option 2) (normalizing the output of the estimator) sounds like a non-principled hack so -1 as well.

@mblondel
Member

Speaking from a setting where the number of targets is potentially vastly higher than the number of samples or features (or both taken together, if that quantity is in any way useful), the hope of being able to tune one, two or three parameters in a grid search and find a satisfactory optimum for all the targets seems a rather naive one to me. Probably more often than not, the number of parameters will scale (hopefully only) linearly with the number of targets involved. And with a little luck, the parameter grid search can be performed almost independently per target.

In my experience, the more tasks you have, the more you would benefit from learning shared parameters because you don't have enough data to estimate many parameters. Moreover, multi-task algorithms usually have shared hyper-parameters.

Also technically if you care about tuning hyper-parameters on a per-task basis, then you don't even need multi-output metrics: you just need to fit one estimator per task/output and use regular metrics in grid search.

@MechCoder
Member Author

I've added a line in the docstring to advise the user to normalize y_true.

@coveralls

Coverage Status

Coverage remained the same when pulling b56ef15 on Manoj-Kumar-S:metrics into d82cf06 on scikit-learn:master.

@arjoly
Member

arjoly commented Jul 19, 2014

During the sprint, we (me, @eickenberg and @MechCoder) discussed the blocking points of this pull request. It turns out the difference between macro-averaging and the current implementation could be resolved by using output_weights properly.

The macro r2 / macro explained variance correspond to a uniform output_weight (= 1 / n_outputs), while the current version uses an output_weight proportional to the fraction of variance explained by each output.

Thus we decided to keep both versions. I am also fine with changing the default to macro.
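
An editorial sketch of the two weightings being kept, reading "proportional to the fraction of variance explained by each output" as weighting each output by its total sum of squares (that reading is my assumption):

import numpy as np

y_true = np.array([[0.5, 1], [-1, 1], [7, -6]], dtype=float)
y_pred = np.array([[0, 2], [-1, 2], [8, -5]], dtype=float)

ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
per_output_r2 = 1.0 - ss_res / ss_tot

macro_r2 = np.average(per_output_r2, weights=np.ones(2))  # uniform output_weight (1 / n_outputs each)
weighted_r2 = np.average(per_output_r2, weights=ss_tot)   # variance-proportional weights; equals
                                                          # 1 - ss_res.sum() / ss_tot.sum(), the flattened score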

@eickenberg
Contributor

I am incorporating the changes made on a fresh branch (rebasing was too difficult), setting the defaults the way Arnaud describes.


@MechCoder
Member Author

OK, so closing this.
