[MRG] Regression metrics return arrays for multi-output cases #2493
Conversation
You need to write tests for your new functionality; see ... You also have some Travis failures.
The Travis failures seem to be due to the doctests that I've added, e.g. ... However, in the doctest that I wrote ...
```
@@ -1894,10 +1894,14 @@ def mean_absolute_error(y_true, y_pred):
    y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
        Estimated target values.

    average : True or False
        Default value is True. If False, returns an array (multi-output)
```
Nitpick: I would write the sentence "If True, ..., if False, ... (default: True)"
@mblondel: I'm sorry, but I'm a bit new to machine learning, and I have two queries. 1) Could you give me an example of what macro actually means? I understand that micro implies you flatten the resulting array across one dimension. 2) Does this mean that, for now, average can take two values, micro and False?
macro is the average of the array obtained with average=False
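For concreteness, here is a tiny NumPy illustration of that relation (made-up data, not the PR's code): "macro" is just the arithmetic mean of the per-output array that the un-averaged metric would return.

```python
import numpy as np

y_true = np.array([[0.5, 1], [-1, 1], [7, -6]], dtype=float)
y_pred = np.array([[0, 2], [-1, 2], [8, -5]], dtype=float)

# the array a multi-output MAE would return without averaging
per_output_mae = np.abs(y_true - y_pred).mean(axis=0)
# "macro": the plain mean of those per-output scores
macro_mae = per_output_mae.mean()
```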
Yep, otherwise your PR will take too long to merge. Well-focused PRs have shorter review cycles :)
I suppose it's good to go now?
```
@@ -1894,10 +1894,16 @@ def mean_absolute_error(y_true, y_pred):
    y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
        Estimated target values.

    average : 'micro' or False
```
To be consistent with the other metrics, it should be None instead of False; see precision_score, for instance.
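For reference, this is the convention precision_score already follows: average=None returns one score per class rather than a single aggregate.

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

precision_score(y_true, y_pred, average=None)     # array of per-class precisions
precision_score(y_true, y_pred, average='micro')  # single pooled score
```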
I have made the changes you asked for. Thanks :)
```python
if average:
    return 1.0
else:
    return np.ones(y_true.shape[1], dtype=np.float64)
```
I think there is a little mistake here, since r2_score can be defined for some outputs but not all.
A test would be needed for that case.
Would something like this do?
```python
y_true = [[1, 1], [1, 1]]
y_pred = [[1, 1], [1, 1]]
assert np.array_equal(r2_score(y_true, y_pred, average=None),
                      np.array([1., 1.]))
```
Consider, for instance:

```python
In [2]: from sklearn.metrics import r2_score

In [3]: r2_score([0, 0], [2, 1])
Out[3]: 0.0

In [4]: r2_score([-1, 1], [2, 1])
Out[4]: -3.5
```

I expect r2_score([[0, -1], [0, 1]], [[2, 2], [1, 1]], average=None) to be equal to np.array([0, -3.5]).
Ah okay.
You need to handle the case where average is neither None nor 'micro'.
```
@@ -1894,10 +1894,16 @@ def mean_absolute_error(y_true, y_pred):
    y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
        Estimated target values.

    average : 'micro' or None
```
This would be prettier:

```
average : string, ['micro' (default), None]
```
By raising an error?
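A minimal sketch of such validation (function name and message are illustrative, not the PR's actual code):

```python
def _check_average(average):
    # Reject anything other than the two supported values.
    if average not in ('micro', None):
        raise ValueError("average has to be one of ('micro', None), "
                         "got %r" % (average,))
    return average
```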
Looking more into ...
I don't have time to look into it in more detail, but I think this could be done with NumPy operations.
Hi @arjoly, I tried changing it to the NumPy way of doing things, following http://stackoverflow.com/questions/19274082/numpy-method-of-returning-values-based-on-two-different-arrays/19275440#comment28539446_19275440
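For illustration, a hedged NumPy sketch (function name and details are assumptions, not the PR's code) of computing per-output R^2 without a Python loop, following the conventions discussed above: 1.0 for a perfectly predicted constant output, 0.0 for a zero-variance output with imperfect predictions.

```python
import numpy as np

def r2_per_output(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    nonzero = ss_tot != 0
    # start at 1.0: covers constant outputs predicted perfectly
    scores = np.ones(y_true.shape[1], dtype=np.float64)
    scores[nonzero] = 1.0 - ss_res[nonzero] / ss_tot[nonzero]
    # zero-variance output with imperfect prediction: undefined, use 0.0
    scores[~nonzero & (ss_res != 0)] = 0.0
    return scores

# matches the expectation given earlier in the thread
assert np.allclose(r2_per_output([[0, -1], [0, 1]], [[2, 2], [1, 1]]),
                   [0.0, -3.5])
```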
```
@@ -1894,10 +1894,13 @@ def mean_absolute_error(y_true, y_pred):
    y_pred : array-like of shape = [n_samples] or [n_samples, n_outputs]
        Estimated target values.

    average : string, ['micro' (default), None]
```
I'm still not convinced this is the way to go, since

mean_absolute_error(y_true, y_pred, average="micro") == mean_absolute_error(y_true, y_pred, average="macro") == mean_absolute_error(y_true, y_pred, average="samples")

What is your opinion, @mblondel?
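A quick numerical check of this point: because MAE is a plain mean of absolute errors, micro (flatten everything), macro (mean of per-output means) and samples (mean of per-sample means) all coincide when no weights are involved. The data below is made up.

```python
import numpy as np

abs_err = np.abs(np.array([[0., 2.], [1., 3.]]) - np.array([[1., 1.], [1., 1.]]))

micro = abs_err.mean()                 # mean over the flattened array
macro = abs_err.mean(axis=0).mean()    # mean of per-output means
samples = abs_err.mean(axis=1).mean()  # mean of per-sample means
assert micro == macro == samples
```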
@arjoly: Sorry for disturbing you, but any updates? What values do you think should be supplied to average?
In order to add your new feature, I would add an ..., where W would be the ... I wouldn't use the ...
Ah okay, I suppose the predicted values that you get out of your model would be automatically scaled.
Indeed, your estimator/predictor should adapt to the scaled input data, and the outputs will automatically scale as well. If you write a scoring metric that normalizes y_true and y_pred and sticks them into r ** 2 then, if my arithmetic is correct, you should get something like 2 * corr(y_true, y_pred) - 1, give or take.

Now take ridge regression as an example. If you penalize too hard, r ** 2 will tend to 0, because your coefficients become too small and you find yourself on the wrong scale. Not so with correlation, or r ** 2 with normalized entries: if you use r ** 2 on normalized predictions, you will get a much better score, since you can partially compensate for the squeezing done by the penalty. This is not in the spirit of the r ** 2 score.

My main point, though, was that your example, even though you scale predictions, corroborates the fact that after normalization of the targets the different scoring schemes become equivalent. The R ** 2 from the Kolar & Xing 2010 paper is only useful in the case where all target variances are the same, but in that case the scoring comes down to macro r ** 2 using a mean. So macro r ** 2 with an arithmetic mean is a more general way of scoring, and since scikit-learn tries to be as general as possible in the functionality it offers, I would definitely go with macro r ** 2 with a mean.
Thanks for the clarification. Do we assume the user inputs normalized data as well?
Taking the (possibly weighted) mean of the individual separate r**2 scores on all targets (i.e. "macro-averaging") would allow us to be agnostic as to how the individual targets are scaled at the input level.
@eickenberg: Sorry for being naive again, but according to the example that I gave, the macro-average is 0.72; however, for the normalized input I get 0.75 as the result.
@manoj-kumar-s this is one of the problems with normalizing the predictions. To bring your normalization closer to the real-world situation I described earlier, try normalizing your predictions using the mean and stdev of y_true instead of those of y_pred. In this case you will find equality (the normalized-input r**2 will go down to .72 as well). (To be fully convincing one would need to work with an actual estimator, but the scaling by y_true is what a real linear estimator would undergo as well.)
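A numerical check of this claim, with made-up data rather than the example from the thread: when y_pred is normalized with the mean/std of y_true, the pooled r**2 of the normalized values equals the macro average of the per-output scores.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
y_true = rng.randn(50, 3) * [1.0, 5.0, 0.1]  # outputs on very different scales
y_pred = y_true + rng.randn(50, 3)           # noisy predictions

# normalize both with y_true's statistics, as suggested above
mu, sigma = y_true.mean(axis=0), y_true.std(axis=0)
t_norm = (y_true - mu) / sigma
p_norm = (y_pred - mu) / sigma

macro = np.mean([r2_score(y_true[:, k], y_pred[:, k]) for k in range(3)])
pooled_norm = r2_score(t_norm.ravel(), p_norm.ravel())
assert np.isclose(macro, pooled_norm)
```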
@eickenberg: Thanks again :) @ogrisel: Could you merge this if you have no other comments?
I won't merge if there are not at least two +1s. Furthermore, we should not merge API evolutions if we are not 100% confident we have the right design, as users will be very annoyed if we change our mind again later and break their programs when upgrading scikit-learn. I am still not sure myself what default averaging strategy we should implement, and whether we should try to make it possible to implement non-macro-averaging-based strategies without breaking backward compat again in the future.

@manoj-kumar-s if you are tired of this discussion then please just say so and someone else will take over from there (you can mute the notifications from this specific PR by clicking on the button at the end of the page).
@ogrisel: Could you please have a look at the example that I provided and at @eickenberg's comments (#2493 (comment)) as well? It does seem that macro-average r2_score is the way to go. No, I am not tired of this discussion; I've learnt a lot already and look forward to learning more from other issues as well. :)
I think I shall just wait till we come to an agreement on this.
@arjoly, @mblondel, @eickenberg, @jaquesgrobler: Thanks all for your help on this PR, and I'm sorry to bug you again, but could you please give a final +1 or -1 on the averaging done in this PR (based on the example I've given and #2493 (comment)), so that I can leave this PR for now and start looking at other issues? :)
My opinion hasn't changed: I vote for macro-average in all metrics. Now regarding MSE, I see several options: ...
I made the exact same comment a while ago, and this is still how I feel.
Sounds good!
I do not think that I can give a qualified answer, as I don't have these ...
As far as I understand, the motivation for returning one single value from the metric function is to have a quantity that is totally orderable, so that a parameter grid search can choose the best parameter based on this score.

Speaking from a setting where the number of targets is potentially vastly higher than the number of samples or features (or both taken together, if that quantity is in any way useful), the hope of being able to tune one, two or three parameters in a grid search and find a satisfactory optimum for all the targets seems rather naive to me. More often than not, the number of parameters will scale (hopefully only) linearly with the number of targets involved, and with a little luck the parameter grid search can be performed almost independently per target. So if ensuring total orderability is the only reason for condensing a vast number of scores into one, it may be worth discussing whether to accept arrays of scores by default (and delete the isinstance numbers.Number clause in cross_val_score).

Of course my use case is specific, and the number of targets can be a small constant with respect to the other dimensions. In that case, condensing several r^2 scores into one may not lead to too grave a loss of information, and the arithmetic mean is maybe the simplest useful option.
+1. Same comment to be added in the docstring for MAE.
Trying to detect and warn about un-normalized ... Option 2) (normalizing the output of the estimator) sounds like a non-principled hack, so -1 as well.
In my experience, the more tasks you have, the more you benefit from learning shared parameters, because you don't have enough data to estimate many parameters. Moreover, multi-task algorithms usually have shared hyper-parameters. Also, technically, if you care about tuning hyper-parameters on a per-task basis, then you don't even need multi-output metrics: you can just fit one estimator per task/output and use regular metrics in grid search.
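A sketch of that per-task workflow (made-up data; modern scikit-learn import paths, not the API of the time):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
Y = rng.randn(100, 3)  # three tasks/outputs

per_task_models = []
for k in range(Y.shape[1]):
    # each task gets its own grid search with an ordinary single-output metric
    search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, scoring="r2")
    search.fit(X, Y[:, k])
    per_task_models.append(search.best_estimator_)
```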
I've added a line in the docstring to advise the user to normalize y_true.
During the sprint, we (me, @eickenberg and @MechCoder) discussed the blocking points of this pull request. It turns out the difference between macro-averaging and the current implementation could be resolved by using output_weights properly. Macro-r2 / macro-explained-variance correspond to uniform output_weight (= 1 / n_outputs), and the current version uses output_weight proportional to the fraction of variance explained by each output. Thus we decided to keep both versions. I am also fine with changing the default to macro.
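For illustration, a minimal NumPy sketch (illustrative names, not the PR's code) of the two weighting schemes, assuming the "current version" is the score pooled over all outputs: uniform weights give macro-r2, while weights proportional to each output's total sum of squares recover the pooled value.

```python
import numpy as np

def averaged_r2(y_true, y_pred, output_weights="uniform"):
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    scores = 1.0 - ss_res / ss_tot  # assumes no zero-variance outputs
    if output_weights == "uniform":
        return scores.mean()  # macro: weight 1 / n_outputs each
    elif output_weights == "variance":
        # weighting by ss_tot gives 1 - sum(ss_res) / sum(ss_tot), the pooled score
        return np.average(scores, weights=ss_tot)
    raise ValueError("output_weights must be 'uniform' or 'variance'")
```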
I am incorporating the changes on a fresh branch (rebasing was too ...).
OK, so closing this.
Hi, I tried to solve issue #2200. It now returns arrays for multi-output cases.
Once there is a consensus on these two things, this can be merged.