Poor out-of-sample accuracy with reg:absoluteerror #7674

Closed
Kodiologist opened this issue Feb 18, 2022 · 25 comments

@Kodiologist
Contributor

Kodiologist commented Feb 18, 2022

(Original title: "Poor fit with absolute-error-like objective functions")

I've looked more into the problem I described in this discussion post, and I think there are two underlying problems, one easy to solve or work around, and one not so easy.

The easy part is that min_child_weight (at least at its default value, 1) seems to be pretty destructive to the accuracy of some objectives, such as pseudo-Huber. In the example below (copied from my earlier post), I get an MAE for pseudo-Huber of over 33 with the default min_child_weight = 1, but 0.9 with min_child_weight = 0. I think the default should probably be 0. Increasing min_child_weight probably isn't the first thing you'd want to try if your model came out too complex, anyway.

library(xgboost)

rmse = function(x, y)
    sqrt(mean((x - y)^2))
mae = function(x, y)
    mean(abs(x - y))

set.seed(5)

N = 1000
x = rep(c(0L, 1L, 10L), len = N)
y = x^2 + rnorm(N)^2

for (loss in c("reg:squarederror", "reg:pseudohubererror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(objective = loss),
         data = matrix(x),
         label = y,
         nrounds = 50)
    p = predict(m, newdata = matrix(x))
    message("RMSE ", loss, " - ", rmse(y, p))
    message("MAE ", loss, " - ", mae(y, p))}

This isn't the whole story, though, for two reasons. First, a log-cosh objective, defined by

logcosh.objective = function(preds, dtrain)
   {e = preds - getinfo(dtrain, "label")
    list(grad = tanh(e), hess = 1/cosh(e)^2)}

still does quite badly in this case with min_child_weight = 0. Second, pseudo-Huber loss does better, but still not well enough, on this trivial example:

x = c(0L, 1L)
y = x

for (loss in c("reg:squarederror", "reg:pseudohubererror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(
             lambda = 0, eta = 1,
             objective = loss),
         data = matrix(x),
         label = y,
         nrounds = 1)
    p = predict(m, newdata = matrix(x))
    message("Predictions for ", loss, ": ")
    print(p)}

Here, if I set min_child_weight = 0, the predictions for pseudo-Huber become -.0125 and 1.125, which are closer to but not equal to the expected answers, 0 and 1. Perhaps tellingly, if base_score is increased to 10, pseudo-Huber produces absurd predictions of [-999.9999, -728.0000], whereas the squared-error case still does the right thing.

I think the problem is not with how splits are chosen in the predictor space, but with how values are assigned to the leaves. In the function CalcWeight in src/tree/param.h, the leaf values come out to (in the unregularized case) the negated sum of gradients divided by the sum of Hessians. I don't really understand the motivation for this, despite consulting the paper, and it seems to work poorly in the case of pseudo-Huber, at least.

@trivialfis
Member

Hi, thank you for posting an interesting question. I can't start investigating at the moment, but I will come back to your examples.

I don't really understand the motivation for this, despite consulting the paper, and it seems to work poorly in the case of pseudo-Huber, at least.

I think it's proven to be the optimal leaf value for the greedy split.
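For reference, a sketch of the standard argument from the XGBoost paper (g_i and h_i are the gradient and Hessian of the loss at instance i, and I_j is the set of instances in leaf j): the second-order approximation of the objective restricted to one leaf with value w is

    L_j(w) ≈ sum_{i in I_j} (g_i*w + (1/2)*h_i*w^2) + (1/2)*lambda*w^2

and setting dL_j/dw = 0 gives

    w_j* = -sum(g_i) / (sum(h_i) + lambda)

i.e. a regularized Newton step. This is optimal only for the quadratic approximation, so when the Hessians are nearly zero (as with pseudo-Huber far from the target) the step can be wildly too large, consistent with the absurd predictions reported above for base_score = 10.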

@trivialfis
Member

I tried a small random dataset; pseudo-Huber produces really small Hessians:

0.000008, 0.000003, 0.000000, 0.000001, 0.000039, 0.000002, 0.000000, 0.000023, 0.000020, 0.000009, 0.000023, 0.000008, 0.000735, 0.000013, 0.014569, 0.000003, 0.000101, 0.000001, 0.000000, 0.000036, 0.000017, 0.000001, 0.000006, 0.000010, 0.000000, 0.000032, 0.000001, 0.648050, 0.000001, 0.000000, 0.000001, 0.000089, 

This limits the step size for optimization. Implementing the slope for pseudo-Huber can improve the convergence quite significantly.

@trivialfis
Member

One way to handle these types of objectives is to revise the tree leaf values after each iteration, which I plan to work on after this release.

@Kodiologist
Contributor Author

Implementing the slope for pseudo-Huber

What does this mean? By the slope, do you mean the gradient? It looks like the gradient is already defined properly for this objective.

@Kodiologist
Contributor Author

By the way, this kind of investigation may be easier with the somewhat simpler log-cosh objective function, with gradient tanh(x) and Hessian 1/cosh(x)^2.

@trivialfis
Member

trivialfis commented Feb 22, 2022

By the slope, do you mean the gradient?

No, I meant the delta term in pseudo-Huber, which is currently set to 1.

@Kodiologist
Contributor Author

Oh, I see. Yeah, I was thinking that the δ value should be documented in the description of reg:pseudohubererror, and perhaps customizable.
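In the meantime, a customizable δ can be sketched as a custom objective, in the same style as the log-cosh example above (δ = 1 should recover the current built-in behavior; this is illustrative and untested against the C++ implementation):

pseudohuber.objective = function(delta)
    function(preds, dtrain)
       {# Pseudo-Huber loss: delta^2 * (sqrt(1 + (e/delta)^2) - 1)
        e = preds - getinfo(dtrain, "label")
        s = 1 + (e / delta)^2
        list(
            grad = e / sqrt(s),
            # A larger delta keeps the Hessian away from zero on large
            # residuals, allowing bigger Newton steps.
            hess = 1 / s^1.5)}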

@trivialfis
Member

Yeah, I have a working branch for the customizable pseudo-Huber, but I need some refactoring in #7640 first.

For the more general case where objectives produce small Hessians, I think we will try to resolve it in the next release; this is also necessary for objectives like MAE and quantile regression.

@hcho3
Collaborator

hcho3 commented Mar 22, 2022

FYI, XGBoost now lets you indicate the slope for the pseudo-Huber loss: #7727
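A minimal usage sketch with the R interface, reusing x and y from the examples above (the value 5 is arbitrary; huber_slope defaults to 1):

m = xgboost::xgboost(
    verbose = 0,
    params = list(
        objective = "reg:pseudohubererror",
        huber_slope = 5),
    data = matrix(x),
    label = y,
    nrounds = 50)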

@trivialfis trivialfis added this to Need prioritize in 2.0 Roadmap via automation Mar 31, 2022
@trivialfis trivialfis moved this from Need prioritize to 2.0 TODO in 2.0 Roadmap Mar 31, 2022
@trivialfis trivialfis moved this from 2.0 TODO to 2.0 In Progress in 2.0 Roadmap Apr 22, 2022
@hcho3 hcho3 closed this as completed May 14, 2022
2.0 Roadmap automation moved this from 2.0 In Progress to 2.0 Done May 14, 2022
@hcho3
Collaborator

hcho3 commented May 14, 2022

Closing, since the pseudo-Huber loss now supports a custom slope (huber_slope)

@Kodiologist
Contributor Author

I don't think that fixes the problem. Changing the slope just makes pseudo-Huber less like absolute error.

@hcho3
Collaborator

hcho3 commented May 14, 2022

@Kodiologist There is an experimental prototype for the L1 error: #7812. It implements an "adaptive tree" method where we can fit trees even when the Hessian is constant or not informative. So perhaps there is less need for the pseudo-Huber loss to behave like the absolute error?
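Roughly, the idea (a simplified sketch of the concept, not the actual code in #7812) is that after a tree's structure is grown, each leaf value is recomputed directly from the residuals routed to that leaf, e.g. their median for L1 loss, so the leaf value no longer depends on an informative Hessian:

# Illustrative only: recompute one leaf's value under L1 loss.
# y.leaf and pred.leaf are the labels and current predictions of the
# rows that fall into this leaf; the median residual minimizes the
# absolute error within the leaf.
adaptive.leaf.value = function(y.leaf, pred.leaf)
    median(y.leaf - pred.leaf)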

@Kodiologist
Contributor Author

Oh, very cool, I hadn't heard. I'll have to look into it.

@hcho3 hcho3 reopened this May 14, 2022
@hcho3
Collaborator

hcho3 commented May 14, 2022

Reopening this issue for now, since the L1 error is still experimental.

@trivialfis
Member

I will work on optimal initialization for L1 later. However, to fully resolve the issue for custom objectives, we need to implement #7693. I'm still not entirely sure how we should move forward regarding the interface.

@trivialfis trivialfis moved this from 2.0 Done to 2.0 In Progress in 2.0 Roadmap May 20, 2022
@Kodiologist
Contributor Author

@trivialfis I'm getting poor fits with the new reg:absoluteerror and XGBoost 18a38f7 in simple examples like the ones I've tried before with pseudo-Huber and log-cosh; see below for a complete example. Am I misconfiguring it?

rmse = function(x, y)
    sqrt(mean((x - y)^2))
mae = function(x, y)
    mean(abs(x - y))
r = function(x)
    round(x, 2)

set.seed(5)

N = 1000
x = rep(c(0L, 1L, 10L), len = N)
y = x^2 + rnorm(N)^2

message("RMSE original - ", r(rmse(y, x^2)))
message("MAE original - ", r(mae(y, x^2)))

for (loss in c("reg:squarederror", "reg:pseudohubererror", "reg:absoluteerror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(
             objective = loss,
             min_child_weight = 0,
             base_score = median(y)),
         data = matrix(x),
         label = y,
         nrounds = 50)
    p = predict(m, newdata = matrix(x))
    message("RMSE ", loss, " - ", r(rmse(y, p)))
    message("MAE ", loss, " - ", r(mae(y, p)))}

The result is:

RMSE original - 1.79
MAE original - 1.02
RMSE reg:squarederror - 1.46
MAE reg:squarederror - 1
RMSE reg:pseudohubererror - 1.5
MAE reg:pseudohubererror - 0.9
RMSE reg:absoluteerror - 48.54
MAE reg:absoluteerror - 28.6

@trivialfis
Member

trivialfis commented May 24, 2022

@Kodiologist Could you please try one of the tree methods: approx, hist, gpu_hist?

The adaptive tree is not implemented for the exact tree method yet (which is the default for small data).
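Concretely, the only change to your example should be adding tree_method to the params list (a sketch of the modified call; everything else stays the same):

m = xgboost::xgboost(
    verbose = 0,
    params = list(
        objective = "reg:absoluteerror",
        min_child_weight = 0,
        base_score = median(y),
        tree_method = "hist"),
    data = matrix(x),
    label = y,
    nrounds = 50)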

Using hist:

RMSE original - 1.79
MAE original - 1.02
RMSE reg:squarederror - 1.46
MAE reg:squarederror - 1
RMSE reg:pseudohubererror - 1.5
MAE reg:pseudohubererror - 0.9
RMSE reg:absoluteerror - 1.57
MAE reg:absoluteerror - 0.88

Here are a few things I need to do to complete the support for L1 error:

  • Automate the base_score=median(y) in your example.
  • Implementation for exact tree method.
  • Optimize the quantile function to avoid a sort.

@Kodiologist
Contributor Author

Thanks, that seems to work well in this case. Do you recommend hist over approx in most cases?

Automate the base_score=median(y) in your example

Intelligent selection of the default base_score could probably be generalized to other objectives, too, like base_score=mean(y) for reg:squarederror.
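Something like this hypothetical helper, applied inside the library rather than by the user (the name and dispatch here are illustrative, not an actual XGBoost API):

# Hypothetical sketch: choose the constant prediction that minimizes
# the objective's loss on the training labels.
default.base.score = function(objective, y)
    switch(objective,
        "reg:squarederror" = mean(y),     # the mean minimizes squared error
        "reg:absoluteerror" = median(y),  # the median minimizes absolute error
        0.5)                              # otherwise, the current fixed default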

@trivialfis
Member

I wrote a brief introduction to the various tree methods: https://xgboost.readthedocs.io/en/stable/treemethod.html

For constant-Hessian objectives, hist is preferred. For other objectives it's case by case; I don't have a thorough study (which I'm working on now that approx is completely rewritten).

Intelligent selection of the default base_score could probably be generalized to other objectives, too,

Yes. Thank you for the suggestion. I have a POC implementation for this and will submit a PR after the work on categorical features is completed.

@Kodiologist
Contributor Author

I've also noticed that my example problem has trouble with DART, even with tree_method = "hist", at least if the RNG is re-seeded in the inner loop:

rmse = function(x, y)
    sqrt(mean((x - y)^2))
mae = function(x, y)
    mean(abs(x - y))
r = function(x)
    round(x, 2)

set.seed(5)

N = 1000
x = rep(c(0L, 1L, 10L), len = N)
y = x^2 + rnorm(N)^2

message("RMSE original - ", r(rmse(y, x^2)))
message("MAE original - ", r(mae(y, x^2)))

for (loss in c("reg:squarederror", "reg:pseudohubererror", "reg:absoluteerror"))
     for (booster in c("gbtree", "dart"))
        {set.seed(8)
         m = xgboost::xgboost(
             verbose = 0,
             params = c(
                 list(
                     objective = loss,
                     min_child_weight = 0,
                     base_score = median(y),
                     tree_method = "hist",
                     booster = booster),
                 (if (booster == "dart") list(one_drop = 1))),
             data = matrix(x),
             label = y,
             nrounds = 50)
        p = predict(m, newdata = matrix(x))
        message("RMSE ", loss, " ", booster, " - ", r(rmse(y, p)))
        message("MAE ", loss, " ", booster, " - ", r(mae(y, p)))}

The result is:

RMSE original - 1.79
MAE original - 1.02
RMSE reg:squarederror gbtree - 1.46
MAE reg:squarederror gbtree - 1
RMSE reg:squarederror dart - 1.46
MAE reg:squarederror dart - 1
RMSE reg:pseudohubererror gbtree - 1.5
MAE reg:pseudohubererror gbtree - 0.9
RMSE reg:pseudohubererror dart - 56.17
MAE reg:pseudohubererror dart - 33.01
RMSE reg:absoluteerror gbtree - 1.57
MAE reg:absoluteerror gbtree - 0.88
RMSE reg:absoluteerror dart - 18.62
MAE reg:absoluteerror dart - 11.36

@Kodiologist
Contributor Author

@trivialfis I was excited to use reg:absoluteerror in some of my group's projects, but I again have performance problems: no longer catastrophically bad, but still worse than what can be achieved with other objectives. Here's a simplified example using some of our data. It does 5-fold cross-validation and finds worse MAE for reg:absoluteerror than reg:squarederror. I've added some tuning (selection of the lambda parameter with nested cross-validation) and it hasn't changed the situation. It's interesting that training error is better for reg:absoluteerror than reg:squarederror although test error is worse; perhaps there's a way to tune the model better so that overfitting is less of an issue. But my much fancier tuning in my real projects hasn't seemed to suffice. Maybe there's a bug.

(If you don't have the R package pbapply installed, you can replace pbapply::pbsapply with sapply to get the same result without progress bars.)

library(xgboost)

d = read.csv("example.csv")
ivs = c(
    "lon", "lat", "date.i", "aod.terra", "aod.aqua",
    "merra.pm25", "temperature.K", "vapor.pressure.Pa",
    "precipitation.mm", "atmo.height.m", "wind.u.mps", "wind.v.mps",
    "road.density.mpkm2")

set.seed(4L)
d$fold = sample.int(5, nrow(d), rep = T)
lambda.candidates = 2^seq(-10, 10, len = 11)

x.pred = function(objective, lambda, train, test)
    predict(
        xgboost(
            data = as.matrix(train[, ivs]),
            label = train$y,
            verbose = 0,
            nrounds = 50,
            objective = objective,
            base_score = median(train$y),
            tree_method = "hist",
            max_bin = 2^12,
            lambda = lambda),
        as.matrix(test[, ivs]))

for (o in list("reg:squarederror", "reg:absoluteerror"))
   {set.seed(5L)
    pred = rep(NA_real_, nrow(d))
    for (outer.fold.i in sort(unique(d$fold)))
       {message(o, " fold ", outer.fold.i)
        lambda.best = lambda.candidates[which.min(pbapply::pbsapply(
            lambda.candidates,
            function(lambda)
               {pred.tune = rep(NA_real_, nrow(d))
                for (inner.fold.i in setdiff(sort(unique(d$fold)), outer.fold.i))
                     pred.tune[d$fold == inner.fold.i] = x.pred(
                         o,
                         lambda,
                         d[!(d$fold %in% c(inner.fold.i, outer.fold.i)),],
                         d[d$fold == inner.fold.i,])
                mean(abs((pred.tune - d$y)[d$fold != outer.fold.i]))}))]
        message(o, " training MAE: ", round(d = 2, mean(abs(
            d[d$fold != outer.fold.i, "y"] - x.pred(
                o,
                lambda.best,
                d[d$fold != outer.fold.i,],
                d[d$fold != outer.fold.i,])))))
        pred[d$fold == outer.fold.i] = x.pred(
            o,
            lambda.best,
            d[d$fold != outer.fold.i,],
            d[d$fold == outer.fold.i,])}
    message(o, " test MAE: ", round(d = 2, mean(abs(pred - d$y))))}

The result (minus progress bars and progress messages) is:

reg:squarederror training MAE: 9.01
reg:squarederror training MAE: 9.33
reg:squarederror training MAE: 9.34
reg:squarederror training MAE: 9.26
reg:squarederror training MAE: 9.24
reg:squarederror test MAE: 10.42
reg:absoluteerror training MAE: 8.28
reg:absoluteerror training MAE: 8.28
reg:absoluteerror training MAE: 8.28
reg:absoluteerror training MAE: 8.4
reg:absoluteerror training MAE: 8.25
reg:absoluteerror test MAE: 10.84

Here's the input file example.csv.

@trivialfis trivialfis self-assigned this Oct 20, 2022
@Kodiologist Kodiologist changed the title Poor fit with absolute-error-like objective functions Poor out-of-sample accuracy with reg:absoluteerror Mar 31, 2023
@trivialfis
Member

Hi, sorry for the slow reply. I believe this is now addressed with the support for learning-rate scaling:

reg:squarederror fold 1
reg:squarederror training MAE: 9.01
reg:squarederror fold 2
reg:squarederror training MAE: 9.33
reg:squarederror fold 3
reg:squarederror training MAE: 9.34
reg:squarederror fold 4
reg:squarederror training MAE: 9.26
reg:squarederror fold 5
reg:squarederror training MAE: 9.24
reg:squarederror test MAE: 10.42
reg:absoluteerror fold 1
reg:absoluteerror training MAE: 9.13
reg:absoluteerror fold 2
reg:absoluteerror training MAE: 9.27
reg:absoluteerror fold 3
reg:absoluteerror training MAE: 9.28
reg:absoluteerror fold 4
reg:absoluteerror training MAE: 9.19
reg:absoluteerror fold 5
reg:absoluteerror training MAE: 9.24
reg:absoluteerror test MAE: 10.33
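For context, "learning-rate scaling" here means the recomputed (adaptive) leaf values are shrunk by eta like ordinary leaf weights rather than applied at full step size; continuing the earlier leaf sketch (illustrative only, not the actual implementation):

# The refreshed leaf value is multiplied by eta, so each tree takes a
# damped step toward the in-leaf median rather than jumping all the way.
adaptive.leaf.value = function(y.leaf, pred.leaf, eta)
    eta * median(y.leaf - pred.leaf)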

@Kodiologist
Contributor Author

@trivialfis I'm not reproducing your results on xgboost 1.7.5.1 from CRAN. Are the relevant changes newer than that, or do XGBoost's parameters need to be changed from what I put in the example to get learning-rate scaling?

@trivialfis
Member

Apologies, it's still on the master branch.

@Kodiologist
Contributor Author

Thanks, it works for me with master. Great work. The project that originally motivated me to open this issue is too far along now for me to try using MAE again, and the example works now, so I think we're good. If I later try this with a real project and I think there's still a problem with XGBoost, I'll open a new issue.

2.0 Roadmap automation moved this from 2.0 In Progress to 2.0 Done Jul 7, 2023