Poor out-of-sample accuracy with reg:absoluteerror #7674

Closed
Kodiologist opened this issue Feb 18, 2022 · 25 comments

@Kodiologist
Contributor

Kodiologist commented Feb 18, 2022

(Original title: "Poor fit with absolute-error-like objective functions")

I've looked more into the problem I described in this discussion post, and I think there are two underlying problems, one easy to solve or work around, and one not so easy.

The easy part is that min_child_weight (at least at its default value, 1) seems to be pretty destructive to the accuracy of some objectives, such as pseudo-Huber. In the example below (copied from my earlier post), I get an MAE for pseudo-Huber of over 33 with the default min_child_weight = 1, but 0.9 with min_child_weight = 0. I think the default should probably be 0. Increasing min_child_weight probably isn't the first thing you'd want to try if your model came out too complex, anyway.

library(xgboost)

rmse = function(x, y)
    sqrt(mean((x - y)^2))
mae = function(x, y)
    mean(abs(x - y))

set.seed(5)

N = 1000
x = rep(c(0L, 1L, 10L), len = N)
y = x^2 + rnorm(N)^2

for (loss in c("reg:squarederror", "reg:pseudohubererror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(objective = loss),
         data = matrix(x),
         label = y,
         nrounds = 50)
    p = predict(m, newdata = matrix(x))
    message("RMSE ", loss, " - ", rmse(y, p))
    message("MAE ", loss, " - ", mae(y, p))}

This isn't the whole story, though, for two reasons. First, a log-cosh objective, defined by

logcosh.objective = function(preds, dtrain)
   {e = preds - getinfo(dtrain, "label")
    list(grad = tanh(e), hess = 1/cosh(e)^2)}

still does quite badly in this case with min_child_weight = 0. Second, pseudo-Huber loss does better, but still not well enough, on this trivial example:

x = c(0L, 1L)
y = x

for (loss in c("reg:squarederror", "reg:pseudohubererror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(
             lambda = 0, eta = 1,
             objective = loss),
         data = matrix(x),
         label = y,
         nrounds = 1)
    p = predict(m, newdata = matrix(x))
    message("Predictions for ", loss, ": ")
    print(p)}

Here, if I set min_child_weight = 0, the predictions for pseudo-Huber become -.0125 and 1.125, which are closer to but not equal to the expected answers, 0 and 1. Perhaps tellingly, if base_score is increased to 10, pseudo-Huber produces absurd predictions of [-999.9999, -728.0000], whereas the squared-error case still does the right thing.

I think the problem is not with how splits are chosen in the predictor space, but with how values are assigned to the leaves. In the function CalcWeight in src/tree/param.h, the leaf values come out to (in the unregularized case) the negated sum of gradients divided by the sum of Hessians. I don't really understand the motivation for this, despite consulting the paper, and it seems to work poorly in the case of pseudo-Huber, at least.

@trivialfis
Member

Hi, thank you for posting an interesting question. I can't start investigating at the moment, but I will come back to your examples.

I don't really understand the motivation for this, despite consulting the paper, and it seems to work poorly in the case of pseudo-Huber, at least.

I think it's proven to be the optimal leaf value for the greedy split.
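For reference, a sketch of the standard argument from the XGBoost paper (g_i and h_i are the gradient and Hessian of the loss at instance i, and I_j is the set of instances in leaf j): the second-order approximation of the objective restricted to one leaf with value w is

    L_j(w) ≈ sum_{i in I_j} (g_i*w + (1/2)*h_i*w^2) + (1/2)*lambda*w^2

and setting dL_j/dw = 0 gives

    w_j* = -sum(g_i) / (sum(h_i) + lambda)

i.e. a regularized Newton step. This is optimal only for the quadratic approximation, so when the Hessians are nearly zero (as with pseudo-Huber far from the target) the step can be wildly too large, consistent with the absurd predictions reported above for base_score = 10.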

@trivialfis
Member

I tried a small random dataset; pseudo-Huber produces really small Hessians:

0.000008, 0.000003, 0.000000, 0.000001, 0.000039, 0.000002, 0.000000, 0.000023, 0.000020, 0.000009, 0.000023, 0.000008, 0.000735, 0.000013, 0.014569, 0.000003, 0.000101, 0.000001, 0.000000, 0.000036, 0.000017, 0.000001, 0.000006, 0.000010, 0.000000, 0.000032, 0.000001, 0.648050, 0.000001, 0.000000, 0.000001, 0.000089, 

This limits the step size for optimization. Implementing the slope for pseudo-Huber can improve the convergence quite significantly.

@trivialfis
Member

One way to handle these types of objectives is to revise the tree leaf values after each iteration, which I plan to work on after this release.

@Kodiologist
Contributor Author

Implementing the slope for pseudo-Huber

What does this mean? By the slope, do you mean the gradient? It looks like the gradient is already defined properly for this objective.

@Kodiologist
Contributor Author

By the way, this kind of investigation may be easier with the somewhat simpler log-cosh objective function, with gradient tanh(x) and Hessian 1/cosh(x)^2.

@trivialfis
Member

trivialfis commented Feb 22, 2022

By the slope, do you mean the gradient?

No, I meant the delta term in pseudo-Huber, which is currently set to 1.

@Kodiologist
Contributor Author

Oh, I see. Yeah, I was thinking that the δ value should be documented in the description of reg:pseudohubererror, and perhaps customizable.
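In the meantime, a customizable δ can be sketched as a custom objective, in the same style as the log-cosh example above (δ = 1 should recover the current built-in behavior; this is illustrative and untested against the C++ implementation):

pseudohuber.objective = function(delta)
    function(preds, dtrain)
       {# Pseudo-Huber loss: delta^2 * (sqrt(1 + (e/delta)^2) - 1)
        e = preds - getinfo(dtrain, "label")
        s = 1 + (e / delta)^2
        list(
            grad = e / sqrt(s),
            # A larger delta keeps the Hessian away from zero on large
            # residuals, allowing bigger Newton steps.
            hess = 1 / s^1.5)}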

@trivialfis
Member

Yeah, I have a working branch for the customizable pseudo-Huber, but I need some refactoring in #7640 first.

For the more general case where objectives produce small Hessians, I think we will try to resolve it in the next release; this is also necessary for objectives like MAE and quantile regression.

@hcho3
Collaborator

hcho3 commented Mar 22, 2022

FYI, XGBoost now lets you indicate the slope for the pseudo-Huber loss: #7727
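A minimal usage sketch with the R interface, reusing x and y from the examples above (the value 5 is arbitrary; huber_slope defaults to 1):

m = xgboost::xgboost(
    verbose = 0,
    params = list(
        objective = "reg:pseudohubererror",
        huber_slope = 5),
    data = matrix(x),
    label = y,
    nrounds = 50)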

@trivialfis trivialfis added this to Need prioritize in 2.0 Roadmap via automation Mar 31, 2022
@trivialfis trivialfis moved this from Need prioritize to 2.0 TODO in 2.0 Roadmap Mar 31, 2022
@trivialfis trivialfis moved this from 2.0 TODO to 2.0 In Progress in 2.0 Roadmap Apr 22, 2022
@hcho3 hcho3 closed this as completed May 14, 2022
2.0 Roadmap automation moved this from 2.0 In Progress to 2.0 Done May 14, 2022
@hcho3
Collaborator

hcho3 commented May 14, 2022

Closing, since the pseudo-Huber loss now supports a custom slope (huber_slope)

@Kodiologist
Contributor Author

I don't think that fixes the problem. Changing the slope just makes pseudo-Huber less like absolute error.

@hcho3
Collaborator

hcho3 commented May 14, 2022

@Kodiologist There is an experimental prototype for the L1 error: #7812. It implements an "adaptive tree" method where we can fit trees even when the Hessian is constant or not informative. So perhaps there is less need for the pseudo-Huber loss to behave like the absolute error?
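Roughly, the idea (a simplified sketch of the concept, not the actual code in #7812) is that after a tree's structure is grown, each leaf value is recomputed directly from the residuals routed to that leaf, e.g. their median for L1 loss, so the leaf value no longer depends on an informative Hessian:

# Illustrative only: recompute one leaf's value under L1 loss.
# y.leaf and pred.leaf are the labels and current predictions of the
# rows that fall into this leaf; the median residual minimizes the
# absolute error within the leaf.
adaptive.leaf.value = function(y.leaf, pred.leaf)
    median(y.leaf - pred.leaf)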

@Kodiologist
Contributor Author

Oh, very cool, I hadn't heard. I'll have to look into it.

@hcho3 hcho3 reopened this May 14, 2022
@hcho3
Collaborator

hcho3 commented May 14, 2022

Reopening this issue for now, since the L1 error is still experimental.

@trivialfis
Member

I will work on optimal initialization for L1 later. However, to fully resolve the issue for custom objectives, we need to implement #7693. I'm still not entirely sure how we should move forward regarding the interface.

@trivialfis trivialfis moved this from 2.0 Done to 2.0 In Progress in 2.0 Roadmap May 20, 2022
@Kodiologist
Contributor Author

@trivialfis I'm getting poor fits with the new reg:absoluteerror and XGBoost 18a38f7 in simple examples like the ones I've tried before with pseudo-Huber and log-cosh; see below for a complete example. Am I misconfiguring it?

rmse = function(x, y)
    sqrt(mean((x - y)^2))
mae = function(x, y)
    mean(abs(x - y))
r = function(x)
    round(x, 2)

set.seed(5)

N = 1000
x = rep(c(0L, 1L, 10L), len = N)
y = x^2 + rnorm(N)^2

message("RMSE original - ", r(rmse(y, x^2)))
message("MAE original - ", r(mae(y, x^2)))

for (loss in c("reg:squarederror", "reg:pseudohubererror", "reg:absoluteerror"))
    {m = xgboost::xgboost(
         verbose = 0,
         params = list(
             objective = loss,
             min_child_weight = 0,
             base_score = median(y)),
         data = matrix(x),
         label = y,
         nrounds = 50)
    p = predict(m, newdata = matrix(x))
    message("RMSE ", loss, " - ", r(rmse(y, p)))
    message("MAE ", loss, " - ", r(mae(y, p)))}

The result is:

RMSE original - 1.79
MAE original - 1.02
RMSE reg:squarederror - 1.46
MAE reg:squarederror - 1
RMSE reg:pseudohubererror - 1.5
MAE reg:pseudohubererror - 0.9
RMSE reg:absoluteerror - 48.54
MAE reg:absoluteerror - 28.6

@trivialfis
Member

trivialfis commented May 24, 2022

@Kodiologist Could you please try one of the tree methods: approx, hist, gpu_hist?

The adaptive tree is not implemented for the exact tree method yet (which is the default for small data).
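Concretely, the only change to your example should be adding tree_method to the params list (a sketch of the modified call; everything else stays the same):

m = xgboost::xgboost(
    verbose = 0,
    params = list(
        objective = "reg:absoluteerror",
        min_child_weight = 0,
        base_score = median(y),
        tree_method = "hist"),
    data = matrix(x),
    label = y,
    nrounds = 50)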

Using hist:

RMSE original - 1.79
MAE original - 1.02
RMSE reg:squarederror - 1.46
MAE reg:squarederror - 1
RMSE reg:pseudohubererror - 1.5
MAE reg:pseudohubererror - 0.9
RMSE reg:absoluteerror - 1.57
MAE reg:absoluteerror - 0.88

Here are a few things I need to do to complete the support for L1 error:

  • Automate the base_score=median(y) in your example.
  • Implementation for exact tree method.
  • Optimize the quantile function to avoid a sort.

@Kodiologist
Contributor Author

Thanks, that seems to work well in this case. Do you recommend hist over approx in most cases?

Automate the base_score=median(y) in your example

Intelligent selection of the default base_score could probably be generalized to other objectives, too, like base_score=mean(y) for reg:squarederror.
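Something like this hypothetical helper, applied inside the library rather than by the user (the name and dispatch here are illustrative, not an actual XGBoost API):

# Hypothetical sketch: choose the constant prediction that minimizes
# the objective's loss on the training labels.
default.base.score = function(objective, y)
    switch(objective,
        "reg:squarederror" = mean(y),     # the mean minimizes squared error
        "reg:absoluteerror" = median(y),  # the median minimizes absolute error
        0.5)                              # otherwise, the current fixed default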

@trivialfis
Member

I wrote a brief introduction to the various tree methods: https://xgboost.readthedocs.io/en/stable/treemethod.html

For constant-Hessian objectives, hist is preferred. For other objectives it's case by case; I don't have a thorough study (which I'm working on now that approx is completely rewritten).

Intelligent selection of the default base_score could probably be generalized to other objectives, too,

Yes. Thank you for the suggestion. I have a POC implementation for this and will submit a PR after the work on categorical features is completed.

@Kodiologist
Contributor Author

I've also noticed that my example problem has trouble with DART, even with tree_method = "hist", at least if the RNG is re-seeded in the inner loop:

rmse = function(x, y)
    sqrt(mean((x - y)^2))
mae = function(x, y)
    mean(abs(x - y))
r = function(x)
    round(x, 2)

set.seed(5)

N = 1000
x = rep(c(0L, 1L, 10L), len = N)
y = x^2 + rnorm(N)^2

message("RMSE original - ", r(rmse(y, x^2)))
message("MAE original - ", r(mae(y, x^2)))

for (loss in c("reg:squarederror", "reg:pseudohubererror", "reg:absoluteerror"))
     for (booster in c("gbtree", "dart"))
        {set.seed(8)
         m = xgboost::xgboost(
             verbose = 0,
             params = c(
                 list(
                     objective = loss,
                     min_child_weight = 0,
                     base_score = median(y),
                     tree_method = "hist",
                     booster = booster),
                 (if (booster == "dart") list(one_drop = 1))),
             data = matrix(x),
             label = y,
             nrounds = 50)
        p = predict(m, newdata = matrix(x))
        message("RMSE ", loss, " ", booster, " - ", r(rmse(y, p)))
        message("MAE ", loss, " ", booster, " - ", r(mae(y, p)))}

The result is:

RMSE original - 1.79
MAE original - 1.02
RMSE reg:squarederror gbtree - 1.46
MAE reg:squarederror gbtree - 1
RMSE reg:squarederror dart - 1.46
MAE reg:squarederror dart - 1
RMSE reg:pseudohubererror gbtree - 1.5
MAE reg:pseudohubererror gbtree - 0.9
RMSE reg:pseudohubererror dart - 56.17
MAE reg:pseudohubererror dart - 33.01
RMSE reg:absoluteerror gbtree - 1.57
MAE reg:absoluteerror gbtree - 0.88
RMSE reg:absoluteerror dart - 18.62
MAE reg:absoluteerror dart - 11.36

@Kodiologist
Contributor Author

@trivialfis I was excited to use reg:absoluteerror in some of my group's projects, but I again have performance problems: no longer catastrophically bad, but still worse than what can be achieved with other objectives. Here's a simplified example using some of our data. It does 5-fold cross-validation and finds worse MAE for reg:absoluteerror than reg:squarederror. I've added some tuning (selection of the lambda parameter with nested cross-validation) and it hasn't changed the situation. It's interesting that training error is better for reg:absoluteerror than reg:squarederror although test error is worse; perhaps there's a way to tune the model better so that overfitting is less of an issue. But my much fancier tuning in my real projects hasn't seemed to suffice. Maybe there's a bug.

(If you don't have the R package pbapply installed, you can replace pbapply::pbsapply with sapply to get the same result without progress bars.)

library(xgboost)

d = read.csv("example.csv")
ivs = c(
    "lon", "lat", "date.i", "aod.terra", "aod.aqua",
    "merra.pm25", "temperature.K", "vapor.pressure.Pa",
    "precipitation.mm", "atmo.height.m", "wind.u.mps", "wind.v.mps",
    "road.density.mpkm2")

set.seed(4L)
d$fold = sample.int(5, nrow(d), rep = T)
lambda.candidates = 2^seq(-10, 10, len = 11)

x.pred = function(objective, lambda, train, test)
    predict(
        xgboost(
            data = as.matrix(train[, ivs]),
            label = train$y,
            verbose = 0,
            nrounds = 50,
            objective = objective,
            base_score = median(train$y),
            tree_method = "hist",
            max_bin = 2^12,
            lambda = lambda),
        as.matrix(test[, ivs]))

for (o in list("reg:squarederror", "reg:absoluteerror"))
   {set.seed(5L)
    pred = rep(NA_real_, nrow(d))
    for (outer.fold.i in sort(unique(d$fold)))
       {message(o, " fold ", outer.fold.i)
        lambda.best = lambda.candidates[which.min(pbapply::pbsapply(
            lambda.candidates,
            function(lambda)
               {pred.tune = rep(NA_real_, nrow(d))
                for (inner.fold.i in setdiff(sort(unique(d$fold)), outer.fold.i))
                     pred.tune[d$fold == inner.fold.i] = x.pred(
                         o,
                         lambda,
                         d[!(d$fold %in% c(inner.fold.i, outer.fold.i)),],
                         d[d$fold == inner.fold.i,])
                mean(abs((pred.tune - d$y)[d$fold != outer.fold.i]))}))]
        message(o, " training MAE: ", round(d = 2, mean(abs(
            d[d$fold != outer.fold.i, "y"] - x.pred(
                o,
                lambda.best,
                d[d$fold != outer.fold.i,],
                d[d$fold != outer.fold.i,])))))
        pred[d$fold == outer.fold.i] = x.pred(
            o,
            lambda.best,
            d[d$fold != outer.fold.i,],
            d[d$fold == outer.fold.i,])}
    message(o, " test MAE: ", round(d = 2, mean(abs(pred - d$y))))}

The result (minus progress bars and progress messages) is:

reg:squarederror training MAE: 9.01
reg:squarederror training MAE: 9.33
reg:squarederror training MAE: 9.34
reg:squarederror training MAE: 9.26
reg:squarederror training MAE: 9.24
reg:squarederror test MAE: 10.42
reg:absoluteerror training MAE: 8.28
reg:absoluteerror training MAE: 8.28
reg:absoluteerror training MAE: 8.28
reg:absoluteerror training MAE: 8.4
reg:absoluteerror training MAE: 8.25
reg:absoluteerror test MAE: 10.84

Here's the input file example.csv.

@trivialfis trivialfis self-assigned this Oct 20, 2022
@Kodiologist Kodiologist changed the title Poor fit with absolute-error-like objective functions Poor out-of-sample accuracy with reg:absoluteerror Mar 31, 2023
@trivialfis
Member

Hi, sorry for the slow reply. I believe this is now addressed with the support for learning-rate scaling:

reg:squarederror fold 1
reg:squarederror training MAE: 9.01
reg:squarederror fold 2
reg:squarederror training MAE: 9.33
reg:squarederror fold 3
reg:squarederror training MAE: 9.34
reg:squarederror fold 4
reg:squarederror training MAE: 9.26
reg:squarederror fold 5
reg:squarederror training MAE: 9.24
reg:squarederror test MAE: 10.42
reg:absoluteerror fold 1
reg:absoluteerror training MAE: 9.13
reg:absoluteerror fold 2
reg:absoluteerror training MAE: 9.27
reg:absoluteerror fold 3
reg:absoluteerror training MAE: 9.28
reg:absoluteerror fold 4
reg:absoluteerror training MAE: 9.19
reg:absoluteerror fold 5
reg:absoluteerror training MAE: 9.24
reg:absoluteerror test MAE: 10.33
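For context, "learning-rate scaling" here means the recomputed (adaptive) leaf values are shrunk by eta like ordinary leaf weights rather than applied at full step size; continuing the earlier leaf sketch (illustrative only, not the actual implementation):

# The refreshed leaf value is multiplied by eta, so each tree takes a
# damped step toward the in-leaf median rather than jumping all the way.
adaptive.leaf.value = function(y.leaf, pred.leaf, eta)
    eta * median(y.leaf - pred.leaf)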

@Kodiologist
Contributor Author

@trivialfis I'm not reproducing your results on xgboost 1.7.5.1 from CRAN. Are the relevant changes newer than that, or do XGBoost's parameters need to be changed from what I put in the example to get learning-rate scaling?

@trivialfis
Member

Apologies, it's still on the master branch.

@Kodiologist
Contributor Author

Thanks, it works for me with master. Great work. The project that originally motivated me to open this issue is too far along now for me to try using MAE again, and the example works now, so I think we're good. If I later try this with a real project and I think there's still a problem with XGBoost, I'll open a new issue.

2.0 Roadmap automation moved this from 2.0 In Progress to 2.0 Done Jul 7, 2023