
Docs for custom objectives could cover more details #10105

Open
david-cortes opened this issue Mar 10, 2024 · 9 comments

Comments

@david-cortes (Contributor)

From reading the docs, it's quite unclear what exactly is or isn't supported for custom objectives in XGBoost.

For example, one might wonder after reading the docs (a minimal sketch of the custom-objective interface follows this list for context):

  • Does it support only convex objectives, or can it also solve non-convex objectives (with a potentially negative Hessian)? From some experiments, it looks like it clips negative Hessians to zero, but it would be ideal if the docs mentioned this logic.
  • Does it work with Fisher's method instead of Newton's? (It should, all the more so since it clips Hessians.) I think the docs could mention it as an alternative, especially when the objective is non-convex.
  • What is or isn't supported for multi-output objectives? I think the docs could have a big, clear disclaimer that the methods aren't meant for vector-valued objectives with dependencies between targets (e.g. multinomial logistic).
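
For reference, this is roughly the interface the questions above concern: a callable passed as `obj` to `xgb.train` that returns per-sample gradients and Hessians. A minimal single-target sketch, using a pseudo-Huber loss purely for illustration (the loss choice and constants are not from the XGBoost docs):

```python
import numpy as np
import xgboost as xgb


def pseudo_huber(predt: np.ndarray, dtrain: xgb.DMatrix):
    """Pseudo-Huber objective: returns (gradient, Hessian) arrays, one entry per sample."""
    delta = 1.0
    r = predt - dtrain.get_label()
    scale = 1.0 + (r / delta) ** 2
    grad = r / np.sqrt(scale)  # first derivative of the loss w.r.t. the raw prediction
    hess = scale ** -1.5       # second derivative; positive here, but nothing in the
                               # signature says how non-positive values would be handled
    return grad, hess


# booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=100, obj=pseudo_huber)
```

Nothing in this signature documents whether a zero or negative Hessian entry is clipped, rejected, or used as-is, which is the kind of detail the list above asks the docs to spell out.
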
@trivialfis (Member) commented Mar 11, 2024

> Does it support only convex objectives, or can it also solve non-convex objectives (with a potentially negative Hessian)?

No, it doesn't. We work around it by clipping, but there's no proof that this converges. So yes, XGBoost can run; whether the result is useful is a different question.

> Does it work with Fisher's method instead of Newton's?

Yes, you can use the expected Hessian. Again, I don't have a proof that it works, but experiments with probabilistic forecasting seem promising.

> What is or isn't supported for multi-output objectives?

As long as you can come up with an approximation of the Hessian that's 0- or 1-dimensional per sample (<= 2-dimensional for the whole dataset), it should be supported. For instance, the softmax cross entropy uses the diagonal of the Hessian.
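
To make "diagonal of the Hessian" concrete, here is a rough sketch of a custom softmax cross-entropy objective that returns one gradient/Hessian entry per sample and class. It follows the general pattern of XGBoost's custom softmax demo, but the reshaping convention (2-d vs. flattened arrays) depends on the XGBoost version, and the factor of 2 on the Hessian is the upper bound discussed later in this thread, so treat the details as assumptions to verify:

```python
import numpy as np
import xgboost as xgb


def softmax_obj(predt: np.ndarray, dtrain: xgb.DMatrix):
    """Softmax cross entropy with a per-class (diagonal) Hessian approximation."""
    labels = dtrain.get_label().astype(int)
    # Raw margins, one column per class; reshape in case they arrive flattened.
    margins = predt.reshape(labels.shape[0], -1)
    p = np.exp(margins - margins.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)

    onehot = np.zeros_like(p)
    onehot[np.arange(labels.shape[0]), labels] = 1.0

    grad = p - onehot                             # exact gradient of the cross entropy
    hess = np.maximum(2.0 * p * (1.0 - p), 1e-6)  # diagonal upper bound, not the full [k, k] Hessian
    return grad, hess


# booster = xgb.train({"num_class": n_classes, "tree_method": "hist"}, dtrain, obj=softmax_obj)
```
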

@trivialfis (Member)

I can work on these things later. Thank you for raising the issue.

@david-cortes (Contributor, Author)

> For instance, the softmax cross entropy uses the diagonal of the Hessian.

But is that the best approach for multinomial logistic? Would such a procedure be guaranteed to converge for general optimization?

Take GLMNET, for example: in its multinomial variant that solves for one class at a time, it recalculates the gradients and Hessians after every single-class Newton iteration, instead of after a full cycle over all classes as XGBoost does:
https://www.jstatsoft.org/article/view/v033i01

And in its variant that solves all classes at the same time, it uses a diagonal upper bound of the Hessian instead of the true diagonal:
https://arxiv.org/pdf/1311.6529.pdf

I did some experiments, and the diagonal Hessian does indeed seem to perform substantially better than their upper bound, but I don't know whether it has the same theoretical convergence properties.

In any event, it would still be ideal to mention in the docs that vector-valued objectives need a diagonal approximation of the Hessian, and that with the multi-tree approach the gradients and Hessians are not recalculated after every single tree update (maybe there could be an option to do so?).

@trivialfis (Member) commented Mar 11, 2024

I haven't looked into the referenced paper yet. XGBoost is actually using the diagonal multiplied by a constant 2. There's a proof based on conditional random fields that this is an upper bound; the bound is then transferred to the logistic function by removing the edge potential from the CRF.
If you look up some old related issues: XGBoost is actually optimizing an upper bound of the objective instead of the objective itself; it's just that for many convex problems we don't have to think about this.
I have been meaning to write a brief introduction to this, but it's sitting in the infinitely long backlog.

As for regression, we simply calculate the point gradient.

@david-cortes (Contributor, Author)

Actually, no need to look at the paper.

After reading it in more detail, I now notice that they use a different upper bound for the multinomial case because their least-squares algorithm only supports having the same weight/Hessian per observation across classes, while XGBoost doesn't have that limitation.

On a deeper look, they actually do derive the exact same multinomial upper bound of "twice the diagonal of the Hessian" for the more general case, which is the same one XGBoost is using.

I guess that in the general case of arbitrary objectives, the recommended procedure for Hessians with dependencies between targets could be to compute a diagonal bound by summing the absolute values row-wise in each observation's full [k, k] Hessian, arriving at an [m, k] matrix; for multinomial logistic, due to its structure, this gives the same result as multiplying the diagonal by two.
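
A small NumPy sketch of that suggestion, assuming the full per-observation [k, k] Hessian is available as a dense array (the function name is purely illustrative):

```python
import numpy as np


def diag_bound(H: np.ndarray) -> np.ndarray:
    """Diagonal upper bound of a symmetric [k, k] Hessian via row-wise absolute sums.

    diag(d) - H is weakly diagonally dominant with a non-negative diagonal, hence
    positive semi-definite, so v' diag(d) v >= v' H v for every v.
    """
    return np.abs(H).sum(axis=1)


# Stacking diag_bound(H_i) over the m observations gives the [m, k] array of
# per-sample, per-target Hessian values that a custom objective would return.
```
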

@trivialfis (Member)

I'm not sure about the intuition behind the summation suggestion yet. I'll give it more thought. Could you please elaborate?

@david-cortes (Contributor, Author) commented Mar 13, 2024

> I'm not sure about the intuition behind the summation suggestion yet. I'll give it more thought. Could you please elaborate?

It comes straight from the definition of a diagonally dominant matrix, and is what the GLMNET paper referenced in its proof.

The idea is that such a diagonal matrix $\mathbf{H}_d$ would, by definition, give an upper bound on the true Hessian $\mathbf{H}$ in a sandwiched product (and thereby in a second-order approximation of the objective function):

$$ \mathbf{v}^T \mathbf{H} \mathbf{v} \leq \mathbf{v}^T \mathbf{H}_d \mathbf{v} $$

for arbitrary $\mathbf{v}$; equivalently,

$$ \mathbf{H}_d - \mathbf{H} \succeq 0 $$

is guaranteed to hold (in particular, $\text{det}(\mathbf{H}_d - \mathbf{H}) \geq 0$).

So in the case of multinomial logistic, for a particular observation $i$ with predicted class-probability vector $\mathbf{p}$, the Hessian is $\mathbf{H}_i = \text{diag}(\mathbf{p}) - \mathbf{p} \mathbf{p}^T$.

Since $\sum_j p_j = 1$, the sum of absolute values of the off-diagonal entries in row $j$ of $\mathbf{p} \mathbf{p}^T$ is $p_j \sum_{l \neq j} p_l = p_j (1 - p_j)$, which is exactly the diagonal entry of $\mathbf{H}_i$ in that same row. Hence the row-wise absolute sums of $\mathbf{H}_i$ come out to $p_j (1 - p_j) + p_j (1 - p_j) = 2 p_j (1 - p_j)$, which is the upper bound XGBoost uses.
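
A quick numeric check of that identity (the probabilities are just an arbitrary example):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])          # softmax probabilities for one observation
H = np.diag(p) - np.outer(p, p)        # exact per-observation Hessian of the multinomial loss
row_abs_sums = np.abs(H).sum(axis=1)   # diagonal bound via row-wise absolute sums
print(np.allclose(row_abs_sums, 2 * p * (1 - p)))  # True
```
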

@trivialfis (Member)

Thank you for the detailed explanation!

@david-cortes (Contributor, Author)

It would also be ideal for the example on custom multi-target objectives to mention that what it labels "hessian" in the multinomial example is not the true Hessian but a diagonal upper bound on it; otherwise this can be quite confusing.
