Add Accelerated Failure Time loss for survival analysis task #4763

avinashbarnwal · 2019-08-12T17:32:14Z

Hi,

Please find the Accelerated Failure time loss for Survival Modeling.

Survival analysis is a form of "censored regression" where the goal is to learn a time-to-event function. It is similar to ordinary regression analysis, except that in ordinary regression all data points are uncensored. Time-to-event modeling is critical for understanding user and company behavior in areas including, but not limited to, credit, cancer, and attrition risk.

Supports:

4 kinds of datasets: left-censored, right-censored, interval-censored, and uncensored.
Normal, logistic, and extreme distributions for the underlying error distribution.

This project is part of the Google Summer of Code - 2019. AFT-Xgboost

Compact summary of AFT loss formula
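
The linked summary is not reproduced in this thread. As a rough sketch only, the standard AFT formulation is given below. The symbols are assumptions chosen for illustration and may differ from the linked document: \mathcal{T}(x) is the model output, \sigma corresponds to aft_loss_distribution_scale, and p_Z and F_Z are the density and CDF of the chosen error distribution, with the conventions F_Z(s(0)) = 0 and F_Z(s(\infty)) = 1.

```latex
% Model: log survival time = model output + scaled noise
\ln Y = \mathcal{T}(x) + \sigma Z

% Standardized residual for a candidate time t
s(t) = \frac{\ln t - \mathcal{T}(x)}{\sigma}

% Loss for an uncensored label y
\ell_{\text{uncensored}} = -\ln\!\left[ \frac{p_Z\big(s(y)\big)}{\sigma\, y} \right]

% Loss for a label known only to lie in [y_{\text{lower}}, y_{\text{upper}}]
\ell_{\text{censored}} = -\ln\!\left[ F_Z\big(s(y_{\text{upper}})\big) - F_Z\big(s(y_{\text{lower}})\big) \right]
```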

Relevant Documents

Example in Python:

import xgboost

# Censored labels are passed as a pair of bounds rather than a single label.
dtrain = xgboost.DMatrix(X)
dtrain.set_float_info("label_lower_bound", y_lower)
dtrain.set_float_info("label_upper_bound", y_higher)

dtest = xgboost.DMatrix(X_val)
dtest.set_float_info("label_lower_bound", y_lower_val)
dtest.set_float_info("label_upper_bound", y_higher_val)

params = {
    "learning_rate": 0.1,
    "objective": "survival:aft",
    "eval_metric": "aft-nloglik",
    "aft_loss_distribution": "normal",
    "aft_loss_distribution_scale": 1.0,
}

res = {}  # filled with the per-round evaluation history
bst = xgboost.train(params, dtrain, num_boost_round=100,
                    evals=[(dtrain, "train"), (dtest, "test")],
                    evals_result=res)
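
The example above leaves y_lower and y_higher undefined. The sketch below shows one way the four kinds of labels could be encoded as a pair of bounds. The helper censoring_bounds and the sample observations are hypothetical, and the encoding (equal bounds for uncensored, +inf for right-censored, 0 for left-censored) is an assumption about the intended convention rather than something stated in this description.

```python
import math


def censoring_bounds(kind, t_low=None, t_high=None):
    """Return (lower, upper) label bounds for one observation.

    Hypothetical helper: the encoding below is an assumed convention,
    not taken from the PR description.
    """
    if kind == "uncensored":
        return (t_low, t_low)        # exact time known: both bounds equal
    if kind == "right":
        return (t_low, math.inf)     # event happened some time after t_low
    if kind == "left":
        return (0.0, t_high)         # event happened some time before t_high
    if kind == "interval":
        return (t_low, t_high)       # event happened between the two times
    raise ValueError(f"unknown censoring kind: {kind}")


# Made-up observations, one per censoring kind.
observations = [
    ("uncensored", 2.0, None),
    ("right", 3.0, None),
    ("left", None, 4.0),
    ("interval", 5.0, 6.0),
]

bounds = [censoring_bounds(kind, lo, hi) for kind, lo, hi in observations]
y_lower = [b[0] for b in bounds]
y_higher = [b[1] for b in bounds]
```

The resulting y_lower and y_higher lists can then be passed to set_float_info as in the example.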

For more details - avinashbarnwal#1.

Note: as part of this PR, the Metric class became a subclass of the Configurable interface.

src/common/survival_util.cc

hcho3 · 2019-08-12T18:43:58Z

@avinashbarnwal In the PR description, can you add a short one-paragraph description of what survival analysis is? Something like:

Survival analysis is a new kind of learning task in which we would like to predict the time to a certain event. The time-to-event labels are often censored, i.e. we only know the interval in which the label falls and do not know its exact value. See https://eng.uber.com/modeling-censored-time-to-event-data-using-pyro/ for a real-world example.

tdhock · 2019-08-12T18:48:23Z

survival analysis is a new kind of learning task where we would like to predict a time to certain event

I would describe it as "censored regression" or more specifically "regression with censored outputs" because the goal is still to learn a (real-valued) regression function; this emphasizes the similarity with usual regression, where all outputs are un-censored.

hcho3 · 2019-08-12T18:52:22Z

Also add a short Python example to the description:

dtrain = xgboost.DMatrix(X)
dtrain.set_float_info("label_lower_bound", y_lower)
dtrain.set_float_info("label_upper_bound", y_higher)
    
dtest = xgboost.DMatrix(X_test)
dtest.set_float_info("label_lower_bound", y_lower_test)
dtest.set_float_info("label_upper_bound", y_higher_test)
    
bst = xgboost.train(params, dtrain, num_boost_round=100,
                    evals=[(dtrain,"train"), (dtest,"test")])

hcho3 · 2019-08-12T20:58:26Z

@tdhock Thanks for your suggestion. Yes, "censored regression" sounds reasonable.

trivialfis · 2019-08-14T02:26:56Z

I'm not familiar with survival models and have only skimmed the survey. Are there other recommended materials that concentrate on the theoretical part? ;-)

avinashbarnwal · 2019-08-14T04:04:37Z

Hi @trivialfis,

Here are some good lecture notes for learning survival modeling: https://www4.stat.ncsu.edu/~dzhang2/st745/index.html.

One motivating book: https://www.amazon.com/Applied-Survival-Analysis-Time-Event/dp/0471754994.

Prof. @tdhock and @hcho3 might be able to give better references on the theory of survival modeling.

tdhock · 2019-08-14T04:07:45Z

would be good if @avinashbarnwal could write a latex/PDF vignette in the xgboost R pkg describing the loss functions that he implemented

tdhock · 2019-08-14T04:08:56Z

they are the same as in R's survival::survreg, there are some docs on that man page, but the math formulas come from http://members.cbio.mines-paristech.fr/~thocking/survival.pdf

trivialfis · 2019-08-14T04:50:47Z

@avinashbarnwal @tdhock Thanks for the good references. Will try to catch up.

avinashbarnwal · 2019-08-14T13:00:49Z

Hi Prof. @tdhock and @hcho3,

I will start writing up the loss functions in a LaTeX/PDF vignette for the xgboost R pkg.

avinashbarnwal · 2019-08-15T14:28:32Z

Hi Prof. @tdhock,

Please let me know if it is fine to make the vignette like this one: https://cran.r-project.org/web/packages/xgboost/vignettes/xgboostfromJSON.html.

tdhock · 2019-08-15T21:25:19Z

typically for vignettes with lots of math I prefer writing Rnw source which is rendered to tex / pdf. It is possible to include simple math in Rmd which is rendered on a web page using mathjax, but in my experience complex equations (e.g. optimization problems) do not render well on web pages.

Examples of both are here: https://github.com/tdhock/PeakSegDisk/tree/master/vignettes

https://github.com/tdhock/PeakSegDisk/blob/master/vignettes/Examples.Rnw has some equations including an optimization problem and it is compiled to pdf.
https://github.com/tdhock/PeakSegDisk/blob/master/vignettes/Worst_case.Rmd#L48 has some simple equations and is compiled to html

include/xgboost/data.h

avinashbarnwal · 2019-08-24T05:52:56Z

Hi Prof. @tdhock and @hcho3,

Please find the R vignette below and let me know your thoughts.
http://rpubs.com/avinashbarnwal123/aft

hcho3 · 2019-08-26T16:35:38Z

@tdhock Do the datasets follow a log-normal AFT distribution? The errors are not decreasing when we choose log-logistic or log-Weibull. See http://rpubs.com/avinashbarnwal123/aft

tdhock · 2019-08-26T17:48:50Z

they are real data sets so we don't know their "true" distribution. However in previous experience with linear models, I have observed that a loss function with quadratic tails (like the normal distribution) works better than linear tails (like the logistic)
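
To illustrate what "quadratic tails" versus "linear tails" means, here is a minimal sketch comparing the negative log-density of the standard normal and standard logistic distributions far from zero. The function names are hypothetical, and this only illustrates the shape of the two losses; it says nothing about the results on these datasets.

```python
import math


def normal_nll(z):
    """Negative log-density of the standard normal at z."""
    return 0.5 * z * z + 0.5 * math.log(2.0 * math.pi)


def logistic_nll(z):
    """Negative log-density of the standard logistic at z.

    The density is symmetric, so |z| is used to keep exp() from overflowing.
    """
    a = abs(z)
    return a + 2.0 * math.log1p(math.exp(-a))


# Going from z=4 to z=8 adds 0.5 * (64 - 16) = 24 to the normal loss
# (quadratic growth) but only about 4 to the logistic loss (linear growth).
normal_growth = normal_nll(8.0) - normal_nll(4.0)
logistic_growth = logistic_nll(8.0) - logistic_nll(4.0)

for z in (1.0, 2.0, 4.0, 8.0):
    print(f"z={z:4.1f}  normal={normal_nll(z):8.3f}  logistic={logistic_nll(z):8.3f}")
```

Under the normal loss, a prediction far from the label is penalized much more heavily than under the logistic loss.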

avinashbarnwal · 2019-08-26T21:00:24Z

Hi Prof. @tdhock and @hcho3,

I have updated the vignette: http://rpubs.com/avinashbarnwal123/aft. It works for the last dataset, H3K36me3_AM_immune. Please check the last fold. This might not be clear because of the scale. It works for both Logistic and Extreme. I think we need more datasets like this one, where it works.

avinashbarnwal · 2020-03-20T13:35:43Z

@avinashbarnwal I fixed the bug. See commit e33fab1. Also note that you don't need to add @10,normal suffix to the metric name aft-nloglik.

Thanks. I will change the code accordingly for our paper.

hcho3 · 2020-03-21T00:31:12Z

@trivialfis I added a demo, as you requested. A tutorial is available. Feel free to try it out.

hcho3 · 2020-03-21T00:32:08Z

Rendered output of the tutorial:

hcho3 · 2020-03-21T00:35:24Z

tests/cpp/objective/test_aft_obj.cc

+    { 0.0384f, 0.0624f, 0.0997f, 0.1551f, 0.2316f, 0.3254f, 0.4200f, 0.4861f, 0.4962f, 0.4457f,
+      0.3567f, 0.2601f, 0.1772f, 0.1152f, 0.0726f, 0.0449f, 0.0275f, 0.0167f, 0.0101f, 0.0061f });
+  CheckGPairOverGridPoints(obj.get(), 100.0f, 100.0f, "extreme",
+    { -0.0000f, -29.0026f, -17.0031f, -9.8028f, -5.4822f, -2.8897f, -1.3340f, -0.4005f, 0.1596f,


@avinashbarnwal FYI, I applied the regularization scheme to the uncensored case as well, and now I'm getting a zero gradient here, where previously we'd get something like -50.0. I'm still looking at ways to avoid INF and NAN (in general) without strange behavior like this. For this example, clamping the gradient to a reasonable quantity like -30.0 would be a lot better than giving 0.0. I'll come back to this soon.

I’ve merged this PR for now. I’ll file a follow-up PR to make AFT more robust in edge cases like this.
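
As a minimal sketch of the clamping idea mentioned above (not the implementation in this PR or in any follow-up), a gradient can be limited to a fixed magnitude so that very large or infinite values become finite without collapsing to zero. The name clamp_gradient is hypothetical, and the default limit of 30.0 is taken from the figure quoted in the comment.

```python
import math


def clamp_gradient(grad, limit=30.0):
    """Clamp a gradient to [-limit, limit].

    Hypothetical helper for illustration only. NaN is returned unchanged,
    since deciding what to substitute for NaN is a separate question.
    """
    if math.isnan(grad):
        return grad
    return max(-limit, min(limit, grad))


# A gradient of about -50.0 is clamped to -30.0 instead of being zeroed out;
# values already inside the range pass through unchanged.
clamped_large = clamp_gradient(-50.0)
clamped_small = clamp_gradient(0.1596)
clamped_inf = clamp_gradient(-math.inf)
```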

hcho3 · 2020-03-25T09:34:04Z

I added a toy example to visualize how XGBoost responds to censored labels:

trivialfis

LGTM! This is exciting.

hcho3 · 2020-03-25T20:54:27Z

Merged. Thanks everyone!

avinashbarnwal mentioned this pull request Aug 12, 2019

Summary of GSOC2019 calls avinashbarnwal/xgboost#1

tdhock reviewed Aug 12, 2019
