[R-package] Keep row names in output from `predict` #4977

david-cortes · 2022-01-23T19:24:04Z

Calling predict on a lightgbm model object will produce a result without row names. If the data on which predictions are being made has row names, chances are that the user will want to keep those in the prediction output. This PR keeps the row names in the prediction output when the input data has them.

david-cortes · 2022-01-27T22:03:50Z

Failing CI checks on mac are due to a dependency that fails to install and not related to the changes in this PR.

jameslamb

Thanks very much for this proposal.

I don't support adding this feature. I'm not convinced that the (in my opinion) slight convenience is worth the additional code complexity and maintenance burden.

I also want to be especially protective of predict() code paths, since I expect user code creating predictions to be much more latency-sensitive than training code. (related conversation: #4909 (review)).

We can get some other opinions though. @mayer79 @StrikerRUS @Laurae2 if any of you are interested, could you give your opinion on this proposed change?

david-cortes · 2022-02-28T03:56:53Z

The latency added by this feature is around 30 microseconds in my setup (~40 when objects have names), which is very low compared to the time it takes to make a single-row prediction which can reach up to a few miliseconds depending on the choice of hyperparameters and arguments (and lower than the overhead added from syntax-related coding choices such as usage of R6).

StrikerRUS · 2022-03-01T00:11:51Z

I'm sorry, I very far away from the R-lang world, so can't write thoughtful comment here. If keeping row names is something like a standard behavior for other R packages, then we can add this feature. If not, then I agree with @jameslamb .

david-cortes · 2022-03-01T03:28:29Z

Row names are kept in base R and most recommended/core packages for decision trees (e.g. rpart, randomForest, etc.).

Did a bit of testing with popular non-core packages, and found that many decision-tree packages do not keep them (e.g. ranger, randomForestSRC), although their output is not exactly a numeric vector/matrix in the same way as in lightgbm or base R.

OTOH tried a few packages for linear models off the top of my mind, and all of them kept the row names (glmnet, picasso, ncvreg, grpreg, glmx, VGAM, lme4).

jameslamb · 2022-03-13T03:45:17Z

Sorry for the delayed response, I recently returned from traveling.

Thanks for checking other packages! Since projects like {randomForest}, {rpart}, {stats}, and the linear model projects you mentioned support this functionality, I'm convinced that it's standard enough that {lightgbm} should support it too.

However, predict() in {lightgbm} is more complex than most of those other packages...it doesn't only produce predicted values of the target, but is also used (for example) to get feature contributions or information about which leaves records match.

I'd only feel confident taking on this behavior if unit tests were added confirming that row names are handled correctly for every combination of the following:

row names in newdata (has rownames, does not have rownames)
reshape (TRUE or FALSE)
newdata (matrix, dgCMatrix, or filepath to a CSV)

task in the model (regression, binary classification, multiclass classification)

this is important because the output shape is different for multiclass classification

LightGBM/R-package/R/lgb.Booster.R

Lines 739 to 745 in 9a4e706

    
           #' @return For regression or binary classification, it returns a vector of length \code{nrows(data)}. 
        
           #'         For multiclass classification, either a \code{num_class * nrows(data)} vector or 
        
           #'         a \code{(nrows(data), num_class)} dimension matrix is returned, depending on 
        
           #'         the \code{reshape} value. 
        
           #' 
        
           #'         When \code{predleaf = TRUE}, the output is a matrix object with the 
        
           #'         number of columns corresponding to the number of trees.

type of output requested (rawscore, predleaf, predcontrib, none of those)

If you have time and interest in adding such tests, I'd support this PR.

If not, then I think this PR should be closed and we could convert it to a feature request to be added to #2302 and picked up at some point in the future.

david-cortes · 2022-03-23T20:25:21Z

Added more tests, but left out CSV tests because predicting on them is currently broken.

david-cortes · 2022-03-23T21:00:10Z

The failing checks are due to the vignette. It fails to make a prediction at L75:

summary(predict(fit, X))

Error in numeric(npred) : vector size cannot be NA

Probably coming from here:

LightGBM/R-package/R/lgb.Predictor.R

Line 151 in 6b56a90

preds <- numeric(npred)

It is not related to this PR since the error happens before reaching the code that was changed.

david-cortes · 2022-03-23T21:02:53Z

The failing checks are due to the vignette. It fails to make a prediction at L75:
summary(predict(fit, X))
Error in numeric(npred) : vector size cannot be NA
Probably coming from here:

LightGBM/R-package/R/lgb.Predictor.R

Line 151 in 6b56a90

preds <- numeric(npred)

It is not related to this PR since the error happens before reaching the code that was changed.

Nevermind this comment, there was another change I didn't notice in the merge conflict.

jameslamb

Thanks very much for adding tests and documenting the CSV issue in #5093.

The tests made me realize that for input data with row names, the value of reshape isn't sufficient to determine whether or not the output should have row names.

reshape's value only matters for multi-class classification...for regression and binary classification tasks, the number of rows in the output from predict() should be the same as the number of rows in newdata, so for those objectives row names should be preserved regardless of the value of reshape.

Could you please do the following:

move the code you've added into predict.lgb.Booster() into Predictor$predict()
ensure that for binary classification and regression, row names are always preserved

For reference, here's the relevant code in Predictor$predict() that is figuring out whether or not the predictions have the same number of elements as the input data has rows.

LightGBM/R-package/R/lgb.Predictor.R

Lines 215 to 231 in 17d4e00

    
           # Get number of cases per row 
        
           npred_per_case <- length(preds) / num_row 
        
           # Data reshaping 
        
           if (predleaf | predcontrib) { 
        
             # Predict leaves only, reshaping is mandatory 
        
             preds <- matrix(preds, ncol = npred_per_case, byrow = TRUE) 
        
           } else if (reshape && npred_per_case > 1L) { 
        
             # Predict with data reshaping 
        
             preds <- matrix(preds, ncol = npred_per_case, byrow = TRUE) 
        
           }

jameslamb · 2022-03-27T05:02:04Z

R-package/tests/testthat/test_Predictor.R

@@ -111,3 +113,117 @@ test_that("start_iteration works correctly", {
    pred_leaf2 <- predict(bst, test$data, start_iteration = 0L, num_iteration = end_iter + 1L, predleaf = TRUE)
    expect_equal(pred_leaf1, pred_leaf2)
 })
+
+test_that("predictions keep row names from the data", {


these tests look great, thank you! To speed up debugging if they break due to changes in future PRs, can you please break them into 3 test cases?

"predict() keeps row names from data (binary classification)"

"predict() keeps row names from data (multi-class classification)"

"predict() keeps row names from data (regression)"

To avoid duplicating test helper code and to make it clear that those methods were written just for these tests, please move them out of the test case and rename them to .expect_has_row_names() and .expect_doesnt_have_row_names().

R-package/tests/testthat/test_Predictor.R

david-cortes · 2022-03-27T14:38:21Z

Updated.

jameslamb

Thanks very much for the changes! But please see my suggestion about simplifying the test code. In its current state, the indirection with .expect_row_names_kept() makes it difficult to understand what is being tested.

jameslamb · 2022-03-28T03:55:36Z

R-package/tests/testthat/test_Predictor.R

+    if (!multiclass || NROW(X) == NROW(pred)) {
+        .expect_has_row_names(pred, X)
+    } else {
+        .expect_doesnt_have_row_names(pred)
+    }


I'm really struggling to understand these test cases, since it requires reasoning about what was passed in to .expect_row_names_kept() from within each of the test_that(...) calls, and since these if statements rely on the fact that pred here happens to reference the predictions produced with reshape = FALSE and none of rawscore / predleaf / predcontrib.

To make this easier to follow (both for reviewing and for future contributors), can you please remove the use of this .expect_row_names_kept() function and instead explicitly include checks in each of the test_that(...) calls? Those checks should not need to use NROW(pred) at all. For test code, I'd prefer some code duplication in exchange for it being easier to understand what is being tested.

Not sure what you want me to do there. That pred does reference predictions with different combinations of reshape / rawscore / predcontrib / predleaf (all predictions generated there are named pred).

You asked me to add tests for different values of reshape, but reshape might get ignored depending on the rest of the parameters, and thus whether the output is meant to have row names or not, depends on the shape of what's actually returned in the end by the prediction function. If I were to adjust the tests by hard-coding the reshape logic in them, they then would test both the row names and the reshape logic together, which I don't think is ideal. The most logical way of testing the row names in a way that's independent of reshape would be by checking the number of rows in the data and the predictions.

they then would test both the row names and the reshape logic together, which I don't think is ideal

"hard-coding the reshape logic", i.e. not changing the test expectation based on a condition like NROW(X) == NROW(pred), is exactly what I'm asking for.

We know, for example, that for reshape = FALSE and a multi-class objective, the output of predict() is expected to have NROW(X) * number_of_classes elements, and therefore shouldn't have row names. It's desirable for the tests to break if that changes.

As a maintainer on this project, I believe I'm able to push commits onto branches from forks associated with open PRs. Would you like me to just directly push the testing changes I'm proposing to this branch?

Yes, would be better if you modify the tests yourself.

Ok no prob! I'll push a commit in the next day or two.

Updated in 04b743f.

Thankfully, #4971 helped simplify these tests (and therefore users' code). Thanks for that!

jameslamb

Very nice change, thanks for proposing it and for helping by adding such comprehensive tests!

jameslamb · 2022-04-04T15:59:28Z

@jmoralez could you review this whenever you have time? I think it's good to be merged, but since I pushed some commits to this PR, I don't want to click "merge" without another review.

jmoralez

Could the testing logic

    # dense matrix with row names
    pred <- predict(bst, X)
    .expect_has_row_names(pred, X)
    pred <- predict(bst, X, rawscore = TRUE)
    .expect_has_row_names(pred, X)
    pred <- predict(bst, X, predleaf = TRUE)
    .expect_has_row_names(pred, X)
    pred <- predict(bst, X, predcontrib = TRUE)
    .expect_has_row_names(pred, X)

    # dense matrix without row names
    Xcopy <- X
    row.names(Xcopy) <- NULL
    pred <- predict(bst, Xcopy)
    .expect_doesnt_have_row_names(pred)

    # sparse matrix with row names
    Xcsc <- as(X, "CsparseMatrix")
    pred <- predict(bst, Xcsc)
    .expect_has_row_names(pred, Xcsc)
    pred <- predict(bst, Xcsc, rawscore = TRUE)
    .expect_has_row_names(pred, Xcsc)
    pred <- predict(bst, Xcsc, predleaf = TRUE)
    .expect_has_row_names(pred, Xcsc)
    pred <- predict(bst, Xcsc, predcontrib = TRUE)
    .expect_has_row_names(pred, Xcsc)

    # sparse matrix without row names
    Xcopy <- Xcsc
    row.names(Xcopy) <- NULL
    pred <- predict(bst, Xcopy)
    .expect_doesnt_have_row_names(pred)

be extracted into a function?

david-cortes · 2022-04-04T18:13:28Z

Could the testing logic

    # dense matrix with row names
    pred <- predict(bst, X)
    .expect_has_row_names(pred, X)
    pred <- predict(bst, X, rawscore = TRUE)
    .expect_has_row_names(pred, X)
    pred <- predict(bst, X, predleaf = TRUE)
    .expect_has_row_names(pred, X)
    pred <- predict(bst, X, predcontrib = TRUE)
    .expect_has_row_names(pred, X)

    # dense matrix without row names
    Xcopy <- X
    row.names(Xcopy) <- NULL
    pred <- predict(bst, Xcopy)
    .expect_doesnt_have_row_names(pred)

    # sparse matrix with row names
    Xcsc <- as(X, "CsparseMatrix")
    pred <- predict(bst, Xcsc)
    .expect_has_row_names(pred, Xcsc)
    pred <- predict(bst, Xcsc, rawscore = TRUE)
    .expect_has_row_names(pred, Xcsc)
    pred <- predict(bst, Xcsc, predleaf = TRUE)
    .expect_has_row_names(pred, Xcsc)
    pred <- predict(bst, Xcsc, predcontrib = TRUE)
    .expect_has_row_names(pred, Xcsc)

    # sparse matrix without row names
    Xcopy <- Xcsc
    row.names(Xcopy) <- NULL
    pred <- predict(bst, Xcopy)
    .expect_doesnt_have_row_names(pred)

be extracted into a function?

Well, it was in a function before...

jameslamb · 2022-04-04T18:14:29Z

could the testing logic be extracted into a function

@jmoralez I specifically wanted to NOT do that (#4977 (comment)) to make it more explicit exactly what the expectation is for each of these settings.

For this many combinations of possibilities, I think having each test be self-contained is preferable to the indirection introduced by putting that logic into a function.

jmoralez · 2022-04-04T18:27:19Z

I think with the reshape argument it was a bit harder but now it's identical, isn't it? I actually spent some time trying to figure out whether or not something was different in one of them.

jameslamb · 2022-04-04T18:30:10Z

I think with the reshape argument it was a bit harder but now it's identical, isn't it? I actually spent some time trying to figure out whether or not something was different in one of them.

Fair enough, you're right that with reshape stuff removed since my original comment, there's no more if branching in this, so the checks are currently identical.

I'll push a commit right now moving it back into a function. I don't want to hold up this PR just for a style consideration like this that we can change in the future.

jameslamb · 2022-04-04T18:36:17Z

Ok @jmoralez , updated in ec1c802

github-actions · 2023-08-23T00:21:55Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

keep row names in prediction output

d4e5a02

david-cortes requested review from jameslamb and Laurae2 as code owners January 23, 2022 19:24

david-cortes added 3 commits January 27, 2022 17:01

fix test

c962a92

fix test

8a4d71f

fix test

d2470c2

jameslamb mentioned this pull request Jan 30, 2022

[ci] [R-package] macOS builds failing to install R #4988

Closed

jameslamb added awaiting review feature labels Feb 4, 2022

jameslamb requested changes Feb 28, 2022

View reviewed changes

jameslamb removed the awaiting review label Feb 28, 2022

jameslamb requested a review from jmoralez March 13, 2022 03:45

jameslamb added the awaiting response label Mar 19, 2022

more tests for row names

6e574f2

solve merge conflict

2f25779

david-cortes mentioned this pull request Mar 23, 2022

When header=0, LGBM_BoosterPredictForFile() called on a CSV with column names raises a process-crashing error #5093

Open

redo badly solved merge conflict

6ee372e

jameslamb requested changes Mar 27, 2022

View reviewed changes

comments

716ce60

update docs

dd327e7

jameslamb requested changes Mar 28, 2022

View reviewed changes

StrikerRUS removed the awaiting response label Mar 28, 2022

david-cortes and others added 4 commits April 3, 2022 18:34

solve merge conflict

521c65b

missing closing parentheses

2782379

Merge branch 'master' into Rinterface8

42e660f

explicitly list assertions in each test

04b743f

jameslamb approved these changes Apr 4, 2022

View reviewed changes

david-cortes mentioned this pull request Apr 4, 2022

[R-package] Add sparse feature contribution predictions #5108

Merged

jameslamb added the awaiting review label Apr 4, 2022

jmoralez reviewed Apr 4, 2022

View reviewed changes

jameslamb added 2 commits April 4, 2022 13:31

Merge branch 'master' into Rinterface8

896629b

move assertions back into shared function

ec1c802

jmoralez approved these changes Apr 4, 2022

View reviewed changes

jameslamb removed the awaiting review label Apr 4, 2022

jameslamb merged commit f860282 into microsoft:master Apr 5, 2022

This was referenced Apr 5, 2022

[ci] R macOS CI jobs failing on 'brew install basictex' #5128

Closed

[R-package] ensure values in params override keyword arguments to predict() (fixes #4670) #5122

Merged

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R-package] Keep row names in output from `predict` #4977

[R-package] Keep row names in output from `predict` #4977

david-cortes commented Jan 23, 2022

david-cortes commented Jan 27, 2022

jameslamb left a comment

david-cortes commented Feb 28, 2022

StrikerRUS commented Mar 1, 2022

david-cortes commented Mar 1, 2022

jameslamb commented Mar 13, 2022

david-cortes commented Mar 23, 2022

david-cortes commented Mar 23, 2022 •

edited

david-cortes commented Mar 23, 2022

jameslamb left a comment

jameslamb Mar 27, 2022

david-cortes commented Mar 27, 2022

jameslamb left a comment

jameslamb Mar 28, 2022

david-cortes Mar 28, 2022

jameslamb Mar 30, 2022

david-cortes Mar 30, 2022

jameslamb Mar 30, 2022

jameslamb Apr 4, 2022

jameslamb left a comment

jameslamb commented Apr 4, 2022

jmoralez left a comment

david-cortes commented Apr 4, 2022

jameslamb commented Apr 4, 2022

jmoralez commented Apr 4, 2022

jameslamb commented Apr 4, 2022

jameslamb commented Apr 4, 2022

github-actions bot commented Aug 23, 2023

	# Get number of cases per row
	npred_per_case <- length(preds) / num_row


	# Data reshaping

	if (predleaf \| predcontrib) {

	# Predict leaves only, reshaping is mandatory
	preds <- matrix(preds, ncol = npred_per_case, byrow = TRUE)

	} else if (reshape && npred_per_case > 1L) {

	# Predict with data reshaping
	preds <- matrix(preds, ncol = npred_per_case, byrow = TRUE)

	}

[R-package] Keep row names in output from predict #4977

[R-package] Keep row names in output from predict #4977

Conversation

david-cortes commented Jan 23, 2022

david-cortes commented Jan 27, 2022

jameslamb left a comment

Choose a reason for hiding this comment

david-cortes commented Feb 28, 2022

StrikerRUS commented Mar 1, 2022

david-cortes commented Mar 1, 2022

jameslamb commented Mar 13, 2022

david-cortes commented Mar 23, 2022

david-cortes commented Mar 23, 2022 • edited

david-cortes commented Mar 23, 2022

jameslamb left a comment

Choose a reason for hiding this comment

jameslamb Mar 27, 2022

Choose a reason for hiding this comment

david-cortes commented Mar 27, 2022

jameslamb left a comment

Choose a reason for hiding this comment

jameslamb Mar 28, 2022

Choose a reason for hiding this comment

david-cortes Mar 28, 2022

Choose a reason for hiding this comment

jameslamb Mar 30, 2022

Choose a reason for hiding this comment

david-cortes Mar 30, 2022

Choose a reason for hiding this comment

jameslamb Mar 30, 2022

Choose a reason for hiding this comment

jameslamb Apr 4, 2022

Choose a reason for hiding this comment

jameslamb left a comment

Choose a reason for hiding this comment

jameslamb commented Apr 4, 2022

jmoralez left a comment

Choose a reason for hiding this comment

david-cortes commented Apr 4, 2022

jameslamb commented Apr 4, 2022

jmoralez commented Apr 4, 2022

jameslamb commented Apr 4, 2022

jameslamb commented Apr 4, 2022

github-actions bot commented Aug 23, 2023

[R-package] Keep row names in output from `predict` #4977

[R-package] Keep row names in output from `predict` #4977

david-cortes commented Mar 23, 2022 •

edited