Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Remove reshape argument in predict #4971

Merged
merged 3 commits into from
Apr 1, 2022

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Jan 23, 2022

ref #4968

When requesing a prediction type with several entries per row, the prediction function in the R package by default produces outputs as a 1-D vector in row-major order. This can be quite confusing and lead to user errors when used externally, especially since it is not documented that the output is row-major while R assumes that conversions to matrices are to be done assuming column-major order and uses vectors in column-major order in built-in operators such as +, *, [, or [<- and functions such as sweep.

What's more, base R's stats package and many core R packages for decision tree ensembles such as ranger or randomForest produce multi-output predictions as a matrix. Examples:

library(ranger)
data(iris)
m <- ranger(Species ~ ., data = iris)
p <- predict(m, iris, type="terminalNodes")$predictions
dim(p)
[1] 150 500
library(randomForest)
m <- randomForest(Species ~ ., data=iris)
p <- attr(predict(m, iris, nodes=TRUE), "nodes")
dim(p)
[1] 150 500

This PR changes the default towards reshape=TRUE and adds a note about the storage order in the documentation.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for writing this up!

I'm convinced that {lightgbm} should produce a (num_observations, num_classes) matrix for multi-class classification, based on your examples of {randomForest} and {ranger}.

Instead of silently changing this behavior by altering a default, I think for the next release {lightgbm} should just remove the reshape argument from predict.lgb.Booster() and Predictor$predict() entirely. I cannot think of a reason that getting a vector would be preferable to getting a matrix like this:

# A (30x3) matrix with the predictions, use parameter reshape
# class1 class2 class3
# obs1 obs1 obs1
# obs2 obs2 obs2
# .... .... ....

Removing reshape entirely would:

  • make {lightgbm}'s behavior more consistent with other statistical packages in R (including R's standard library)
  • break existing user code in a loud way with error that makes it clear what has change
  • alleviate the inconsistency you've pointed out of the vector produced by predict() being in row-major order when R assumes column-major order
  • simplify the codebase

Before making any changes like that, let's get some opinions from others.

@Laurae2 @mayer79 @StrikerRUS @shiyu1994 @guolinke what do you think about the following proposed breaking changes for the R package?

  • remove argument reshape in lgb.predict.Booster() and Predictor$predict()
  • if reshape is found in ... for lgb.predict.Booster(), raise an informative error explaining that that parameter is no longer supported
  • for predictions like those from multi-class classification, where more than one number is produced per row in the input data, have lgb.predict.Booster() and Predictor$predict() return a numeric matrix of shape (num_observations, num_classes)

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Mar 30, 2022

what do you think about the following proposed breaking changes for the R package?

I support this.

What about similar is_reshape argument in the Python package?..

is_reshape : bool, optional (default=True)
Whether to reshape to (nrow, ncol).

@david-cortes david-cortes changed the title [R-package] Change default for reshape argument in predict [R-package] Remove reshape argument in predict Mar 30, 2022
@david-cortes
Copy link
Contributor Author

Updated with a removal of the argument instead.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, and for being thorough by updating the docs and demos as well. I'm excited to see the prediction logic be simplified and become less-surprising at the same time 🎉

I think this will also reduce the complexity of the changes + tests in #4977.

I left a few suggestions below. Please also add a new unit test, at a minimum one like "predictions for multi-class classification are returned as a matrix with the expected dimensions". Every time user-facing behavior is changed, tests should be added to ensure that future development doesn't accidentally break that behavior.

Comment on lines 810 to 812
if ("reshape" %in% names(additional_params)) {
warning("'reshape' argument is no longer supported.")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ("reshape" %in% names(additional_params)) {
warning("'reshape' argument is no longer supported.")
}
if ("reshape" %in% names(additional_params)) {
stop"'reshape' argument is no longer supported.")
}

As mentioned in #4971 (review), please make this an error, not a warning, so that users of the package are notified of this change loudly, right here at the point where they need to change something.

For breaking changes like this, I have a strong preference for them being loud and close to the source of the error, to avoid the case where this shows up somewhere further downstream in users' code (e.g., in some subsetting logic that expected a vector and is now getting a matrix).


# We can also get the predicted scores before the Sigmoid/Softmax application
my_preds <- predict(model, test[, 1L:4L], rawscore = TRUE)

# Raw score predictions as matrix instead of vector
my_preds <- predict(model, test[, 1L:4L], rawscore = TRUE, reshape = TRUE)
my_preds <- predict(model, test[, 1L:4L], rawscore = TRUE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is now identical to the one above it, since the only thing that was difference about it was the use of reshape = TRUE. Please remove this line and its comment.


# We can also get the leaf index
my_preds <- predict(model, test[, 1L:4L], predleaf = TRUE)

# Predict leaf index as matrix instead of vector
my_preds <- predict(model, test[, 1L:4L], predleaf = TRUE, reshape = TRUE)
my_preds <- predict(model, test[, 1L:4L], predleaf = TRUE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is now identical to the one above it, since the only thing that was difference about it was the use of reshape = TRUE. Please remove this line and its comment.

@jameslamb
Copy link
Collaborator

I support this.
What about similar is_reshape argument in the Python package?..

Thanks for the quick feedback @StrikerRUS ! I'd love to see similar changes in the Python package (though not in this PR, please), but would want to do some research to see if is_reshape is used in more places there. Want me to write it up in a feature request issue?

@StrikerRUS
Copy link
Collaborator

Want me to write it up in a feature request issue?

I would really appreciate it!

@david-cortes
Copy link
Contributor Author

Updated.

@jameslamb jameslamb self-requested a review April 1, 2022 01:06
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great to me. Thanks very much for the tests. I'm very happy to see this simplification in the prediction code.

@jmoralez , want a chance to review now that you're back?

@jameslamb
Copy link
Collaborator

@StrikerRUS I created #5115 to capture the work for making a similar change on the Python side.

Thankfully the default there was to always reshape, so I think the change there will be less disruptive to users (since I expect that the most common behavior people want is for multi-class classification and other cases like predcontrib to produces a 2D matrix).

@jameslamb
Copy link
Collaborator

Thanks again for the help, @david-cortes !

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants