Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] Accept factor labels and use their levels #5341

Merged
merged 23 commits into from
Feb 14, 2023

Conversation

david-cortes
Copy link
Contributor

ref #4968

Typically, modeling packages for R switch between regression and classification according to what they are passed for the response variable (example packages doing this: randomForest, ranger, e1071, caret, gbm, etc.), and typically, they require classification models to be passed factor type labels even if they do not auto-determine the objective (e.g. glm, glmnet, etc.) and use the levels of this factor when returning predictions.

This PR changes the lightgbm() interface to play along with factors:

  • Allows passing factor types for the response variable.
  • By default, determines the objective according to the type of the response variable.
  • Determines the number of classes from the response variable.
  • Uses the factor levels for class predictions and for column names in multi-output predictions.

There is one unresolved issue here however: if the prediction type is passed through params, there can be some issues if e.g. a user requests leaf predictions and the number of trees is equal to the number of classes. I left it as a warning note in the documentation but perhaps there's some better workaround that wouldn't involve looking through aliases.

@jameslamb
Copy link
Collaborator

Ok @david-cortes sorry for the VERY long delay! We've had some significant challenges with maintainer availability over the last 9 months.

Are you still interested in pursuing this PR? If you are, please update it with the latest changes from master and I'll happily provide a review. I'm generally supportive of this change, I think it's a really nice usability improvement for R users.

@david-cortes
Copy link
Contributor Author

Solved merge conflicts.

@david-cortes
Copy link
Contributor Author

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the excellent work 🎉

I appreciate the thorough tests, and really like the implementation you came up with that has a minimal impact on the package's public interface.

I left some suggestions for changes, but overall I'm supportive of the way this is structured.

@simonpcouch @mayer79 do either of you have time to help us with a review?

And of course @jmoralez I'd love your opinions if you have time.

R-package/R/aliases.R Outdated Show resolved Hide resolved
R-package/R/lgb.DataProcessor.R Outdated Show resolved Hide resolved
R-package/R/lightgbm.R Outdated Show resolved Hide resolved
R-package/tests/testthat/test_basic.R Outdated Show resolved Hide resolved
R-package/R/lgb.DataProcessor.R Outdated Show resolved Hide resolved
R-package/R/lgb.DataProcessor.R Outdated Show resolved Hide resolved
R-package/R/lgb.DataProcessor.R Show resolved Hide resolved
R-package/R/lightgbm.R Show resolved Hide resolved
R-package/R/lgb.Booster.R Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator

The failing job is unrelated to this PR

Yeah sorry about that. It's a known flaky job: #5677 (comment). Sorry about that.

You can safely ignore any failures in the Azure DevOps *_gpu jobs for this PR.

@jameslamb
Copy link
Collaborator

(retagging @simonpcouch because there was a typo in my review message and i'm not sure if editing an @ in a comment sends a notification)

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work, thank you so much!

And thanks for improving the detection of binary vs. multiclass objectives with .BINARY_OBJECTIVES() and .MULTICLASS_OBJECTIVES(), that's helpful even for the cases where label isn't a factor.

Just one more suggestion (#5341 (comment)), and otherwise I'm good with this.

Copy link

@simonpcouch simonpcouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these additions are definitely an improvement and am on board for the changes requested in James’ review!

bonsai interacts with the lgb.train() interface, so (I believe) this won't end up being an issue for us if you decide to ignore, but one comment:

R-package/R/lgb.DataProcessor.R Show resolved Hide resolved
@jameslamb jameslamb self-requested a review February 14, 2023 01:57
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@jmoralez would you like a chance to review this?

@jameslamb
Copy link
Collaborator

Thanks @jmoralez for helping review this, @david-cortes for the excellent contribution, and @mayer79 and @simonpcouch for always being willing to provide thoughtful reviews when we ask for your help 😊

@jameslamb jameslamb merged commit c676a7e into microsoft:master Feb 14, 2023
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants