Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] User-friendly redesign for lightgbm() #4968

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

david-cortes
Copy link
Contributor

ref #4295

This PR redesigns the R interface for lightgbm() so as to make it similar to those of other packages and base R's, while keeping the old lgb.train interface as-is. This redesign includes changes such as:

  • Allowing a formula interface as well as an X/y interface with non-standard evaluation for arguments.
  • Allowing data.frame inputs and handling the encoding of categorical variables in the X/y interface.
  • Auto-detecting binary/multiclass classification according to the type of the target variable.
  • Renaming function arguments to make them more in line with what is used in core R packages.
  • Minor details such as ensuring that the prediction outputs have appropriate row and colum names.
  • Using base-1 indexing in the outputs.
  • Changing default arguments to match those of Python (e.g. number of threads).
  • Expanding a bit the documentation.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for your continued interest in LightGBM!

However, I'm not willing to review this pull request in its current form.

As we have politely mentioned to you in the past, such large pull requests with multiple unrelated changes are very difficult to review. Some examples: #4207 (review), #4685 (review), .

As we've also politely mentioned to you several times in the past, making comments like "similar to those of other packages" or "more in line with what is used in core R packages" without providing reference links or specific evidence unnecessarily increases the effort required to review such proposals. Some examples: #4692 (comment), #4692 (review), #4207 (comment), #4812 (review), #4597 (comment), #4681 (comment).

This pull request is proposing several very large breaking changes to the R package's interface, each of which requires careful consideration and conversation about what the benefits of those changes are and whether they are worth the cost of breaking users' existing code. We have an obligation to LightGBM's existing users to understand and carefully consider changes to this project, especially changes that break existing code, and can't do that effectively for so many changes at one time.

We really appreciate all of your help with LightGBM, but need to be protective of the limited time that this project's small group of mostly-volunteer maintainers can devote to maintaining the project.


Could you please make each of the related sets of changes a separate pull request or proposal in an issue, which could be reviewed and discussed independently, and in each please provide specific evidence describing the benefit of the proposed change?

Looking through these changes tonight, I see the following sets of changes which should each be their own pull request / issue:

  1. Moving parameters objective, threads, and metric out of params list in examples and tests and passing it through keyword arguments.
  2. Other changes in tests that move parameters out of params and into keyword arguments of lightgbm::lightgbm().
  3. Changes to argument names in predict.lgb.Booster().
  4. Changing lightgbm::lightgbm() from a regular function to a set of S3 methods that dispatch on the type of data (WITHOUT introducing the new formula support or changing any argument names).
  5. Adding support for formulas.
  6. Automatically setting categorical_features based on data frame column types (as part of [R-package] Accept data frames as inputs #4323).
  7. Handling factor columns in data.frame inputs (as part of [R-package] Accept data frames as inputs #4323)
  8. Changing argument names data and label to X and y in lightgbm::lightgbm().
  9. non-standard evaluation of arguments using eval.parent(), deparse1(), substitute(), etc.
  10. adding column names to predictions
  11. addition of a "class" type of prediction, where predict() returns the predicted class instead of probabilities for classification tasks

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Jan 23, 2022

@david-cortes Thank you so much for your huge work!

@jameslamb I absolutely agree with you that this PR should be spit into smaller ones. I just want to add that open one big PR with all changes first is a good idea because maintainers can assess overall idea and understand a final state that the proposing changes move the project into. Big PR can be treated as a playground for initial and general discussions (demonstrative alternative to original feature request). Also, it's good to test with CI how all proposed connected changes fit together.

See for example dmlc/xgboost#7545 (comment) or dmlc/xgboost#7309 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants