Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autologging functionality for scikit-learn integration with LightGBM (Part 1) #5130

Merged
merged 7 commits into from Dec 24, 2021

Conversation

jwyyy
Copy link
Contributor

@jwyyy jwyyy commented Nov 30, 2021

What changes are proposed in this pull request?

This PR enables the support for saving and loading all LightGBM models.

(Draft + discussion: #4296, #4885. Template: #4954)

How is this patch tested?

New test methods are added.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

This PR will enable saving and loading LightGBM models, including sklearn models, with model class specification.
Functions save_model() / load_model() in mlflow.lightgbm can be used as before.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • [] rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@github-actions github-actions bot added area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs. labels Nov 30, 2021
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@jwyyy
Copy link
Contributor Author

jwyyy commented Nov 30, 2021

Hi @dbczumar @harupy, I made the first PR to support autologging for LightGBM sklearn model. Unlike XGBoost sklearn models, LightGBM sklearn models don't support save_model() and load_model() currently. It needs to be done by obtaining internal Boosters first. So I made a tentative plan by (1) re-organizing the lightgbm folder (2) and adding utils methods to handle LightGBM sklearn model saving / loading. We can revise the logic once microsoft/LightGBM#4841 is addressed. To enable loading sklearn models, we need to load Booster / sklearn model parameters. Currently this is not fully supported either. (An ongoing PR microsoft/LightGBM#4802 is trying to fix this for Boosters.) As a workaround, this PR saves Booster / sklearn model parameters in JSON. When loading back models, these parameters will be read in to recover the attributes of saved models.

Please let me your feedback and suggestions when you have time to review it! Thank you so much!

@jwyyy
Copy link
Contributor Author

jwyyy commented Dec 13, 2021

Hi @dbczumar @harupy, it seems the LightGBM community doesn't have a plan to implement save_model() and load_model for sklearn models (it's been a while since I opened the issue but no conclusion was reached). They recommended to use pickle or joblib, which is inconsistent with Booster saving / loading but same as the current implementation of model saving in mlflow.sklearn. Please let me know what your thoughts and comments are. I will revise the current PR. Thanks a lot!

Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwyyy Excellent work! Apologies for the delayed response due to a busy holiday season. Regarding your question about serialization / deserialization, I think it's probably best to use pickle for now. This is less likely to break across LightGBM versions. Can we test out serialization / deserialization with custom objective functions (see docs for eval_metric here: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.fit) and make sure that pickle works successfully? If it doesn't, we may want to use cloudpickle.

mlflow/lightgbm/__init__.py Outdated Show resolved Hide resolved
return None


def _save_lgb_model(lgb_model, model_path) -> None:
Copy link
Collaborator

@dbczumar dbczumar Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following from #5130 (comment), I think it's probably safest to use pickle to save / load the model, since the LightGBM developers could make breaking changes that break the serialization / deserialization code. Thank you so much for taking the time to reach out to the LightGBM community and get some insight into the recommended best practices here.

Can we test out serialization / deserialization with custom objective functions (see docs for eval_metric here: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.fit) and make sure that pickle works successfully? If it doesn't, we may want to use cloudpickle.

@jwyyy
Copy link
Contributor Author

jwyyy commented Dec 14, 2021

Hi @dbczumar, no worries! Thank you for your review and feedback! I will update the PR accordingly.

Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@jwyyy
Copy link
Contributor Author

jwyyy commented Dec 16, 2021

Hi @dbczumar @harupy, I made some changes to the PR. For serialization / deserialization of LightGBM sklearn models, I used cloudpickle instead of pickle. Custom objective functions for the argument eval_metric work fine with pickle, because eval_metric is only called during training and not assigned to any class member. However, pickle doesn't work with the custom objective (Python lambda functions) in LightGBM sklearn model initialization. Would love to learn your feedback and comments. Thanks!

@@ -73,7 +74,7 @@ def get_default_pip_requirements():
Calls to :func:`save_model()` and :func:`log_model()` produce a pip environment
that, at minimum, contains these requirements.
"""
return [_get_pinned_requirement("lightgbm")]
return [_get_pinned_requirement("lightgbm"), _get_pinned_requirement("cloudpickle")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add cloudpickle conditionally because users who don't use scikit-learn estimators don't need cloudpickle?

Copy link
Contributor Author

@jwyyy jwyyy Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @harupy, thank you for your suggestion! Does that mean we also need to provide an option to turn on / off autologging for scikit-learn estimators? I assumed mlflow.lightgbm.autolog() enables autologging for all models.

Copy link
Contributor Author

@jwyyy jwyyy Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @harupy, I found a simple way to add cloudpickle conditionally (and automatically) based on what model is saved (please see L169-171). Please let me know your feedback and comments. Thanks a lot!

(This comment is also addressed in the latest commit.)

Comment on lines 136 to 139
if isinstance(lgb_model, lgb.Booster):
model_data_subpath = "model.lgb"
else:
model_data_subpath = "model.pkl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isinstance(lgb_model, lgb.Booster):
model_data_subpath = "model.lgb"
else:
model_data_subpath = "model.pkl"
model_data_subpath = "model.lgb" if isinstance(lgb_model, lgb.Booster) else "model.pkl"

nit

Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@jwyyy jwyyy requested a review from harupy December 22, 2021 17:41
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Copy link
Member

@harupy harupy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

@jwyyy
Copy link
Contributor Author

jwyyy commented Dec 23, 2021

@harupy Thank you for your review!

@harupy harupy merged commit 4c657b3 into mlflow:master Dec 24, 2021
@jwyyy jwyyy deleted the lightgbm_save_load branch December 24, 2021 06:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants