
Autologging functionality for scikit-learn integration with XGBoost (Part 2) #5055

Closed · wants to merge 7 commits

Conversation

@jwyyy (Contributor) commented Nov 12, 2021

What changes are proposed in this pull request?

This is the second PR to add autologging for XGBoost scikit-learn models, reusing the mlflow.sklearn autologging routine.

(Previous PR: #4954)

(Draft + discussion: #4885)

How is this patch tested?

A new example is provided. Tests will be added later.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Successful merging of this PR will enable autologging for XGBoost scikit-learn models.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@@ -365,7 +378,7 @@ def log_model(
         # log model
         mlflow.sklearn.log_model(sk_model, "sk_models")
     """
-    return Model.log(
+    Model.log(
@jwyyy (author):
It seems Model.log() doesn't return any value. Maybe we can remove the return.



def _autolog(
    flavor_name=FLAVOR_NAME,
@jwyyy (author):
This is the internal API for sklearn autologging. The flavor_name argument lets mlflow.xgboost specify the xgboost_sklearn flavor, preventing a flavor conflict with mlflow.sklearn.
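
A minimal sketch of the two call sites this enables (signatures and defaults here are illustrative assumptions, not the PR's exact code):

# mlflow/sklearn/__init__.py (sketch)
FLAVOR_NAME = "sklearn"

def autolog(log_models=True, disable=False, silent=False):
    # Public sklearn entry point: keeps the default sklearn flavor name.
    _autolog(flavor_name=FLAVOR_NAME, log_models=log_models, disable=disable, silent=silent)

# mlflow/xgboost.py (sketch)
import mlflow.sklearn

def autolog(log_models=True, disable=False, silent=False):
    # A distinct flavor name keeps these patches from colliding with the
    # ones registered by mlflow.sklearn.autolog().
    mlflow.sklearn._autolog(
        flavor_name="xgboost_sklearn", log_models=log_models, disable=disable, silent=silent
    )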


def _mlflow_xgboost_logging(
    importance_types, autologging_client, logger, original, sklearn_estimator, *args, **kwargs,
):
@jwyyy (author):
Reorganizes the early-stopping callbacks and the feature-importance plot. This function is reused in mlflow.sklearn for logging XGBoost sklearn estimators.
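
For context, a standalone sketch of the feature-importance half of this helper (function name and structure are illustrative, not the PR's exact code; for a sklearn estimator the Booster comes from model.get_booster()):

import matplotlib.pyplot as plt
import mlflow

def log_feature_importance_plots(booster, importance_types, tmpdir):
    # One bar plot per importance type, saved locally and logged as a run artifact.
    for imp_type in importance_types:
        importance = booster.get_score(importance_type=imp_type)
        fig, ax = plt.subplots()
        ax.barh(list(importance.keys()), list(importance.values()))
        ax.set_title("Feature importance ({})".format(imp_type))
        path = "{}/feature_importance_{}.png".format(tmpdir, imp_type)
        fig.savefig(path)
        plt.close(fig)
        mlflow.log_artifact(path)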


safe_patch_call_count = (
safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count
)
@jwyyy (author):
Since mlflow.sklearn._autolog() is called inside mlflow.xgboost, we need to count the safe_patch calls made when sklearn autologging is enabled.
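
i.e., something along these lines in the test (a sketch; expected_patch_count is a hypothetical placeholder):

# mlflow.xgboost patches xgb.train directly, while its nested
# mlflow.sklearn._autolog() call patches the sklearn estimator methods,
# so both mocks contribute to the total.
safe_patch_call_count = (
    safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count
)
assert safe_patch_call_count == expected_patch_count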

@jwyyy commented Nov 12, 2021

Hi @harupy @dbczumar, I made a new PR to complete the autologging functionality for XGBoost sklearn estimators. It is based on our previous discussion #4885. I left a few comments in the PR to highlight the changes:

  1. Early-stopping callbacks and feature-importance plots are reorganized inside the internal function _mlflow_xgboost_logging(). This function stays in mlflow.xgboost and is not moved to a new utils file.
  2. _autolog() is the internal API for sklearn autologging. It is called by both mlflow.sklearn.autolog() and mlflow.xgboost.autolog(). However, using the sklearn flavor name in both cases causes flavor-name conflicts, so a new flavor name, xgboost_sklearn, is introduced to resolve them.
  3. A short example is provided (a minimal usage sketch follows below). Test cases will be added and the docs revised once we finalize the PR.
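
The usage sketch mentioned in point 3 (minimal and illustrative; not the example file itself):

import mlflow
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

mlflow.xgboost.autolog()

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # An XGBoost *sklearn* estimator: with this PR, fit() is autologged
    # through the mlflow.sklearn routine under the xgboost_sklearn flavor.
    model = xgb.XGBClassifier(n_estimators=20)
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)])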

Please correct me if I missed anything. Also please let me know your feedback and suggestions! Thanks a lot!

@jwyyy commented Nov 13, 2021

Regarding the tests, I was trying to integrate the XGBoost sklearn estimator tests with the existing tests: change

# current 
expected_params = {"num_boost_round": 20, "early_stopping_rounds": 5, "verbose_eval": False}
xgb.train(bst_params, dtrain, evals=[(dtrain, "train")], **expected_params)

to something like

# new
def xgb_train(mode, bst_params, data, other_kwargs):
    if mode == "xgboost_sklearn":
        # build and fit an XGBoost sklearn model from bst_params and other_kwargs
        ...
    else:
        # mode == "xgboost"
        # return xgb.train(...)
        ...

# inside a test function
xgb_train(mode, bst_params, data, other_kwargs)

but the integration could be messy in this way.

Not all parameters passed to xgboost.train() are used to initialize XGBoost sklearn models. In fact, some are passed to the fit() method; i.e., with parameters bst_params and kwargs, we can do

xgboost.train(bst_params, dtrain, **kwargs)

but

xgb_sklearn_model = xgboost.XGBClassifier(**bst_params, **kwargs)
xgb_sklearn_model.fit(X, y) # X, y from dtrain

is not error-free in general.

Here is an example:

  1. eval_metric is passed to fit() for XGBoost sklearn models in xgboost < 1.6.0.
  2. In xgboost.train(), we can set num_boost_round, but it becomes n_estimators in XGBoost sklearn models.
  3. evals and evals_result are not acceptable arguments for XGBoost sklearn estimators.
    (In fact, evals is constructed inside fit() from the eval_set argument.)
  4. The correct configuration for an xgboost.XGBClassifier (< 1.6.0) in this case is:
xgb_classifier = xgb.XGBClassifier(objective="multi:softprob", num_class=3, n_estimators=20)
xgb_classifier.fit(X, y, eval_metric=["merror", "mlogloss"], eval_set=[(X1, y1), (X2, y2)])
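
If we keep the integrated approach, the divergence above suggests a small translation helper, e.g. (hypothetical; the helper name and mapping table are mine, not from the PR):

# Hypothetical helper: split xgb.train()-style parameters into constructor
# kwargs and fit() kwargs for the sklearn API (xgboost < 1.6.0).
TRAIN_TO_SKLEARN = {"num_boost_round": "n_estimators"}
FIT_ONLY = {"eval_metric", "eval_set"}

def split_params(bst_params, kwargs):
    init_kwargs, fit_kwargs = dict(bst_params), {}
    for name, value in kwargs.items():
        if name in FIT_ONLY:
            fit_kwargs[name] = value
        else:
            init_kwargs[TRAIN_TO_SKLEARN.get(name, name)] = value
    return init_kwargs, fit_kwargs

For the example above, split_params({"objective": "multi:softprob", "num_class": 3}, {"num_boost_round": 20, "eval_metric": ["merror", "mlogloss"]}) would map num_boost_round to n_estimators and route eval_metric to fit().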

@harupy @dbczumar Should we keep pursuing the integration approach, or would it be better to create new, separate tests for XGBoost sklearn models? What are your opinions / suggestions? Thanks!
