
Autologging functionality for scikit-learn integration with XGBoost (Part 2) #5055

Closed · wants to merge 7 commits

Conversation

@jwyyy (Contributor) commented Nov 12, 2021

What changes are proposed in this pull request?

This is the second PR to add autologging for XGBoost scikit-learn models, reusing the mlflow.sklearn autologging routine.

(Previous PR: #4954)

(Draft + discussion: #4885)

How is this patch tested?

A new example is provided. Tests will be added later.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Successful merging of this PR will enable autologging for XGBoost scikit-learn models.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@@ -365,7 +378,7 @@ def log_model(
         # log model
         mlflow.sklearn.log_model(sk_model, "sk_models")
     """
-    return Model.log(
+    Model.log(
@jwyyy (author):
It seems Model.log() doesn't return any value. Maybe we can remove the return.



def _autolog(
    flavor_name=FLAVOR_NAME,
@jwyyy (author):
This is the internal API for sklearn autologging. The flavor_name argument lets mlflow.xgboost specify the xgboost_sklearn flavor, preventing a flavor conflict with mlflow.sklearn.
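
A minimal sketch of the two call sites this enables (signatures and defaults here are illustrative assumptions, not the PR's exact code):

# mlflow/sklearn/__init__.py (sketch)
FLAVOR_NAME = "sklearn"

def autolog(log_models=True, disable=False, silent=False):
    # Public sklearn entry point: keeps the default sklearn flavor name.
    _autolog(flavor_name=FLAVOR_NAME, log_models=log_models, disable=disable, silent=silent)

# mlflow/xgboost.py (sketch)
import mlflow.sklearn

def autolog(log_models=True, disable=False, silent=False):
    # A distinct flavor name keeps these patches from colliding with the
    # ones registered by mlflow.sklearn.autolog().
    mlflow.sklearn._autolog(
        flavor_name="xgboost_sklearn", log_models=log_models, disable=disable, silent=silent
    )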


def _mlflow_xgboost_logging(
    importance_types, autologging_client, logger, original, sklearn_estimator, *args, **kwargs,
):
@jwyyy (author):
Reorganizes the early-stopping callbacks and the feature-importance plot. This function is reused in mlflow.sklearn for logging XGBoost sklearn estimators.
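
For context, a standalone sketch of the feature-importance half of this helper (function name and structure are illustrative, not the PR's exact code; for a sklearn estimator the Booster comes from model.get_booster()):

import matplotlib.pyplot as plt
import mlflow

def log_feature_importance_plots(booster, importance_types, tmpdir):
    # One bar plot per importance type, saved locally and logged as a run artifact.
    for imp_type in importance_types:
        importance = booster.get_score(importance_type=imp_type)
        fig, ax = plt.subplots()
        ax.barh(list(importance.keys()), list(importance.values()))
        ax.set_title("Feature importance ({})".format(imp_type))
        path = "{}/feature_importance_{}.png".format(tmpdir, imp_type)
        fig.savefig(path)
        plt.close(fig)
        mlflow.log_artifact(path)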


safe_patch_call_count = (
safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count
)
@jwyyy (author):
Since mlflow.sklearn._autolog() is called inside mlflow.xgboost, we need to count the safe_patch calls made when sklearn autologging is enabled.
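
i.e., something along these lines in the test (a sketch; expected_patch_count is a hypothetical placeholder):

# mlflow.xgboost patches xgb.train directly, while its nested
# mlflow.sklearn._autolog() call patches the sklearn estimator methods,
# so both mocks contribute to the total.
safe_patch_call_count = (
    safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count
)
assert safe_patch_call_count == expected_patch_count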

@jwyyy commented Nov 12, 2021

Hi @harupy @dbczumar, I made a new PR to complete the autologging functionality for XGBoost sklearn estimators. It is based on our previous discussion #4885. I left a few comments in the PR to highlight the changes:

  1. Early-stopping callbacks and feature-importance plots are reorganized inside the internal function _mlflow_xgboost_logging(). This function stays in mlflow.xgboost and is not moved to a new utils file.
  2. _autolog() is the internal API for sklearn autologging. It is called by both mlflow.sklearn.autolog() and mlflow.xgboost.autolog(). However, using the sklearn flavor name in both cases causes flavor-name conflicts, so a new flavor name, xgboost_sklearn, is introduced to resolve them.
  3. A short example is provided (a minimal usage sketch follows below). Test cases will be added and the docs revised once we finalize the PR.
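
The usage sketch mentioned in point 3 (minimal and illustrative; not the example file itself):

import mlflow
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

mlflow.xgboost.autolog()

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # An XGBoost *sklearn* estimator: with this PR, fit() is autologged
    # through the mlflow.sklearn routine under the xgboost_sklearn flavor.
    model = xgb.XGBClassifier(n_estimators=20)
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)])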

Please correct me if I missed anything. Also please let me know your feedback and suggestions! Thanks a lot!

@jwyyy commented Nov 13, 2021

Regarding the tests, I was trying to integrate the XGBoost sklearn estimator tests with the existing tests: change

# current 
expected_params = {"num_boost_round": 20, "early_stopping_rounds": 5, "verbose_eval": False}
xgb.train(bst_params, dtrain, evals=[(dtrain, "train")], **expected_params)

to something like

# new
def xgb_train(mode, bst_params, data, other_kwargs):
    if mode == "xgboost_sklearn":
        # build and fit an XGBoost sklearn model from bst_params and other_kwargs
        ...
    else:
        # mode == "xgboost"
        # return xgb.train(...)
        ...

# inside a test function
xgb_train(mode, bst_params, data, other_kwargs)

but the integration could be messy in this way.

Not all parameters passed to xgboost.train() are used to initialize XGBoost sklearn models. In fact, some are passed to the fit() method; i.e., with parameters bst_params and kwargs, we can do

xgboost.train(bst_params, dtrain, **kwargs)

but

xgb_sklearn_model = xgboost.XGBClassifier(**bst_params, **kwargs)
xgb_sklearn_model.fit(X, y) # X, y from dtrain

is not error-free in general.

Here is an example:

  1. eval_metric is passed to fit() for XGBoost sklearn models in xgboost < 1.6.0.
  2. In xgboost.train(), we can set num_boost_round, but it becomes n_estimators in XGBoost sklearn models.
  3. evals and evals_result are not acceptable arguments for XGBoost sklearn estimators.
    (In fact, evals is constructed inside fit() from the eval_set argument.)
  4. The correct configuration for an xgboost.XGBClassifier (< 1.6.0) in this case is:
xgb_classifier = xgb.XGBClassifier(objective="multi:softprob", num_class=3, n_estimators=20)
xgb_classifier.fit(X, y, eval_metric=["merror", "mlogloss"], eval_set=[(X1, y1), (X2, y2)])
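
If we keep the integrated approach, the divergence above suggests a small translation helper, e.g. (hypothetical; the helper name and mapping table are mine, not from the PR):

# Hypothetical helper: split xgb.train()-style parameters into constructor
# kwargs and fit() kwargs for the sklearn API (xgboost < 1.6.0).
TRAIN_TO_SKLEARN = {"num_boost_round": "n_estimators"}
FIT_ONLY = {"eval_metric", "eval_set"}

def split_params(bst_params, kwargs):
    init_kwargs, fit_kwargs = dict(bst_params), {}
    for name, value in kwargs.items():
        if name in FIT_ONLY:
            fit_kwargs[name] = value
        else:
            init_kwargs[TRAIN_TO_SKLEARN.get(name, name)] = value
    return init_kwargs, fit_kwargs

For the example above, split_params({"objective": "multi:softprob", "num_class": 3}, {"num_boost_round": 20, "eval_metric": ["merror", "mlogloss"]}) would map num_boost_round to n_estimators and route eval_metric to fit().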

@harupy @dbczumar Should we keep pursuing the integration approach, or would it be better to create new, separate tests for XGBoost sklearn models? What are your opinions / suggestions? Thanks!
