
Autologging functionality for scikit-learn integration with XGBoost (Part 2) #5078

Merged
merged 15 commits into from Nov 29, 2021

Conversation

@jwyyy (Contributor) commented Nov 17, 2021

What changes are proposed in this pull request?

This is the second PR to add autologging for XGBoost sklearn models using mlflow.sklearn autologging routine.

(Previous PR: #4954)

(Draft + discussion: #4885)

How is this patch tested?

A new example is provided. Tests will be added later.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Successful merge of this PR will enable autologging for XGBoost scikit-learn models.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@github-actions bot added labels area/examples, area/tracking, and rn/documentation on Nov 17, 2021
@jwyyy (Contributor, Author) commented Nov 17, 2021

Hi @dbczumar @harupy, I created a new PR and closed #5055 due to the recent merge #5039. (The conflict was too complex for GitHub to resolve automatically.) When you have time, please take a look and let me know your initial review. Then I can continue to work on improvements. Thanks!

@dbczumar (Collaborator) commented:
@jwyyy Thank you for your updates. I'll take a look first thing tomorrow!

@dbczumar (Collaborator) left a comment:
@jwyyy This is awesome! I left a few comments - let me know if you have any questions!

Review threads (resolved): mlflow/sklearn/__init__.py, mlflow/xgboost/__init__.py (×3), examples/xgboost_sklearn/train_sklearn.py (×2)
remove additional lines
@jwyyy (Contributor, Author) commented Nov 23, 2021

Hi @dbczumar, I made some updates in this PR based on your review. The main change is what you suggested here. We reuse the train() method in mlflow.xgboost.autolog() and patch it to xgboost.sklearn.train() [L683] (the train() method is called inside XGBoost sklearn models' fit()). Some minor changes are made to enable logging tags that are not logged in mlflow.xgboost.autolog().

Please let me know your ideas and suggestions when you have time to review it again. Thank you very much!
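The approach jwyyy describes can be sketched in miniature. This is a hypothetical illustration (not MLflow's actual safe_patch machinery) of why patching the module-level train() also intercepts the call that fit() makes internally; all names are made up for the example.

```python
# Hypothetical sketch: patching a module-level train() intercepts the
# internal call made by fit(), because fit() resolves train() through
# the module object at call time.
import types

def _train(params):
    """Stands in for xgboost.sklearn.train()."""
    return {"status": "trained", **params}

# Stand-in for the xgboost.sklearn module.
xgb_sklearn = types.SimpleNamespace(train=_train)

class XGBRegressorLike:
    """Mimics an XGBoost sklearn estimator whose fit() delegates to train()."""
    def fit(self, params):
        # A patched module attribute is picked up automatically here.
        self._Booster = xgb_sklearn.train(params)
        return self

logged_params = []

def patched_train(params):
    logged_params.append(dict(params))  # autologging side effect
    return _train(params)               # then defer to the original

# Patch, roughly as mlflow.xgboost.autolog() would via safe_patch.
xgb_sklearn.train = patched_train

model = XGBRegressorLike().fit({"eta": 0.1})
```

The key point is that fit() never needs to change: the interception happens at the module attribute it looks up.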

@jwyyy jwyyy requested a review from dbczumar November 23, 2021 17:30
@@ -0,0 +1,48 @@
from pprint import pprint

Collaborator comment:
Awesome example! Can we add a brief README to this directory explaining what this example covers? E.g. Usage of XGBoost's scikit-learn integration with MLflow Tracking, particularly autologging?

Comment on lines 1250 to 1253
# params of xgboost sklearn models are logged in train() in mlflow.xgboost.autolog()
if flavor_name == FLAVOR_NAME:
    _log_posttraining_metadata(autologging_client, self, *args, **kwargs)
    autologging_client.flush(synchronous=True)
Collaborator comment:

@jwyyy Instead of special casing XGBoost logic in fit_mlflow, can we define a new method called fit_mlflow_xgboost that just calls original(self, *args, **kwargs) and then logs self using mlflow.xgboost.log_model()? This will also allow us to revert changes to XGBoost autologging's train() method, since we can control how the model gets logged here.

We can then add a parameter to patched_fit (`def patched_fit(original, self, *args, **kwargs):`) to specify either fit_mlflow (for sklearn models) or fit_mlflow_xgboost (for xgboost sklearn models). Perhaps we can call this parameter fit_fn.

Let me know if you have questions here!
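The fit_fn dispatch proposed above could look roughly like this. The function bodies are placeholder assumptions (list appends instead of real MLflow logging calls) so the control flow is visible; this is not MLflow's actual implementation.

```python
# Sketch of the proposed fit_fn dispatch; logging calls are replaced
# with list appends. All bodies are assumptions, not MLflow's code.
logged = []

def fit_mlflow(original, self, *args, **kwargs):
    fitted = original(self, *args, **kwargs)
    logged.append("mlflow.sklearn.log_model")   # placeholder for sklearn logging
    return fitted

def fit_mlflow_xgboost(original, self, *args, **kwargs):
    fitted = original(self, *args, **kwargs)
    logged.append("mlflow.xgboost.log_model")   # placeholder for xgboost logging
    return fitted

def patched_fit(fit_fn, original, self, *args, **kwargs):
    # fit_fn selects the flavor-specific logging routine
    return fit_fn(original, self, *args, **kwargs)

class DummyModel:
    def fit(self, X):
        self.fitted_ = True
        return self

m = DummyModel()
patched_fit(fit_mlflow_xgboost, DummyModel.fit, m, [[0.0], [1.0]])
```

Passing the unbound DummyModel.fit as `original` mirrors how a patch receives the pre-patch implementation.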

Contributor (Author) reply:
@dbczumar Thank you for your suggestion! There is a small issue with model logging in the train() method when calling fit(). The fit() method calls train() internally and assigns a Booster object to the internal _Booster attribute of an XGBoost sklearn model [see L1331]. The current train() in mlflow.xgboost.autolog() logs the model before returning it, which means (1) the logged model is a Booster object, and (2) we cannot log an XGBoost sklearn model before the trained Booster is assigned to _Booster. The changes in mlflow.xgboost.autolog() try to log sklearn models directly. We definitely can log sklearn models in fit_mlflow_xgboost(), but it is extra work, because the model is already logged when train() is called, and fit_mlflow_xgboost() would just log new information to replace the old. However, I also think adopting your suggestion makes the code logic easier to read. Please let me know which solution sounds better to you. Thank you!

@dbczumar (Collaborator) replied Nov 23, 2021:
@jwyyy Ah, thanks for letting me know! Can we decompose train() into two methods - one for parameter, metric, & non-model artifact logging, and one for model logging? We can then use the former method to patch xgboost.sklearn.train().
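The decomposition suggested above can be sketched as follows. The helper names and bodies are hypothetical; the point is only the split: pre-model logging is reusable for patching xgboost.sklearn.train(), while the native train() patch keeps Booster logging.

```python
# Hypothetical sketch of splitting train() into pre-model logging and
# model logging. Dicts stand in for MLflow logging side effects.
def _log_params_metrics_artifacts(params):
    """Params, metrics, and non-model artifacts: reusable by both patches."""
    return {"params": dict(params)}

def _log_model(model):
    """Model logging, applied to whichever object should be persisted."""
    return {"model": model}

def native_train_patch(params):
    # Patches xgboost.train(): log everything, including the Booster.
    record = _log_params_metrics_artifacts(params)
    booster = "Booster"                 # stands in for the trained Booster
    record.update(_log_model(booster))
    return record

def sklearn_train_patch(params):
    # Patches xgboost.sklearn.train(): skip model logging so the sklearn
    # estimator can be logged after fit() assigns _Booster.
    return _log_params_metrics_artifacts(params)
```

Under this split, the sklearn path defers model logging to the fit() wrapper, resolving the ordering problem jwyyy describes.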

Contributor (Author) reply:
Sounds good to me! I will make some adjustments.

safe_patch_call_count = (
    safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count
)
else:
Collaborator comment:
On the subject of test coverage, can we add a test case to https://github.com/mlflow/mlflow/blob/master/tests/xgboost/test_xgboost_autolog.py ensuring that autologging works as expected for XGBoost scikit-learn models? Feel free to use code from your excellent example above.

@dbczumar (Collaborator) left a comment:
@jwyyy This is looking great! I think we're almost there! Just left a few more comments - let me know if you have questions!

early_stopping = (
    num_pos_args >= early_stopping_index + 1 or "early_stopping_rounds" in kwargs
)
early_stopping = num_pos_args >= early_stopping_index + 1 or (
    "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"]
)

Member comment:

Suggested change:
-    "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"]
+    kwargs.get("early_stopping_rounds")

Can we use get here?
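The two spellings agree because dict.get returns None for a missing key, and None and 0 are both falsy, matching the two-step membership-then-lookup check. A quick equivalence check:

```python
# kwargs.get("early_stopping_rounds") vs. the two-step check:
# get() yields None for a missing key; None and 0 are falsy, so both
# forms evaluate identically under bool().
def early_stopping_enabled(kwargs):
    return bool(kwargs.get("early_stopping_rounds"))

def early_stopping_enabled_verbose(kwargs):
    return bool("early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"])
```

Note both forms treat an explicit `early_stopping_rounds=0` as disabled, which is the behavior the second diff hunk introduces.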

@jwyyy (Contributor, Author) commented Nov 25, 2021

Hi @dbczumar @harupy, I made some updates in this PR. Thank you for your review and suggestions! A new test file was added (to highlight some differences in XGBoost sklearn model testing), and it may contain more tests than we need; we can remove some later. I also modified the docs and re-organized the xgboost examples. Please let me know your suggestions when you have time to review it again. Thank you very much!

Happy Holidays to you! 😄

@harupy (Member) commented Nov 26, 2021

@jwyyy I pulled your PR branch and ran python examples/xgboost/xgboost_sklearn/train_sklearn.py. Looks like log_model is called twice, once with artifact_path="model" and once with artifact_path="log_model".

I just saw what examples/xgboost/xgboost_sklearn/train_sklearn.py does. Never mind this comment.

@jwyyy (Contributor, Author) commented Nov 26, 2021

@harupy Thank you for running the example! Yes, there is a line logging the model after training. We can remove it if it looks redundant.

@harupy (Member) commented Nov 26, 2021

Let's remove it because the model is automatically logged.

@@ -0,0 +1,498 @@
from packaging.version import Version
@harupy (Member) commented Nov 26, 2021:
I think adding a simple test like below in tests/xgboost/test_xgboost_autolog.py should be enough.

@pytest.mark.large
def test_xgb_autolog_sklearn():
    mlflow.xgboost.autolog()

    X, y = datasets.load_iris(return_X_y=True)
    params = {"n_estimators": 10, "reg_lambda": 1}
    model = xgb.XGBRegressor(**params)

    with mlflow.start_run() as run:
        model.fit(X, y)
        model_uri = mlflow.get_artifact_uri("model")

    client = mlflow.tracking.MlflowClient()
    run = client.get_run(run.info.run_id)
    assert run.data.metrics.items() <= params.items()
    artifacts = set(x.path for x in client.list_artifacts(run.info.run_id))
    assert artifacts >= set(["feature_importance_weight.png", "feature_importance_weight.json"])
    loaded_model = mlflow.xgboost.load_model(model_uri)
    np.testing.assert_allclose(loaded_model.predict(X), model.predict(X))

# are done in `train()` in `mlflow.xgboost.autolog()`
fit_output = original(self, *args, **kwargs)
# log models after training
(X, _, _) = _get_args_for_metrics(self.fit, args, kwargs)
Member comment:
Does _get_args_for_metrics always return a tuple with 3 elements?

Member follow-up:
Never mind. It does:

def _get_args_for_metrics(fit_func, fit_args, fit_kwargs):

Member follow-up:
X = _get_args_for_metrics(self.fit, args, kwargs)[0] might be safer.

Member follow-up:
We might want to rename _get_args_for_metrics to something like _get_X_y_and_sample_weight. I'll take care of this.

mse = mean_squared_error(y_test, y_pred)
run_id = run.info.run_id
print("Logged data and model in run {}".format(run_id))
mlflow.xgboost.log_model(regressor, artifact_path="log_model")
@harupy (Member) commented Nov 26, 2021:
Let's remove this line (#5078 (comment))!

@@ -0,0 +1,12 @@
# XGBoost Scikit-learn Model Example

This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model.
Member comment:
Suggested change:
- This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model.
+ This example trains an [`XGBoost.XGBRegressor`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor) with the diabetes dataset and logs hyperparameters, metrics, and the trained model.

@@ -0,0 +1,44 @@
from pprint import pprint
Member comment:
Nit: can we rename this file to train.py, since other examples use this convention?

Comment on lines 7 to 12
- mlflow
- mypy-extensions==0.4.3
- pandas==1.3.4
- scikit-learn==0.24.2
- typing-extensions==4.0.0
- xgboost==1.5.0
Member comment:
Suggested change (before):
  - mlflow
  - mypy-extensions==0.4.3
  - pandas==1.3.4
  - scikit-learn==0.24.2
  - typing-extensions==4.0.0
  - xgboost==1.5.0

Suggested change (after):
  - mlflow
  - pandas==1.3.4
  - scikit-learn==0.24.2
  - xgboost==1.5.0

Can we remove mypy-extensions and typing-extensions?

Comment on lines 18 to 20
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)
Member comment:
Suggested change (before):
  diabetes = load_diabetes()
  X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
  y = pd.Series(diabetes.target)

Suggested change (after):
  X, y = load_diabetes(return_X_y=True, as_frame=True)

Let's use return_X_y and as_frame to simplify the code.
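For reference, the suggested one-liner relies on scikit-learn's return_X_y and as_frame flags (as_frame is available since scikit-learn 0.23), which return the features as a DataFrame and the target as a Series directly:

```python
# Requires scikit-learn >= 0.23 (and pandas) for as_frame support.
import pandas as pd
from sklearn.datasets import load_diabetes

# Features as a DataFrame with named columns, target as a Series.
X, y = load_diabetes(return_X_y=True, as_frame=True)
```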

@harupy (Member) left a comment:
LGTM!

@jwyyy (Contributor, Author) commented Nov 27, 2021

@harupy Thank you for the review!

@harupy harupy requested a review from dbczumar November 29, 2021 00:23
@dbczumar (Collaborator) left a comment:
LGTM! Thank you so much for your contribution, @jwyyy !

@dbczumar dbczumar merged commit 5381d68 into mlflow:master Nov 29, 2021
@jwyyy (Contributor, Author) commented Nov 29, 2021

@dbczumar Thank you for your review!
