
Autologging functionality for scikit-learn integration with XGBoost (Part 2) #5078

Merged
merged 15 commits into from Nov 29, 2021

Conversation

@jwyyy (Contributor) commented Nov 17, 2021

What changes are proposed in this pull request?

This is the second PR to add autologging for XGBoost sklearn models using mlflow.sklearn autologging routine.

(Previous PR: #4954)

(Draft + discussion: #4885)

How is this patch tested?

A new example is provided. Tests will be added later.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the next step; otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Successful merge of this PR will enable autologging for XGBoost scikit-learn models.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@github-actions bot added labels area/examples, area/tracking, and rn/documentation on Nov 17, 2021
@jwyyy (Contributor, Author) commented Nov 17, 2021

Hi @dbczumar @harupy, I created a new PR and closed #5055 due to the recent merge #5039. (The conflict was too complex for GitHub to resolve automatically.) When you have time, please take a look and let me know your initial review. Then I can continue to work on improvements. Thanks!

@dbczumar (Collaborator) commented:
@jwyyy Thank you for your updates. I'll take a look first thing tomorrow!

@dbczumar (Collaborator) left a comment:
@jwyyy This is awesome! I left a few comments - let me know if you have any questions!

Review threads (resolved): mlflow/sklearn/__init__.py, mlflow/xgboost/__init__.py (×3), examples/xgboost_sklearn/train_sklearn.py (×2)
remove additional lines
@jwyyy (Contributor, Author) commented Nov 23, 2021

Hi @dbczumar, I made some updates in this PR based on your review. The main change is what you suggested here. We reuse the train() method in mlflow.xgboost.autolog() and patch it to xgboost.sklearn.train() [L683] (the train() method is called inside XGBoost sklearn models' fit()). Some minor changes are made to enable logging tags that are not logged in mlflow.xgboost.autolog().

Please let me know your ideas and suggestions when you have time to review it again. Thank you very much!
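The approach jwyyy describes can be sketched in miniature. This is a hypothetical illustration (not MLflow's actual safe_patch machinery) of why patching the module-level train() also intercepts the call that fit() makes internally; all names are made up for the example.

```python
# Hypothetical sketch: patching a module-level train() intercepts the
# internal call made by fit(), because fit() resolves train() through
# the module object at call time.
import types

def _train(params):
    """Stands in for xgboost.sklearn.train()."""
    return {"status": "trained", **params}

# Stand-in for the xgboost.sklearn module.
xgb_sklearn = types.SimpleNamespace(train=_train)

class XGBRegressorLike:
    """Mimics an XGBoost sklearn estimator whose fit() delegates to train()."""
    def fit(self, params):
        # A patched module attribute is picked up automatically here.
        self._Booster = xgb_sklearn.train(params)
        return self

logged_params = []

def patched_train(params):
    logged_params.append(dict(params))  # autologging side effect
    return _train(params)               # then defer to the original

# Patch, roughly as mlflow.xgboost.autolog() would via safe_patch.
xgb_sklearn.train = patched_train

model = XGBRegressorLike().fit({"eta": 0.1})
```

The key point is that fit() never needs to change: the interception happens at the module attribute it looks up.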

@jwyyy jwyyy requested a review from dbczumar November 23, 2021 17:30
@@ -0,0 +1,48 @@
from pprint import pprint

Collaborator comment:
Awesome example! Can we add a brief README to this directory explaining what this example covers? E.g. Usage of XGBoost's scikit-learn integration with MLflow Tracking, particularly autologging?

Comment on lines 1250 to 1253
# params of xgboost sklearn models are logged in train() in mlflow.xgboost.autolog()
if flavor_name == FLAVOR_NAME:
    _log_posttraining_metadata(autologging_client, self, *args, **kwargs)
    autologging_client.flush(synchronous=True)
Collaborator comment:

@jwyyy Instead of special casing XGBoost logic in fit_mlflow, can we define a new method called fit_mlflow_xgboost that just calls original(self, *args, **kwargs) and then logs self using mlflow.xgboost.log_model()? This will also allow us to revert changes to XGBoost autologging's train() method, since we can control how the model gets logged here.

We can then add a parameter to patched_fit (`def patched_fit(original, self, *args, **kwargs):`) to specify either fit_mlflow (for sklearn models) or fit_mlflow_xgboost (for xgboost sklearn models). Perhaps we can call this parameter fit_fn.

Let me know if you have questions here!
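The fit_fn dispatch proposed above could look roughly like this. The function bodies are placeholder assumptions (list appends instead of real MLflow logging calls) so the control flow is visible; this is not MLflow's actual implementation.

```python
# Sketch of the proposed fit_fn dispatch; logging calls are replaced
# with list appends. All bodies are assumptions, not MLflow's code.
logged = []

def fit_mlflow(original, self, *args, **kwargs):
    fitted = original(self, *args, **kwargs)
    logged.append("mlflow.sklearn.log_model")   # placeholder for sklearn logging
    return fitted

def fit_mlflow_xgboost(original, self, *args, **kwargs):
    fitted = original(self, *args, **kwargs)
    logged.append("mlflow.xgboost.log_model")   # placeholder for xgboost logging
    return fitted

def patched_fit(fit_fn, original, self, *args, **kwargs):
    # fit_fn selects the flavor-specific logging routine
    return fit_fn(original, self, *args, **kwargs)

class DummyModel:
    def fit(self, X):
        self.fitted_ = True
        return self

m = DummyModel()
patched_fit(fit_mlflow_xgboost, DummyModel.fit, m, [[0.0], [1.0]])
```

Passing the unbound DummyModel.fit as `original` mirrors how a patch receives the pre-patch implementation.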

Contributor (Author) reply:
@dbczumar Thank you for your suggestion! There is a small issue with model logging in the train() method when calling fit(). The fit() method calls train() internally and assigns a Booster object to the internal _Booster attribute of an XGBoost sklearn model [see L1331]. The current train() in mlflow.xgboost.autolog() logs the model before returning it, which means (1) the logged model is a Booster object, and (2) we cannot log an XGBoost sklearn model before the trained Booster is assigned to _Booster. The changes in mlflow.xgboost.autolog() try to log sklearn models directly. We definitely can log sklearn models in fit_mlflow_xgboost(), but it is extra work, because the model is already logged when train() is called, and fit_mlflow_xgboost() would just log new information to replace the old. However, I also think adopting your suggestion makes the code logic easier to read. Please let me know which solution sounds better to you. Thank you!

@dbczumar (Collaborator) replied Nov 23, 2021:
@jwyyy Ah, thanks for letting me know! Can we decompose train() into two methods - one for parameter, metric, & non-model artifact logging, and one for model logging? We can then use the former method to patch xgboost.sklearn.train().
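The decomposition suggested above can be sketched as follows. The helper names and bodies are hypothetical; the point is only the split: pre-model logging is reusable for patching xgboost.sklearn.train(), while the native train() patch keeps Booster logging.

```python
# Hypothetical sketch of splitting train() into pre-model logging and
# model logging. Dicts stand in for MLflow logging side effects.
def _log_params_metrics_artifacts(params):
    """Params, metrics, and non-model artifacts: reusable by both patches."""
    return {"params": dict(params)}

def _log_model(model):
    """Model logging, applied to whichever object should be persisted."""
    return {"model": model}

def native_train_patch(params):
    # Patches xgboost.train(): log everything, including the Booster.
    record = _log_params_metrics_artifacts(params)
    booster = "Booster"                 # stands in for the trained Booster
    record.update(_log_model(booster))
    return record

def sklearn_train_patch(params):
    # Patches xgboost.sklearn.train(): skip model logging so the sklearn
    # estimator can be logged after fit() assigns _Booster.
    return _log_params_metrics_artifacts(params)
```

Under this split, the sklearn path defers model logging to the fit() wrapper, resolving the ordering problem jwyyy describes.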

Contributor (Author) reply:
Sounds good to me! I will make some adjustments.

safe_patch_call_count = (
    safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count
)
else:
Collaborator comment:
On the subject of test coverage, can we add a test case to https://github.com/mlflow/mlflow/blob/master/tests/xgboost/test_xgboost_autolog.py ensuring that autologging works as expected for XGBoost scikit-learn models? Feel free to use code from your excellent example above.

@dbczumar (Collaborator) left a comment:
@jwyyy This is looking great! I think we're almost there! Just left a few more comments - let me know if you have questions!

early_stopping = (
    num_pos_args >= early_stopping_index + 1 or "early_stopping_rounds" in kwargs
)
early_stopping = num_pos_args >= early_stopping_index + 1 or (
    "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"]
)

Member comment:

Suggested change:
-    "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"]
+    kwargs.get("early_stopping_rounds")

Can we use get here?
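The two spellings agree because dict.get returns None for a missing key, and None and 0 are both falsy, matching the two-step membership-then-lookup check. A quick equivalence check:

```python
# kwargs.get("early_stopping_rounds") vs. the two-step check:
# get() yields None for a missing key; None and 0 are falsy, so both
# forms evaluate identically under bool().
def early_stopping_enabled(kwargs):
    return bool(kwargs.get("early_stopping_rounds"))

def early_stopping_enabled_verbose(kwargs):
    return bool("early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"])
```

Note both forms treat an explicit `early_stopping_rounds=0` as disabled, which is the behavior the second diff hunk introduces.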

@jwyyy (Contributor, Author) commented Nov 25, 2021

Hi @dbczumar @harupy, I made some updates in this PR. Thank you for your review and suggestions! A new test file was added (to highlight some differences in XGBoost sklearn model testing), and it may contain more tests than we need; we can remove some later. I also modified the docs and re-organized the xgboost examples. Please let me know your suggestions when you have time to review it again. Thank you very much!

Happy Holidays to you! 😄

@harupy (Member) commented Nov 26, 2021

@jwyyy I pulled your PR branch and ran python examples/xgboost/xgboost_sklearn/train_sklearn.py. Looks like log_model is called twice, once with artifact_path="model" and once with artifact_path="log_model".

I just saw what examples/xgboost/xgboost_sklearn/train_sklearn.py does. Never mind this comment.

@jwyyy (Contributor, Author) commented Nov 26, 2021

@harupy Thank you for running the example! Yes, there is a line logging the model after training. We can remove it if it looks redundant.

@harupy (Member) commented Nov 26, 2021

Let's remove it because the model is automatically logged.

@@ -0,0 +1,498 @@
from packaging.version import Version
@harupy (Member) commented Nov 26, 2021:
I think adding a simple test like below in tests/xgboost/test_xgboost_autolog.py should be enough.

@pytest.mark.large
def test_xgb_autolog_sklearn():
    mlflow.xgboost.autolog()

    X, y = datasets.load_iris(return_X_y=True)
    params = {"n_estimators": 10, "reg_lambda": 1}
    model = xgb.XGBRegressor(**params)

    with mlflow.start_run() as run:
        model.fit(X, y)
        model_uri = mlflow.get_artifact_uri("model")

    client = mlflow.tracking.MlflowClient()
    run = client.get_run(run.info.run_id)
    assert run.data.metrics.items() <= params.items()
    artifacts = set(x.path for x in client.list_artifacts(run.info.run_id))
    assert artifacts >= set(["feature_importance_weight.png", "feature_importance_weight.json"])
    loaded_model = mlflow.xgboost.load_model(model_uri)
    np.testing.assert_allclose(loaded_model.predict(X), model.predict(X))

# are done in `train()` in `mlflow.xgboost.autolog()`
fit_output = original(self, *args, **kwargs)
# log models after training
(X, _, _) = _get_args_for_metrics(self.fit, args, kwargs)
Member comment:
Does _get_args_for_metrics always return a tuple with 3 elements?

Member follow-up:
Never mind. It does:

def _get_args_for_metrics(fit_func, fit_args, fit_kwargs):

Member follow-up:
X = _get_args_for_metrics(self.fit, args, kwargs)[0] might be safer.

Member follow-up:
We might want to rename _get_args_for_metrics to something like _get_X_y_and_sample_weight. I'll take care of this.

mse = mean_squared_error(y_test, y_pred)
run_id = run.info.run_id
print("Logged data and model in run {}".format(run_id))
mlflow.xgboost.log_model(regressor, artifact_path="log_model")
@harupy (Member) commented Nov 26, 2021:
Let's remove this line (#5078 (comment))!

@@ -0,0 +1,12 @@
# XGBoost Scikit-learn Model Example

This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model.
Member comment:
Suggested change:
- This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model.
+ This example trains an [`XGBoost.XGBRegressor`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor) with the diabetes dataset and logs hyperparameters, metrics, and the trained model.

@@ -0,0 +1,44 @@
from pprint import pprint
Member comment:
Nit: can we rename this file to train.py, since other examples use this convention?

Comment on lines 7 to 12
- mlflow
- mypy-extensions==0.4.3
- pandas==1.3.4
- scikit-learn==0.24.2
- typing-extensions==4.0.0
- xgboost==1.5.0
Member comment:
Suggested change (before):
  - mlflow
  - mypy-extensions==0.4.3
  - pandas==1.3.4
  - scikit-learn==0.24.2
  - typing-extensions==4.0.0
  - xgboost==1.5.0

Suggested change (after):
  - mlflow
  - pandas==1.3.4
  - scikit-learn==0.24.2
  - xgboost==1.5.0

Can we remove mypy-extensions and typing-extensions?

Comment on lines 18 to 20
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)
Member comment:
Suggested change (before):
  diabetes = load_diabetes()
  X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
  y = pd.Series(diabetes.target)

Suggested change (after):
  X, y = load_diabetes(return_X_y=True, as_frame=True)

Let's use return_X_y and as_frame to simplify the code.
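For reference, the suggested one-liner relies on scikit-learn's return_X_y and as_frame flags (as_frame is available since scikit-learn 0.23), which return the features as a DataFrame and the target as a Series directly:

```python
# Requires scikit-learn >= 0.23 (and pandas) for as_frame support.
import pandas as pd
from sklearn.datasets import load_diabetes

# Features as a DataFrame with named columns, target as a Series.
X, y = load_diabetes(return_X_y=True, as_frame=True)
```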

@harupy (Member) left a comment:
LGTM!

@jwyyy (Contributor, Author) commented Nov 27, 2021

@harupy Thank you for the review!

@harupy harupy requested a review from dbczumar November 29, 2021 00:23
@dbczumar (Collaborator) left a comment:
LGTM! Thank you so much for your contribution, @jwyyy !

@dbczumar dbczumar merged commit 5381d68 into mlflow:master Nov 29, 2021
@jwyyy (Contributor, Author) commented Nov 29, 2021

@dbczumar Thank you for your review!
