Autologging functionality for scikit-learn integration with XGBoost (Part 2) #5078
Conversation
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@jwyyy Thank you for your updates. I'll take a look first thing tomorrow!
@jwyyy This is awesome! I left a few comments - let me know if you have any questions!
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
remove additional lines Signed-off-by: Junwen Yao <jwyiao@gmail.com>
564a2c5 to 6bc2a8e (Compare)
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Hi @dbczumar, I made some updates in this PR based on your review. The main change is what you suggested here. Please let me know your ideas and suggestions when you have time to review it again. Thank you very much!
```diff
@@ -0,0 +1,48 @@
+from pprint import pprint
```
Awesome example! Can we add a brief README to this directory explaining what this example covers? E.g. Usage of XGBoost's scikit-learn integration with MLflow Tracking, particularly autologging?
mlflow/sklearn/__init__.py (Outdated)

```python
# params of xgboost sklearn models are logged in train() in mlflow.xgboost.autolog()
if flavor_name == FLAVOR_NAME:
    _log_posttraining_metadata(autologging_client, self, *args, **kwargs)
    autologging_client.flush(synchronous=True)
```
@jwyyy Instead of special-casing XGBoost logic in `fit_mlflow`, can we define a new method called `fit_mlflow_xgboost` that just calls `original(self, *args, **kwargs)` and then logs `self` using `mlflow.xgboost.log_model()`? This will also allow us to revert the changes to XGBoost autologging's `train()` method, since we can control how the model gets logged here.

We can then add a parameter to `patched_fit` (mlflow/mlflow/sklearn/__init__.py, Line 1372 in 6e4b64b: `def patched_fit(original, self, *args, **kwargs):`) that specifies either `fit_mlflow` (for sklearn models) or `fit_mlflow_xgboost` (for xgboost sklearn models). Perhaps we can call this parameter `fit_fn`.

Let me know if you have questions here!
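The dispatch described above could look roughly like the following pure-Python sketch. The names (`fit_mlflow`, `fit_mlflow_xgboost`, `fit_fn`, `patched_fit`) mirror the discussion, but the bodies are illustrative stand-ins, not MLflow's actual implementation:

```python
# Illustrative sketch only: fit_fn selects which fit implementation
# patched_fit runs, instead of special-casing XGBoost inside fit_mlflow.
# The log_model calls are replaced by list appends so the sketch is
# self-contained and runnable.
logged = []

def fit_mlflow(original, self, *args, **kwargs):
    fit_output = original(self, *args, **kwargs)
    logged.append(("sklearn", self))   # stand-in for mlflow.sklearn.log_model
    return fit_output

def fit_mlflow_xgboost(original, self, *args, **kwargs):
    fit_output = original(self, *args, **kwargs)
    logged.append(("xgboost", self))   # stand-in for mlflow.xgboost.log_model
    return fit_output

def patched_fit(fit_fn, original, self, *args, **kwargs):
    # fit_fn is fit_mlflow for sklearn models, fit_mlflow_xgboost for
    # XGBoost sklearn models
    return fit_fn(original, self, *args, **kwargs)

class FakeModel:
    """Hypothetical stand-in for an estimator."""

def original(self, X, y):
    # a real original() would train the model and return the fit output
    return self

model = FakeModel()
patched_fit(fit_mlflow_xgboost, original, model, [[1.0]], [1.0])
```

The point of the design is that the model-logging flavor becomes a caller-supplied strategy rather than an `if flavor_name == FLAVOR_NAME` branch inside the shared path.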
@dbczumar Thank you for your suggestion! There is a small issue with model logging in the `train()` method when calling `fit()`. The `fit()` method calls `train()` internally and assigns a Booster object to the internal `_Booster` attribute of an XGBoost sklearn model [see L1331]. The current `train()` in `mlflow.xgboost.autolog()` logs models before returning the `model` object, which means (1) the logged model is a Booster object; (2) we cannot log XGBoost sklearn models before the trained Booster is assigned to `_Booster`. The changes in `mlflow.xgboost.autolog()` try to log sklearn models directly. I think we definitely can log sklearn models in `fit_mlflow_xgboost()`, but it is extra work, because models are already logged when calling `train()`, and calling `fit_mlflow_xgboost()` would just log new information to replace the old. However, I also think adopting your suggestion makes the code logic easier to read. Please let me know which solution sounds better to you. Thank you!
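The ordering issue can be seen in a tiny pure-Python simulation (the classes below are hypothetical stand-ins for `xgboost.Booster` and `xgboost.XGBRegressor`, not the real API):

```python
# Minimal sketch of why logging inside train() captures the wrong object:
# train() finishes before the trained Booster is assigned back to the
# sklearn wrapper's _Booster attribute.
logged_models = []

class Booster:
    """Stand-in for xgboost.Booster."""

def train(params):
    booster = Booster()
    # If autologging logs the model *inside* train(), it can only see the
    # Booster: the sklearn wrapper has not received it yet.
    logged_models.append(booster)
    return booster

class XGBRegressorSketch:
    """Stand-in for xgboost.XGBRegressor."""
    def fit(self, X, y):
        # the trained Booster is assigned only after train() returns
        self._Booster = train({})
        return self

model = XGBRegressorSketch().fit(None, None)
```

After `fit()` returns, `logged_models[0]` holds the inner Booster rather than the sklearn estimator, which is exactly the mismatch described above.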
@jwyyy Ah, thanks for letting me know! Can we decompose `train()` into two methods: one for parameter, metric, and non-model artifact logging, and one for model logging? We can then use the former method to patch `xgboost.sklearn.train()`.
Sounds good to me! I will make some adjustments.
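A minimal sketch of that decomposition, with list appends standing in for the actual MLflow logging calls (the helper names here are hypothetical, chosen only to illustrate the split):

```python
# Sketch: split the monolithic train() patch into one piece for
# params/metrics/artifacts and one for model logging, so the sklearn
# code path can reuse the former and log the model itself.
records = []

def _log_params_and_metrics(params):
    # stand-in for the autologging_client parameter/metric calls
    records.append(("params", params))

def _log_model(model):
    # stand-in for mlflow.xgboost.log_model
    records.append(("model", model))

def train_impl(params):
    """Patch target for xgboost.sklearn.train(): no model logging here."""
    _log_params_and_metrics(params)
    return "booster"

def train(params):
    """Patch target for top-level xgboost.train(): also logs the model."""
    booster = train_impl(params)
    _log_model(booster)
    return booster
```

With this split, the sklearn integration patches `train_impl`-style logic and performs its own `log_model` on the fitted estimator, while plain `xgboost.train()` keeps its existing behavior.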
```python
    safe_patch_call_count = (
        safe_patch_mock.call_count + xgb_sklearn_safe_patch_mock.call_count
    )
else:
```
On the subject of test coverage, can we add a test case to https://github.com/mlflow/mlflow/blob/master/tests/xgboost/test_xgboost_autolog.py ensuring that autologging works as expected for XGBoost scikit-learn models? Feel free to use code from your excellent example above.
@jwyyy This is looking great! I think we're almost there! Just left a few more comments - let me know if you have questions!
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
mlflow/xgboost/__init__.py (Outdated)

```python
early_stopping = num_pos_args >= early_stopping_index + 1 or (
    "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"]
)
```

Suggested change:

```diff
-    "early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"]
+    kwargs.get("early_stopping_rounds")
```

Can we use `get` here?
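For context on why `get` works here: in a boolean context, `"early_stopping_rounds" in kwargs and kwargs["early_stopping_rounds"]` and `kwargs.get("early_stopping_rounds")` behave the same, since `get` returns `None` (falsy) for a missing key. A small standalone check (the function name is made up for illustration):

```python
# dict.get returns None for a missing key, so truthiness of
# kwargs.get("early_stopping_rounds") collapses the membership test
# and the value test into one expression.
def early_stopping_requested(**kwargs):
    return bool(kwargs.get("early_stopping_rounds"))

assert early_stopping_requested(early_stopping_rounds=5)
assert not early_stopping_requested()                          # key absent
assert not early_stopping_requested(early_stopping_rounds=None)
```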
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Hi @dbczumar @harupy, I made some updates in this PR. Thank you for your review and suggestions! A new test file was added (to highlight some differences in XGBoost sklearn model testing), and it may contain more tests than we need; we can remove them later. I also modified the docs and re-organized the XGBoost examples. Please let me know your suggestions when you have time to review it again. Thank you very much! Happy Holidays to you! 😄
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
@jwyyy I pulled your PR branch and ran the example. It looks like the model is logged again after training by an explicit `log_model` call.
@harupy Thank you for running the example! Yes, there is a line logging the model after training. We can remove it if it looks redundant.
Let's remove it because the model is automatically logged.
```diff
@@ -0,0 +1,498 @@
+from packaging.version import Version
```

I think adding a simple test like the one below in `tests/xgboost/test_xgboost_autolog.py` should be enough.

```python
@pytest.mark.large
def test_xgb_autolog_sklearn():
    mlflow.xgboost.autolog()

    X, y = datasets.load_iris(return_X_y=True)
    params = {"n_estimators": 10, "reg_lambda": 1}
    model = xgb.XGBRegressor(**params)

    with mlflow.start_run() as run:
        model.fit(X, y)
        model_uri = mlflow.get_artifact_uri("model")

    client = mlflow.tracking.MlflowClient()
    run = client.get_run(run.info.run_id)
    # params are stored as strings in MLflow Tracking
    assert run.data.params.items() >= {k: str(v) for k, v in params.items()}.items()
    artifacts = set(x.path for x in client.list_artifacts(run.info.run_id))
    assert artifacts >= {"feature_importance_weight.png", "feature_importance_weight.json"}
    loaded_model = mlflow.xgboost.load_model(model_uri)
    np.testing.assert_allclose(loaded_model.predict(X), model.predict(X))
```
mlflow/sklearn/__init__.py (Outdated)

```python
# are done in `train()` in `mlflow.xgboost.autolog()`
fit_output = original(self, *args, **kwargs)
# log models after training
(X, _, _) = _get_args_for_metrics(self.fit, args, kwargs)
```

Does `_get_args_for_metrics` always return a tuple with 3 elements?
Never mind. It does:

mlflow/mlflow/sklearn/utils.py, Line 48 in 22e69c3:

```python
def _get_args_for_metrics(fit_func, fit_args, fit_kwargs):
```
`X = _get_args_for_metrics(self.fit, args, kwargs)[0]` might be safer.
We might want to rename `_get_args_for_metrics` to something like `_get_X_y_and_sample_weight`. I'll take care of this.
```python
mse = mean_squared_error(y_test, y_pred)
run_id = run.info.run_id
print("Logged data and model in run {}".format(run_id))
mlflow.xgboost.log_model(regressor, artifact_path="log_model")
```

Let's remove this line (#5078 (comment))!
```diff
@@ -0,0 +1,12 @@
+# XGBoost Scikit-learn Model Example
+
+This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model.
```

Suggested change:

```diff
-This example trains an XGBoost regressor ([XGBoost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) with the diabetes dataset and logs hyperparameters, metrics, and trained model.
+This example trains an [`XGBoost.XGBRegressor`](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor) with the diabetes dataset and logs hyperparameters, metrics, and the trained model.
```
```diff
@@ -0,0 +1,44 @@
+from pprint import pprint
```

Nit: can we rename this file to `train.py`, because other examples use this convention?
```yaml
- mlflow
- mypy-extensions==0.4.3
- pandas==1.3.4
- scikit-learn==0.24.2
- typing-extensions==4.0.0
- xgboost==1.5.0
```

Suggested change:

```diff
  - mlflow
- - mypy-extensions==0.4.3
  - pandas==1.3.4
  - scikit-learn==0.24.2
- - typing-extensions==4.0.0
  - xgboost==1.5.0
```

Can we remove `mypy-extensions` and `typing-extensions`?
```python
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target)
```

Suggested change:

```diff
-diabetes = load_diabetes()
-X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
-y = pd.Series(diabetes.target)
+X, y = load_diabetes(return_X_y=True, as_frame=True)
```

Let's use `return_X_y` and `as_frame` to simplify the code.
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
Signed-off-by: Junwen Yao <jwyiao@gmail.com>
LGTM!
@harupy Thank you for the review!
LGTM! Thank you so much for your contribution, @jwyyy !
@dbczumar Thank you for your review!
What changes are proposed in this pull request?
This is the second PR to add autologging for XGBoost sklearn models using the `mlflow.sklearn` autologging routine.
(Previous PR: #4954)
(Draft + discussion: #4885)
How is this patch tested?
A new example is provided. Tests will be added later.
Does this PR change the documentation?
1. Check the status of the `ci/circleci: build_doc` check. If it's successful, proceed to the next step; otherwise fix it.
2. Click `Details` on the right to open the job page of CircleCI.
3. Click the `Artifacts` tab.
4. Open `docs/build/html/index.html` to verify the rendered pages.
.Release Notes
Is this a user-facing change?
Merging this PR will enable autologging for XGBoost scikit-learn models.
What component(s), interfaces, languages, and integrations does this PR affect?

Components
- `area/artifacts`: Artifact stores and artifact logging
- `area/build`: Build and test infrastructure for MLflow
- `area/docs`: MLflow documentation pages
- `area/examples`: Example code
- `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
- `area/models`: MLmodel format, model serialization/deserialization, flavors
- `area/projects`: MLproject format, project running backends
- `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
- `area/server-infra`: MLflow Tracking server backend
- `area/tracking`: Tracking Service, tracking client APIs, autologging

Interface
- `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- `area/docker`: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- `area/sqlalchemy`: Use of SQLAlchemy in the Tracking Service or Model Registry
- `area/windows`: Windows support

Language
- `language/r`: R APIs and clients
- `language/java`: Java APIs and clients
- `language/new`: Proposals for new client languages

Integrations
- `integrations/azure`: Azure and Azure ML integrations
- `integrations/sagemaker`: SageMaker integrations
- `integrations/databricks`: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- `rn/breaking-change`: The PR will be mentioned in the "Breaking Changes" section
- `rn/none`: No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- `rn/feature`: A new user-facing feature worth mentioning in the release notes
- `rn/bug-fix`: A user-facing bug fix worth mentioning in the release notes
- `rn/documentation`: A user-facing documentation change worth mentioning in the release notes