Autologging functionality for scikit-learn integration with XGBoost (Part 2) #5078
Awesome example! Can we add a brief README to this directory explaining what this example covers? E.g. Usage of XGBoost's scikit-learn integration with MLflow Tracking, particularly autologging?
@jwyyy Instead of special-casing XGBoost logic in `fit_mlflow`, can we define a new method called `fit_mlflow_xgboost` that just calls `original(self, *args, **kwargs)` and then logs `self` using `mlflow.xgboost.log_model()`? This will also allow us to revert the changes to XGBoost autologging's `train()` method, since we can control how the model gets logged here.

We can then add a parameter to `patched_fit` (`mlflow/mlflow/sklearn/__init__.py`, line 1372 in 6e4b64b) that selects either `fit_mlflow` (for sklearn models) or `fit_mlflow_xgboost` (for xgboost sklearn models). Perhaps we can call this parameter `fit_fn`.

Let me know if you have questions here!
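As a rough illustration of the `fit_fn` dispatch being proposed, here is a self-contained sketch. The names `fit_mlflow`, `fit_mlflow_xgboost`, `patched_fit`, and `fit_fn` come from the discussion above, but the MLflow logging calls are replaced with a stub list and the estimator is a toy class, since the real `mlflow.sklearn.log_model()` / `mlflow.xgboost.log_model()` internals are not shown here:

```python
log_calls = []  # stand-in for mlflow.*.log_model() calls

def fit_mlflow(original, self, *args, **kwargs):
    # existing sklearn path: run the original fit, then log the model
    result = original(self, *args, **kwargs)
    log_calls.append(("mlflow.sklearn.log_model", type(self).__name__))
    return result

def fit_mlflow_xgboost(original, self, *args, **kwargs):
    # proposed xgboost path: run the original fit, then log `self`
    # (the fitted sklearn wrapper) via mlflow.xgboost.log_model()
    result = original(self, *args, **kwargs)
    log_calls.append(("mlflow.xgboost.log_model", type(self).__name__))
    return result

def patched_fit(fit_fn, original, self, *args, **kwargs):
    # fit_fn selects which logging path applies to this estimator
    return fit_fn(original, self, *args, **kwargs)

class ToyXGBRegressor:  # stand-in for xgboost.XGBRegressor
    def fit(self, X, y):
        self.fitted_ = True
        return self

model = ToyXGBRegressor()
patched_fit(fit_mlflow_xgboost, ToyXGBRegressor.fit, model, [[1.0]], [0.0])
print(log_calls)  # [('mlflow.xgboost.log_model', 'ToyXGBRegressor')]
```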
@dbczumar Thank you for your suggestion! There is a small issue with model logging in the `train()` method when calling `fit()`. The `fit()` method calls `train()` internally and assigns a Booster object to the internal `_Booster` attribute of an XGBoost sklearn model [see L1331]. The current `train()` in `mlflow.xgboost.autolog()` logs the model before returning the `model` object, which means (1) the logged model is a Booster object, and (2) we cannot log XGBoost sklearn models before the trained Booster is assigned to `_Booster`. The changes in `mlflow.xgboost.autolog()` try to log sklearn models directly. I think we definitely can log sklearn models in `fit_mlflow_xgboost()`, but it is extra work, because models are already logged when `train()` is called, and calling `fit_mlflow_xgboost()` would just log new information to replace the old. However, I also think adopting your suggestion makes the code logic easier to read. Please let me know which solution sounds better to you. Thank you!
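The ordering problem described here can be sketched with stand-in classes. These are not the real xgboost or MLflow objects; they only mimic the call order in which the patched `train()` runs inside `fit()`:

```python
logged_models = []

def patched_train(params):
    # old behavior: the patch around xgboost.train() logs the model itself
    booster = "Booster"  # stand-in for the low-level xgboost.Booster
    logged_models.append(booster)
    return booster

class ToyXGBModel:  # stand-in for the xgboost sklearn wrapper
    def fit(self):
        # train() runs (and logs) *before* _Booster is assigned, so only
        # the raw Booster ever reaches the model log -- the fitted sklearn
        # wrapper itself is never seen by the logging code
        self._Booster = patched_train({})
        return self

model = ToyXGBModel().fit()
print(logged_models)  # ['Booster'] -- the sklearn wrapper was never logged
```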
@jwyyy Ah, thanks for letting me know! Can we decompose `train()` into two methods, one for parameter, metric, and non-model artifact logging, and one for model logging? We can then use the former method to patch `xgboost.sklearn.train()`.
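One possible shape for that decomposition, sketched with stub logging. The function names here are hypothetical, not MLflow's actual internals; the point is that the sklearn `fit()` path reuses the parameter/metric piece but skips Booster logging, so the fitted wrapper can be logged afterwards instead:

```python
events = []  # stand-in for MLflow logging calls

def log_params_metrics_and_artifacts(params, metrics):
    # piece 1: parameter, metric, and non-model artifact logging,
    # shared by both the train() patch and the sklearn fit() patch
    events.append(("params", params))
    events.append(("metrics", metrics))

def log_booster_model(booster):
    # piece 2: model logging, used only when train() is called directly
    events.append(("booster_model", booster))

def patched_train(params, data, log_model=True):
    booster = {"trained_on": data}  # stand-in for the real xgboost.train()
    log_params_metrics_and_artifacts(params, {"rmse": 0.1})
    if log_model:
        log_booster_model(booster)
    return booster

def patched_sklearn_fit(wrapper_name, params, data):
    # the internal train() call logs params/metrics but not the Booster;
    # the fitted sklearn wrapper is logged afterwards instead
    booster = patched_train(params, data, log_model=False)
    events.append(("sklearn_model", wrapper_name))
    return booster

patched_sklearn_fit("XGBRegressor", {"max_depth": 3}, "training data")
print([kind for kind, _ in events])
# ['params', 'metrics', 'sklearn_model']
```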
Sounds good to me! I will make some adjustments.
On the subject of test coverage, can we add a test case to https://github.com/mlflow/mlflow/blob/master/tests/xgboost/test_xgboost_autolog.py ensuring that autologging works as expected for XGBoost scikit-learn models? Feel free to use code from your excellent example above.