[pyspark] Make Xgboost estimator support using sparse matrix as optimization #8145
Conversation
CC @wbo4958 @trivialfis The PR is ready for review. :)
Thank you for the work on sparse data support. Could you please fix the CI errors?
python-package/xgboost/spark/core.py
Outdated
"If enable_sparse_data_optim is True, missing param != 0 is not supported."
)

if self.getOrDefault(self.features_cols):
@wbo4958 Would you like to take a look into this?
python-package/xgboost/spark/core.py
Outdated
unwrap_udt = _get_unwrap_udt_fn()
features_unwrapped_vec_col = unwrap_udt(col(features_col_name))

# After a `pyspark.ml.linalg.VectorUDT` type column being unwrapped, it becomes
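For readers unfamiliar with the unwrapped layout, here is a minimal plain-Python sketch of the struct fields a `VectorUDT` column exposes after unwrapping (`unwrap_vector` and the dict stand-in are hypothetical illustrations, not the PR's code; the field convention is `type` 0 = sparse, 1 = dense):

```python
# Hypothetical plain-Python stand-in for the struct produced by unwrapping a
# pyspark.ml.linalg.VectorUDT column. Field layout assumption:
#   type: 0 = sparse, 1 = dense
#   size/indices are only populated for sparse vectors.
from typing import List, Optional


def unwrap_vector(vec_type: int, size: Optional[int],
                  indices: Optional[List[int]],
                  values: List[float]) -> dict:
    """Return a dict mirroring the unwrapped VectorUDT struct fields."""
    return {"type": vec_type, "size": size,
            "indices": indices, "values": values}


# A SparseVector(4, [1, 3], [5.0, 7.0]) unwraps to:
sparse_row = unwrap_vector(0, 4, [1, 3], [5.0, 7.0])
# A DenseVector([1.0, 0.0, 2.0]) unwraps to:
dense_row = unwrap_vector(1, None, None, [1.0, 0.0, 2.0])
```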
Thank you for the detailed comments!
python-package/xgboost/spark/data.py
Outdated
n_features = vec_size
assert n_features == vec_size

# remove zero elements from csr_indices / csr_values
Is this necessary? When `missing` is set to 0, XGBoost can remove those values.
I remember that in the DMatrix ctor, if the data argument is a csc/csr matrix, it ignores the "missing" argument and instead regards all inactive elements in the sparse matrix as missing values. (Ref: #341 (comment))
If so, then keeping the zero elements or removing them represents two different semantics:
- Keeping these zeros means they will be regarded as "zero"-valued features.
- Removing these zero elements means they will be regarded as missing values.
Is my understanding correct?
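To make the two semantics concrete, here is an illustrative plain-Python sketch (the `remove_zeros` helper is hypothetical, not the PR's actual code) showing the same CSR-style row under both interpretations:

```python
# Hypothetical helper: drop explicitly stored zeros from a CSR-style row so
# they become inactive (missing) elements instead of real 0.0 features.
def remove_zeros(indices, values):
    kept = [(i, v) for i, v in zip(indices, values) if v != 0.0]
    return [i for i, _ in kept], [v for _, v in kept]


# Row of width 4 with an explicit zero stored at index 2:
indices, values = [0, 2, 3], [1.5, 0.0, 2.5]

# Keeping the zero: feature 2 is a real 0.0-valued feature.
# Removing it: feature 2 becomes an inactive (missing) element.
new_indices, new_values = remove_zeros(indices, values)
# new_indices == [0, 3], new_values == [1.5, 2.5]
```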
The ctor for CSR matrix should be able to handle missing values (but not for the CSC, which would raise a warning).
xgboost/python-package/xgboost/data.py
Line 35 in 20d1bba
def _warn_unused_missing(data: DataType, missing: Optional[FloatCompatible]) -> None:
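For illustration, a simplified sketch of that kind of check (this is an assumption-laden stand-in, not the real `_warn_unused_missing` from xgboost's `data.py`): warn when `missing` is set but the input format, such as CSC, treats all inactive elements as missing and therefore ignores the param.

```python
import warnings


# Simplified, hypothetical version of a "missing is unused" check; the real
# _warn_unused_missing in xgboost/python-package/xgboost/data.py differs.
def warn_unused_missing_sketch(data_format: str, missing) -> None:
    """Warn when `missing` is set but the format ignores it (e.g. CSC,
    where all inactive elements are regarded as missing)."""
    if data_format == "csc" and missing is not None:
        warnings.warn(
            f"`missing` is not used for {data_format} input; inactive "
            "elements are treated as missing."
        )
```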
Which is a good reminder that I should clarify the difference.
@trivialfis
OK, so for missing handling, CSR is the same as dense input (it respects the "missing" param), but CSC is different (it ignores the "missing" param and regards inactive elements as missing), right?
We should document this in the DMatrix doc.
I will update the CSC implementation instead.
@trivialfis What does "updating the CSC implementation" mean? For the current implementation, does it really remove the whole instance if one element in the instance is a missing value?
dataset, features_cols_names
)
select_cols.extend(features_cols)
enable_sparse_data_optim = self.getOrDefault(self.enable_sparse_data_optim)
The _fit function is almost 200 lines, which is very large. Could we split this function up?
I am working on another Ranker estimator PR. We can do this refactor after these feature PRs are merged; otherwise, fixing conflicts is annoying.
python-package/xgboost/spark/core.py
Outdated
if enable_sparse_data_optim:
    from pyspark.ml.linalg import VectorUDT

    if self.getOrDefault(self.missing) != 0.0:
Could we put this check into _validate_params?
@trivialfis
According to your comment (#8145 (comment)), it seems we don't need to add this restriction?
I think we need the restriction that `missing` must be 0. Otherwise there would be two kinds of missing/invalid values: the 0s removed by Spark, and the `missing` values removed by XGBoost.
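As a sketch of what such a restriction can look like (the function name `validate_sparse_optim_params` is hypothetical; the param names follow the PR, but this is not the PR's exact code):

```python
# Hypothetical validation helper: reject configurations with two conflicting
# notions of "missing" (zeros dropped by Spark's sparse vectors vs. values
# equal to `missing` dropped by XGBoost).
def validate_sparse_optim_params(enable_sparse_data_optim: bool,
                                 missing: float) -> None:
    if enable_sparse_data_optim and missing != 0.0:
        raise ValueError(
            "If enable_sparse_data_optim is True, "
            "missing param != 0 is not supported."
        )
```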
Just to be sure that we are on the same page: for the xgboost.DMatrix class, CSR is the same as dense. The difference is in the Spark interface. Feel free to add related documentation.
BTW, pyspark 3.4 has not been released yet. Do we need to wait for it?
@wbo4958 No. The code still works on Spark < 3.4 (it only raises an error when this feature flag is turned on). Is this PR good to merge?
The reason we want this feature prior to the Spark 3.4 release is that we hope to make it work on the Databricks released runtime. #8108 (comment)
Sure, makes sense. Thanks!
part.featureVectorIndices,
part.featureVectorValues,
):
    if vec_type == 0:
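For context, here is a plain-Python sketch of assembling CSR arrays from unwrapped vector rows (`rows_to_csr` is an illustrative stand-in, not the PR's actual loop; it assumes the convention `type` 0 = sparse, 1 = dense, and that dropping zeros is safe because `missing` is 0):

```python
# Hypothetical illustration of building CSR arrays (indptr/indices/values)
# from unwrapped vector rows. rows: list of (vec_type, indices, values),
# where vec_type 0 = sparse and 1 = dense.
def rows_to_csr(rows):
    indptr, indices, values = [0], [], []
    for vec_type, row_indices, row_values in rows:
        if vec_type == 0:  # sparse: indices/values are already CSR-style
            indices.extend(row_indices)
            values.extend(row_values)
        else:  # dense: keep only nonzero entries (safe when missing == 0)
            for i, v in enumerate(row_values):
                if v != 0.0:
                    indices.append(i)
                    values.append(v)
        indptr.append(len(indices))
    return indptr, indices, values


rows = [
    (0, [1, 3], [5.0, 7.0]),          # SparseVector(4, [1, 3], [5.0, 7.0])
    (1, None, [1.0, 0.0, 2.0, 0.0]),  # DenseVector([1.0, 0.0, 2.0, 0.0])
]
indptr, indices, values = rows_to_csr(rows)
# indptr == [0, 2, 4], indices == [1, 3, 0, 2], values == [5.0, 7.0, 1.0, 2.0]
```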
Please correct me if I'm wrong: now that missing is 0, do we still really need the sparse vector? Per my understanding, if one instance has a missing value, then the whole instance will be removed.
@trivialfis Is that true? If so, training on sparse data makes no sense; almost all instances would be removed.
I think not, @wbo4958. You can check my test cases test_regressor_with_sparse_optim and test_classifier_with_sparse_optim: every training instance contains the missing value "0", but the generated model produces good prediction results when transforming.
Overall, LGTM, only one question left from me.
CC @trivialfis Thanks!
Looks good to me!
Closes #8108
Make Xgboost estimator support using sparse matrix as optimization.