
[pyspark] Make Xgboost estimator support using sparse matrix as optimization #8145

Merged

merged 18 commits into dmlc:master on Aug 18, 2022

Conversation

WeichenXu123
Contributor

Closes #8108

Make Xgboost estimator support using sparse matrix as optimization.

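For context, a minimal usage sketch of the new flag (the estimator name and the `spark` session are assumptions here; the parameter names follow this PR's diff):

    from pyspark.ml.linalg import Vectors
    from xgboost.spark import SparkXGBClassifier

    # Hypothetical toy DataFrame; `spark` is an existing SparkSession.
    df = spark.createDataFrame(
        [
            (Vectors.sparse(3, {0: 1.0, 2: 5.5}), 1),
            (Vectors.dense([0.0, 2.0, 1.0]), 0),
        ],
        ["features", "label"],
    )

    # enable_sparse_data_optim keeps the sparse representation end to end;
    # per the discussion below, `missing` must stay at 0.0 in this mode.
    classifier = SparkXGBClassifier(enable_sparse_data_optim=True, missing=0.0)
    model = classifier.fit(df)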
@WeichenXu123
Contributor Author

CC @wbo4958 @trivialfis The PR is ready for review. :)

Member

@trivialfis trivialfis left a comment

Thank you for the work on sparse data support. Could you please fix the CI errors?

"If enable_sparse_data_optim is True, missing param != 0 is not supported."
)

if self.getOrDefault(self.features_cols):
Member

@wbo4958 Would you like to take a look at this?

unwrap_udt = _get_unwrap_udt_fn()
features_unwrapped_vec_col = unwrap_udt(col(features_col_name))

# After a `pyspark.ml.linalg.VectorUDT` type column is unwrapped, it becomes
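For readers following along, a sketch of what the unwrapped column exposes, assuming the standard `VectorUDT` serialized schema (its `type` field is 0 for a SparseVector and 1 for a DenseVector, which the `vec_type == 0` check later in this diff relies on):

    # Illustrative only: VectorUDT's serialized struct has four fields.
    #   type:    tinyint       (0 = SparseVector, 1 = DenseVector)
    #   size:    int           (vector length; used by sparse vectors)
    #   indices: array<int>    (active positions; used by sparse vectors)
    #   values:  array<double> (active values, or all values when dense)
    from pyspark.sql.functions import col, unwrap_udt  # PySpark >= 3.4

    unwrapped = unwrap_udt(col("features"))
    vec_type_col = unwrapped.getField("type")
    vec_size_col = unwrapped.getField("size")
    vec_indices_col = unwrapped.getField("indices")
    vec_values_col = unwrapped.getField("values")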
Member

Thank you for the detailed comments!

n_features = vec_size
assert n_features == vec_size

# remove zero elements from csr_indices / csr_values
Member

Is this necessary? When missing is set to 0, XGBoost can remove those values.

Contributor Author

I recall that in the DMatrix ctor, if the data argument is a CSC/CSR matrix, it ignores the "missing" argument and regards all inactive elements of the sparse matrix as missing values. (Ref: #341 (comment))

If so, then keeping zero elements or removing them represents two different semantics:
keeping these zeros means they are regarded as zero-valued features,
while removing them means they are regarded as missing values.

Is my understanding correct?
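To make the two semantics concrete, a small scipy sketch (illustrative only): a CSR matrix can store an explicit 0.0 as an active element, which is not the same as leaving the position inactive:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Row 0 stores an explicit 0.0 at column 1 (an active element);
    # row 1 has no stored entry for column 1 (an inactive element).
    data = np.array([1.0, 0.0, 2.0, 3.0])
    indices = np.array([0, 1, 0, 2])
    indptr = np.array([0, 2, 4])
    m = csr_matrix((data, indices, indptr), shape=(2, 3))

    # Dropping explicit zeros turns "feature value is 0" into "feature is missing".
    m_pruned = m.copy()
    m_pruned.eliminate_zeros()
    print(m.nnz, m_pruned.nnz)  # 4 vs. 3 stored elements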

Member

The ctor for CSR matrices should be able to handle missing values (but not the one for CSC, which raises a warning instead).

def _warn_unused_missing(data: DataType, missing: Optional[FloatCompatible]) -> None:

Member

which is a good reminder that I should clear up the difference.

Contributor Author

@trivialfis
OK, so for missing-value handling, CSR is the same as dense input (it respects the "missing" param), but CSC is different (it ignores the "missing" param and regards inactive elements as missing), right?
We should document this in the DMatrix docs.

Member

I will update the CSC implementation instead.

Contributor

@trivialfis what does "updating the CSC implementation" mean? And for the current implementation, does it really remove the whole instance if one element of the instance is a missing value?

@WeichenXu123 WeichenXu123 changed the title [pyspark] Make Xgboost estimator support using sparse matrix as optimization [WIP][pyspark] Make Xgboost estimator support using sparse matrix as optimization Aug 12, 2022
dataset, features_cols_names
)
select_cols.extend(features_cols)
enable_sparse_data_optim = self.getOrDefault(self.enable_sparse_data_optim)
Contributor

The _fit function is almost 200 lines, which is huge. Could we split this function up?

Contributor Author

I am working on another Ranker estimator PR. We can do this refactoring after these feature PRs are merged; otherwise, fixing conflicts is annoying.

if enable_sparse_data_optim:
    from pyspark.ml.linalg import VectorUDT

    if self.getOrDefault(self.missing) != 0.0:
Contributor

Could we put this check into _validate_params?

Contributor Author

@trivialfis
According to your comment (#8145 (comment)),
it seems we don't need to add this restriction?

Member

I think we need the restriction that missing must be 0. Otherwise there would be two kinds of missing/invalid values: the 0s removed by Spark and the missing values removed by XGBoost.
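For reference, a sketch of how the diff enforces this restriction (the surrounding method context is assumed; the error message is the one quoted earlier in this review):

    # Hypothetical placement inside the estimator's parameter validation.
    if self.getOrDefault(self.enable_sparse_data_optim):
        if self.getOrDefault(self.missing) != 0.0:
            raise ValueError(
                "If enable_sparse_data_optim is True, missing param != 0 is not supported."
            )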

Member

Just to be sure we are on the same page: for the xgboost.DMatrix class, CSR input is handled the same as dense input. The difference is in the Spark interface. Feel free to add related documentation.
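A short sketch of the point above, constructing a DMatrix directly from a CSR matrix (toy data; for CSC input, the _warn_unused_missing helper shown earlier would warn that missing is ignored):

    import numpy as np
    import xgboost as xgb
    from scipy.sparse import csr_matrix

    X = csr_matrix(np.array([[1.0, 0.0, 3.0],
                             [0.0, 2.0, 0.0]]))
    y = np.array([1.0, 0.0])

    # With CSR input, DMatrix honors `missing` just like dense input;
    # non-stored (inactive) positions are treated as missing either way.
    dtrain = xgb.DMatrix(X, label=y, missing=0.0)
    print(dtrain.num_row(), dtrain.num_col())  # 2 3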

@WeichenXu123 WeichenXu123 changed the title [WIP][pyspark] Make Xgboost estimator support using sparse matrix as optimization [pyspark] Make Xgboost estimator support using sparse matrix as optimization Aug 12, 2022
WeichenXu123 and others added 10 commits August 13, 2022 12:00
@wbo4958
Contributor

wbo4958 commented Aug 15, 2022

BTW, PySpark 3.4 has not been released yet. Do we need to wait for it?

@WeichenXu123
Contributor Author

@wbo4958 No. The code still works on Spark < 3.4 (it only raises an error there if this feature flag is turned on).
I also added a test on Spark 3.4 (I install a Spark 3.4 dev version in CI).
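A sketch of the kind of guard being described; the real check lives behind the _get_unwrap_udt_fn helper seen earlier in the diff, and the import paths and error wording here are assumptions:

    def _get_unwrap_udt_fn():
        # Prefer the open-source PySpark (>= 3.4) function if present.
        try:
            from pyspark.sql.functions import unwrap_udt
            return unwrap_udt
        except ImportError:
            pass
        # Fall back to the Databricks runtime variant.
        try:
            from pyspark.databricks.sql.functions import unwrap_udt
            return unwrap_udt
        except ImportError as exc:
            raise RuntimeError(
                "enable_sparse_data_optim requires a Spark version that "
                "provides the unwrap_udt function."
            ) from exc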

Is this PR good to merge?

@WeichenXu123
Contributor Author

The reason we want this feature prior to the Spark 3.4 release is that we hope to make it work on released Databricks runtimes (#8108 (comment)).

@wbo4958
Contributor

wbo4958 commented Aug 15, 2022

Sure, makes sense. Thanks!

part.featureVectorIndices,
part.featureVectorValues,
):
if vec_type == 0:
Contributor

Please correct me if I'm wrong: now that missing is 0, do we still really need the sparse vector? Per my understanding, if one instance has a missing value, then the whole instance will be removed.

Contributor Author

@trivialfis Is that true? If so, then training on sparse data makes no sense; almost all instances would be removed?

Contributor Author

I think not, @wbo4958. You can check my test cases test_regressor_with_sparse_optim and test_classifier_with_sparse_optim: every training instance contains the missing value "0", but transforming with the generated model gives good prediction results.
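A condensed sketch of what those tests exercise (hypothetical toy data and an assumed `spark` session; the real cases live in this PR's test suite):

    from pyspark.ml.linalg import Vectors
    from xgboost.spark import SparkXGBRegressor

    # Every SparseVector row omits some positions, i.e. every instance has
    # implicit zeros, yet all instances still contribute to training.
    train_df = spark.createDataFrame(
        [
            (Vectors.sparse(3, {0: 1.0}), 2.0),
            (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 3.0),
            (Vectors.sparse(3, {0: 4.0, 2: 6.0}), 4.0),
        ],
        ["features", "label"],
    )
    model = SparkXGBRegressor(enable_sparse_data_optim=True, missing=0.0).fit(train_df)
    model.transform(train_df).show()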

@wbo4958
Contributor

wbo4958 commented Aug 15, 2022

Overall, LGTM, only one question left from me.

@WeichenXu123
Contributor Author

CC @trivialfis Thanks!

Member

@trivialfis trivialfis left a comment

Looks good to me!

@trivialfis trivialfis merged commit 53d2a73 into dmlc:master Aug 18, 2022
Successfully merging this pull request may close these issues:

Xgboost pyspark: Support pyspark Sparse Vector