[pyspark] Make Xgboost estimator support using sparse matrix as optimization #8145
Conversation
CC @wbo4958 @trivialfis The PR is ready for review. :)
Thank you for the work on sparse data support. Could you please fix the CI errors?
python-package/xgboost/spark/core.py
Outdated
"If enable_sparse_data_optim is True, missing param != 0 is not supported."
)

if self.getOrDefault(self.features_cols):
@wbo4958 Would you like to take a look into this?
python-package/xgboost/spark/core.py
Outdated
unwrap_udt = _get_unwrap_udt_fn()
features_unwrapped_vec_col = unwrap_udt(col(features_col_name))

# After a `pyspark.ml.linalg.VectorUDT` type column being unwrapped, it becomes
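For readers unfamiliar with the unwrapped layout, here is a minimal plain-Python sketch of the struct fields a `VectorUDT` column exposes after unwrapping (`unwrap_vector` and the dict stand-in are hypothetical illustrations, not the PR's code; the field convention is `type` 0 = sparse, 1 = dense):

```python
# Hypothetical plain-Python stand-in for the struct produced by unwrapping a
# pyspark.ml.linalg.VectorUDT column. Field layout assumption:
#   type: 0 = sparse, 1 = dense
#   size/indices are only populated for sparse vectors.
from typing import List, Optional


def unwrap_vector(vec_type: int, size: Optional[int],
                  indices: Optional[List[int]],
                  values: List[float]) -> dict:
    """Return a dict mirroring the unwrapped VectorUDT struct fields."""
    return {"type": vec_type, "size": size,
            "indices": indices, "values": values}


# A SparseVector(4, [1, 3], [5.0, 7.0]) unwraps to:
sparse_row = unwrap_vector(0, 4, [1, 3], [5.0, 7.0])
# A DenseVector([1.0, 0.0, 2.0]) unwraps to:
dense_row = unwrap_vector(1, None, None, [1.0, 0.0, 2.0])
```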
Thank you for the detailed comments!
python-package/xgboost/spark/data.py
Outdated
n_features = vec_size
assert n_features == vec_size

# remove zero elements from csr_indices / csr_values
Is this necessary? When `missing` is set to 0, XGBoost can remove those values.
I remember that in the DMatrix ctor, if the data argument is a csc/csr matrix, it ignores the "missing" argument and instead regards all inactive elements in the sparse matrix as missing values. (Ref: #341 (comment))
If so, then keeping the zero elements or removing them represents two different semantics:
- Keeping these zeros means they will be regarded as "zero"-valued features.
- Removing these zero elements means they will be regarded as missing values.
Is my understanding correct?
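To make the two semantics concrete, here is an illustrative plain-Python sketch (the `remove_zeros` helper is hypothetical, not the PR's actual code) showing the same CSR-style row under both interpretations:

```python
# Hypothetical helper: drop explicitly stored zeros from a CSR-style row so
# they become inactive (missing) elements instead of real 0.0 features.
def remove_zeros(indices, values):
    kept = [(i, v) for i, v in zip(indices, values) if v != 0.0]
    return [i for i, _ in kept], [v for _, v in kept]


# Row of width 4 with an explicit zero stored at index 2:
indices, values = [0, 2, 3], [1.5, 0.0, 2.5]

# Keeping the zero: feature 2 is a real 0.0-valued feature.
# Removing it: feature 2 becomes an inactive (missing) element.
new_indices, new_values = remove_zeros(indices, values)
# new_indices == [0, 3], new_values == [1.5, 2.5]
```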
The ctor for CSR matrix should be able to handle missing values (but not for the CSC, which would raise a warning).
xgboost/python-package/xgboost/data.py
Line 35 in 20d1bba
def _warn_unused_missing(data: DataType, missing: Optional[FloatCompatible]) -> None:
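For illustration, a simplified sketch of that kind of check (this is an assumption-laden stand-in, not the real `_warn_unused_missing` from xgboost's `data.py`): warn when `missing` is set but the input format, such as CSC, treats all inactive elements as missing and therefore ignores the param.

```python
import warnings


# Simplified, hypothetical version of a "missing is unused" check; the real
# _warn_unused_missing in xgboost/python-package/xgboost/data.py differs.
def warn_unused_missing_sketch(data_format: str, missing) -> None:
    """Warn when `missing` is set but the format ignores it (e.g. CSC,
    where all inactive elements are regarded as missing)."""
    if data_format == "csc" and missing is not None:
        warnings.warn(
            f"`missing` is not used for {data_format} input; inactive "
            "elements are treated as missing."
        )
```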
Which is a good reminder that I should clarify the difference.
@trivialfis
OK, so for missing handling, CSR is the same as dense input (it respects the "missing" param), but CSC is different (it ignores the "missing" param and regards inactive elements as missing), right?
We should document this in the DMatrix doc.
I will update the CSC implementation instead.
@trivialfis What does "updating the CSC implementation" mean? For the current implementation, does it really remove the whole instance if one element in the instance is a missing value?
dataset, features_cols_names
)
select_cols.extend(features_cols)
enable_sparse_data_optim = self.getOrDefault(self.enable_sparse_data_optim)
The _fit function is almost 200 lines, which is very large. Could we split this function up?
I am working on another Ranker estimator PR. We can do this refactor after these feature PRs are merged; otherwise, fixing conflicts is annoying.
python-package/xgboost/spark/core.py
Outdated
if enable_sparse_data_optim:
    from pyspark.ml.linalg import VectorUDT

    if self.getOrDefault(self.missing) != 0.0:
Could we put this check into _validate_params?
@trivialfis
According to your comment (#8145 (comment)), it seems we don't need to add this restriction?
I think we need the restriction that `missing` must be 0. Otherwise there would be two kinds of missing/invalid values: the 0s removed by Spark, and the `missing` values removed by XGBoost.
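As a sketch of what such a restriction can look like (the function name `validate_sparse_optim_params` is hypothetical; the param names follow the PR, but this is not the PR's exact code):

```python
# Hypothetical validation helper: reject configurations with two conflicting
# notions of "missing" (zeros dropped by Spark's sparse vectors vs. values
# equal to `missing` dropped by XGBoost).
def validate_sparse_optim_params(enable_sparse_data_optim: bool,
                                 missing: float) -> None:
    if enable_sparse_data_optim and missing != 0.0:
        raise ValueError(
            "If enable_sparse_data_optim is True, "
            "missing param != 0 is not supported."
        )
```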
Just to be sure that we are on the same page: for the xgboost.DMatrix class, CSR is the same as dense. The difference is in the Spark interface. Feel free to add related documentation.
BTW, pyspark 3.4 has not been released yet. Do we need to wait for it?
@wbo4958 No. The code still works on Spark < 3.4 (it only raises an error when this feature flag is turned on). Is this PR good to merge?
The reason we want this feature prior to the Spark 3.4 release is that we hope to make it work on the Databricks released runtime. #8108 (comment)
Sure, makes sense. Thanks!
part.featureVectorIndices,
part.featureVectorValues,
):
    if vec_type == 0:
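For context, here is a plain-Python sketch of assembling CSR arrays from unwrapped vector rows (`rows_to_csr` is an illustrative stand-in, not the PR's actual loop; it assumes the convention `type` 0 = sparse, 1 = dense, and that dropping zeros is safe because `missing` is 0):

```python
# Hypothetical illustration of building CSR arrays (indptr/indices/values)
# from unwrapped vector rows. rows: list of (vec_type, indices, values),
# where vec_type 0 = sparse and 1 = dense.
def rows_to_csr(rows):
    indptr, indices, values = [0], [], []
    for vec_type, row_indices, row_values in rows:
        if vec_type == 0:  # sparse: indices/values are already CSR-style
            indices.extend(row_indices)
            values.extend(row_values)
        else:  # dense: keep only nonzero entries (safe when missing == 0)
            for i, v in enumerate(row_values):
                if v != 0.0:
                    indices.append(i)
                    values.append(v)
        indptr.append(len(indices))
    return indptr, indices, values


rows = [
    (0, [1, 3], [5.0, 7.0]),          # SparseVector(4, [1, 3], [5.0, 7.0])
    (1, None, [1.0, 0.0, 2.0, 0.0]),  # DenseVector([1.0, 0.0, 2.0, 0.0])
]
indptr, indices, values = rows_to_csr(rows)
# indptr == [0, 2, 4], indices == [1, 3, 0, 2], values == [5.0, 7.0, 1.0, 2.0]
```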
Please correct me if I'm wrong: now that missing is 0, do we still really need the sparse vector? Per my understanding, if one instance has a missing value, then the whole instance will be removed.
@trivialfis Is that true? If so, training on sparse data makes no sense; almost all instances would be removed.
I think not, @wbo4958. You can check my test cases test_regressor_with_sparse_optim and test_classifier_with_sparse_optim: every training instance contains the missing value "0", but the generated model produces good prediction results when transforming.
Overall, LGTM, only one question left from me.
CC @trivialfis Thanks!
Looks good to me!
Closes #8108
Make Xgboost estimator support using sparse matrix as optimization.