[pyspark] support a list of feature column names #8117

wbo4958 · 2022-07-26T09:14:15Z

This PR introduces a new Param that stores a list of feature column names. With it, users no longer need to vectorize the features or build an array feature column beforehand; instead, they pass a list of column names as features_col.

E.g.,

feature_col_names = [x.name for x in raw_df.schema if x.name != label]
xgb = SparkXGBClassifier(features_col=feature_col_names, label_col=label)

Why introduce this new parameter?

Internally, XGBoost wraps the feature columns into a vector or array column on the JVM side, then unwraps that vector/array column back into individual feature columns on the Python side. Passing the individual feature columns directly avoids this round trip, which can improve performance.

For now there are some limitations: users must enable use_gpu when using this approach, and it may break some Spark ML pipelines, e.g. OneVsRest.

wbo4958 · 2022-07-26T09:14:40Z

@trivialfis @WeichenXu123 Could you help to review this PR? Thx

trivialfis

Thank you for the PR. Could you please:

add tests
Add a demo or test that shows how to use it with other spark ml utilities.

wbo4958 · 2022-07-27T06:43:55Z

Thank you for the PR. Could you please:

add tests

Add a demo or test that shows how to use it with other spark ml utilities.

Done @trivialfis @WeichenXu123

WeichenXu123 · 2022-07-27T12:52:54Z

python-package/xgboost/spark/core.py

+    for c in features_col_name:
+        if isinstance(dataset.schema[c].dataType, DoubleType):
+            feature_cols.append(col(c).cast(FloatType()).alias(c))
+        elif isinstance(dataset.schema[c].dataType, (FloatType, IntegralType)):


What is the IntegralType type? Do you mean IntegerType?

@trivialfis, could you correct me on whether xgboost supports IntegralType (Byte/Integer/Long/Short)?

Yes, sort of. XGBoost will convert the data to float internally.

BTW, XGBoost can convert it on the fly without creating a copy of data.

Thx @trivialfis, Yeah, if the type is short/byte, it will reduce the size of the shuffle write.
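
The type handling discussed in this thread can be sketched without Spark. The snippet below is a minimal illustration, not the code from `core.py`: it uses plain type-name strings in place of the `pyspark.sql.types` classes, and the helper name `plan_feature_casts` is hypothetical. The diff excerpt stops at the `elif`, so rejecting every other type with a `ValueError` is an assumption.

```python
# Spark's IntegralType covers ByteType, ShortType, IntegerType and LongType.
INTEGRAL_TYPES = {"byte", "short", "int", "long"}


def plan_feature_casts(schema):
    """Map each feature column to the action the loop in the diff would take.

    `schema` is a dict of column name -> type name, a stand-in for
    dataset.schema[c].dataType in the real code.
    """
    plan = {}
    for name, type_name in schema.items():
        if type_name == "double":
            # Doubles are cast to float, as in the first branch of the diff.
            plan[name] = "cast_to_float"
        elif type_name == "float" or type_name in INTEGRAL_TYPES:
            # Kept as-is; per the thread, XGBoost converts to float internally.
            plan[name] = "keep"
        else:
            # Assumption: this branch is not visible in the excerpt.
            raise ValueError(f"unsupported feature type for {name!r}: {type_name}")
    return plan


plan = plan_feature_casts({"age": "short", "income": "double", "score": "float"})
print(plan)
```

Leaving short/byte columns uncast is what keeps the shuffle write small, as noted in the reply above.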

WeichenXu123 · 2022-07-27T12:55:56Z

python-package/xgboost/spark/core.py

@@ -341,6 +363,22 @@ def _validate_and_convert_feature_col_as_array_col(dataset, features_col_name):
    return features_array_col


+def _validate_and_convert_feature_col(


Is this function being used?

Thx, removed.

WeichenXu123 · 2022-07-27T13:02:11Z

python-package/xgboost/spark/core.py

-                    pandas_df_iter,
-                    None,
-                    dmatrix_kwargs,
+                    pandas_df_iter, features_cols_names, dmatrix_kwargs


What if features_cols_names conflicts with the label / weight / base_margin column names?

Thx, I tried it and it worked; no exception was raised.

trivialfis · 2022-07-28T01:59:16Z

~~Could you please add a test for cross validation for sparkml as well?~~
NVM. I don't think it's needed

wbo4958 · 2022-08-02T09:45:59Z

@trivialfis could you help to review again. Thx

trivialfis

Thank you for the PR! Some questions in the comments.

trivialfis · 2022-08-03T05:37:37Z

python-package/xgboost/spark/core.py

+        feature_col_names = self.getOrDefault(self.features_cols)
+        features_col = []
+        if (
+            len(feature_col_names)


This condition looks weird?

len(feature_col_names) > 0 >= len(...)

is it correct?

Yes, pylint required this.

all(c in dataset.columns for c in feature_col_names)

Done, I now use a set-based check instead.
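
The reviewer's `all(...)` suggestion and a set-based check express the same membership test. A minimal sketch, assuming plain Python lists in place of `dataset.columns`; the helper name `missing_feature_cols` is hypothetical, and the exact check that landed in the PR is not shown in this conversation.

```python
def missing_feature_cols(feature_col_names, dataset_columns):
    """Return the requested feature columns that the dataset lacks, sorted."""
    return sorted(set(feature_col_names) - set(dataset_columns))


columns = ["f0", "f1", "f2", "label"]

# Every requested feature column exists, so nothing is missing.
assert missing_feature_cols(["f0", "f2"], columns) == []
assert all(c in columns for c in ["f0", "f2"])

# "f9" is not in the dataset; the set difference reports it by name.
assert missing_feature_cols(["f0", "f9"], columns) == ["f9"]
assert not all(c in columns for c in ["f0", "f9"])

# Python chains comparisons: `a > 0 >= b` means `a > 0 and 0 >= b`.
a, b = 3, 0
assert (a > 0 >= b) == (a > 0 and 0 >= b)
```

Compared with the chained `len(...) > 0 >= len(...)` form, the set difference has the side benefit of naming the missing columns, which is handy for an error message.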

trivialfis · 2022-08-03T05:41:20Z

python-package/xgboost/spark/core.py

-        select_cols = [features_array_col, label_col]
+        select_cols = [label_col]
+        features_cols_names = None
+        if len(self.getOrDefault(self.features_cols)):


Suggested change

-        if len(self.getOrDefault(self.features_cols)):
+        if self.getOrDefault(self.features_cols):
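
The suggestion relies on Python's truthiness rules: an empty list is falsy, so `if xs:` and `if len(xs):` take the same branch for lists. A small sketch, with a hypothetical helper `has_feature_cols` standing in for the `getOrDefault` call; the `None` case is included only to show where the two spellings differ, not because the Param is known to return it.

```python
def has_feature_cols(value):
    """Truthiness check, mirroring the suggested `if self.getOrDefault(...):`."""
    return bool(value)


assert has_feature_cols(["f0", "f1"]) is True
assert has_feature_cols([]) is False

# len(None) raises TypeError, whereas the truthiness check just returns False.
assert has_feature_cols(None) is False
try:
    len(None)
except TypeError:
    len_failed = True
else:
    len_failed = False
assert len_failed
```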

trivialfis · 2022-08-03T10:48:14Z

note to myself: need to update the linter script.

WeichenXu123

LGTM, leave one question though.

WeichenXu123 · 2022-08-08T08:58:14Z

@trivialfis @wbo4958
I will file a PR for #8108 once this PR is merged.
I hope my next PR will go into xgboost 2.0.
Thanks!

wbo4958 · 2022-08-08T09:07:02Z

@trivialfis @WeichenXu123 Thx for reviewing.

[pyspark] support a list of feature column names

a74b472

trivialfis reviewed Jul 26, 2022

wbo4958 added 3 commits July 27, 2022 10:32

add tests

dbfdf04

add spark ml pipeline test

bc58b4b

format

4a0d399

WeichenXu123 reviewed Jul 27, 2022

WeichenXu123 reviewed Jul 27, 2022

WeichenXu123 reviewed Jul 27, 2022

WeichenXu123 reviewed Jul 27, 2022

WeichenXu123 reviewed Jul 27, 2022

wbo4958 added 9 commits July 28, 2022 13:06

resolve comments

727dadb

fix bug

494d80d

update

af3dcd1

fix gpu id

32dad71

fix

1da666f

format

5e3382a

format

a1766c0

add max_bin support

2096997

move the gpu dask tests running behind spark

0f388d9

the gpu dask tests may corrupt the whole gpu env

wbo4958 force-pushed the features_cols branch from c0b1a99 to 0f388d9 Compare August 2, 2022 03:15

wbo4958 added 5 commits August 2, 2022 11:27

pylint

6dbfd01

update cv parameters

b002cbc

pylint issue

e32a638

format

6880928

Merge remote-tracking branch 'upstream/master' into features_cols

366fdb0

wbo4958 requested review from WeichenXu123 and trivialfis and removed request for WeichenXu123 August 3, 2022 01:17

trivialfis mentioned this pull request Aug 3, 2022

Conflicts between dask and pyspark GPU tests. #8134

Closed

trivialfis reviewed Aug 3, 2022

resolve comments

832989f

Merge remote-tracking branch 'upstream/master' into features_cols

f907ec0

trivialfis approved these changes Aug 8, 2022

WeichenXu123 approved these changes Aug 8, 2022

trivialfis merged commit 03cc3b3 into dmlc:master Aug 8, 2022

wbo4958 deleted the features_cols branch August 8, 2022 09:07
