
[pyspark] Cleanup data processing. #8344

Merged
merged 5 commits into master from the pyspark-dmatrix branch on Oct 18, 2022

Conversation

trivialfis
Member

  • Enable additional combinations of ctor parameters.
  • Unify procedures for QuantileDMatrix and DMatrix.

Close #8341
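For context, the two bullets boil down to letting one code path build either matrix type from the collected partitions. A minimal sketch, assuming xgboost >= 1.7 where QuantileDMatrix is public; make_matrix is a hypothetical helper for illustration, not the function touched by this PR:

```python
from typing import List

import pandas as pd
import xgboost as xgb


def make_matrix(parts: List[pd.DataFrame], label: str, use_quantile: bool) -> xgb.DMatrix:
    # Concatenate the pandas partitions gathered on this worker.
    df = pd.concat(parts)
    X, y = df.drop(columns=[label]), df[label]
    if use_quantile:
        # QuantileDMatrix pre-computes histogram cuts for the hist-based tree
        # methods; it accepts the same data/label arguments as DMatrix.
        return xgb.QuantileDMatrix(X, label=y)
    return xgb.DMatrix(X, label=y)
```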

@wbo4958 Could you please share why this is necessary? Is it possible to use the normal features column with gpu_hist, at least in theory (maybe with lower initialization performance)?

if not self.getOrDefault(self.use_gpu):
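The constraint under discussion, as a hedged sketch (hypothetical class and names, not xgboost.spark's real validation code): the multi-column features_cols layout is only accepted when use_gpu is on, while CPU training expects the single assembled features vector column.

```python
class ParamsSketch:
    """Hypothetical stand-in for the estimator's parameter handling."""

    def __init__(self, use_gpu: bool, features_cols: list) -> None:
        self.use_gpu = use_gpu
        self.features_cols = features_cols

    def validate(self) -> None:
        # Mirror of the check quoted above: without use_gpu, multiple raw
        # feature columns are rejected in favour of one vector column.
        if not self.use_gpu and self.features_cols:
            raise ValueError(
                "features_cols is only supported together with use_gpu=True; "
                "use a single vector 'features' column for CPU training"
            )


ParamsSketch(use_gpu=True, features_cols=["f0", "f1"]).validate()  # accepted
ParamsSketch(use_gpu=False, features_cols=[]).validate()           # accepted
```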

@@ -208,10 +208,14 @@ def create_dmatrix_from_partitions( # pylint: disable=too-many-arguments

def append_m(part: pd.DataFrame, name: str, is_valid: bool) -> None:
    nonlocal n_features
    if name in part.columns and part[name].shape[0] > 0:
Contributor

Why get rid of the check "part[name].shape[0] > 0"?

Member Author

The empty data tests are passing.

Member Author

The empty partition bug seems to be fixed in the nightly Spark build. I couldn't reproduce the error.

Contributor

What? Really? I will check it, please hold on.

Contributor

We must add the check part[feature_cols].shape[0] > 0; otherwise stack_series will throw an exception. The latest change is OK.
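To illustrate the failure mode being described, here is a hedged stand-in for stack_series (the real helper lives in xgboost.spark's data handling; this sketch only mimics its stacking behaviour) showing why an empty partition needs the shape[0] > 0 guard:

```python
import numpy as np
import pandas as pd


def stack_series(series: pd.Series) -> np.ndarray:
    # Stack the array-valued entries of a series into one 2-D matrix.
    return np.stack(series.to_numpy())


rows = pd.Series([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
print(stack_series(rows).shape)  # (2, 2)

empty = pd.Series([], dtype=object)
try:
    stack_series(empty)
except ValueError as err:
    # np.stack refuses an empty sequence, hence the shape[0] > 0 check.
    print(err)  # "need at least one array to stack"
```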

@wbo4958
Contributor

wbo4958 commented Oct 17, 2022

one comment.

@rongou
Contributor

rongou commented Oct 17, 2022

Not sure; maybe the server didn't have time to start? The CI instances may be pretty overloaded. I wonder if we should add a wait in the test SetUp.
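A hedged illustration of the kind of wait being suggested (not the actual fix; PR #8351 mentioned below addressed the failure): poll the server's port in the test setup instead of assuming it is already listening.

```python
import socket
import time


def wait_for_server(host: str, port: int, timeout: float = 30.0) -> None:
    # Retry until the server accepts a TCP connection or the deadline passes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            time.sleep(0.5)
    raise TimeoutError(f"server at {host}:{port} did not come up within {timeout}s")
```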

@rongou
Contributor

rongou commented Oct 17, 2022

Sent PR #8351 that might help with this failure.

@wbo4958
Contributor

wbo4958 commented Oct 17, 2022

Could you please share why this is necessary? Is it possible to use the normal features column with gpu_hist, at least in theory (maybe with lower initialization performance)?

@trivialfis Databricks once suggested only enabling feature_cols when use_gpu is enabled. I think it's OK to enable it now.

@wbo4958
Contributor

wbo4958 commented Oct 18, 2022

LGTM

@trivialfis
Member Author

trivialfis commented Oct 18, 2022

Databricks once suggested only enabling feature_cols when use_gpu is enabled. I think it's OK to enable it now.

We will have to leave that to the next release. Writing extensive tests for the various combinations of parameters is not trivial.
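As a rough idea of what testing those combinations involves, a hedged pytest sketch (hypothetical test, not part of this PR); SparkXGBClassifier refers to the estimator in xgboost.spark:

```python
import pytest


@pytest.mark.parametrize("use_gpu", [True, False])
@pytest.mark.parametrize("multi_feature_cols", [True, False])
@pytest.mark.parametrize("tree_method", ["hist", "approx"])
def test_param_combinations(use_gpu: bool, multi_feature_cols: bool, tree_method: str) -> None:
    if multi_feature_cols and not use_gpu:
        pytest.skip("features_cols currently requires use_gpu=True")
    # Build a small SparkXGBClassifier with this combination, fit it on a tiny
    # DataFrame, and check that training and prediction succeed (omitted here).
```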

trivialfis merged commit 3901f5d into dmlc:master on Oct 18, 2022
trivialfis deleted the pyspark-dmatrix branch on October 18, 2022 at 06:56
Successfully merging this pull request may close these issues: [pyspark] Document for the use of GPU dependencies.