
[pyspark] Fix xgboost spark estimator dataset repartition issues #8231

Merged: 8 commits into dmlc:master on Sep 22, 2022

Conversation

@WeichenXu123 (Contributor) commented Sep 8, 2022

Signed-off-by: Weichen Xu weichen.xu@databricks.com

Fix xgboost spark estimator dataset repartition issues

  1. Fix the issue of repartition generating empty partitions by repartitioning on a rand(1) column.
  2. When num_workers=1 and force_repartition=True, do not raise an error; still repartition to 1 partition.
  3. When validationIndicatorCol is set, always trigger repartition.

Closes #8221
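The intent of fix 1 can be illustrated with a small pure-Python simulation of hash partitioning (this is not Spark's actual shuffle implementation, and `partition_counts` is a hypothetical helper): rows that all share one partitioning key collapse into a single partition, while a per-row random key, analogous to repartitioning on a `rand(1)` column, spreads rows across all partitions.

```python
import random

def partition_counts(keys, num_partitions):
    """Count how many rows land in each partition under hash partitioning."""
    counts = [0] * num_partitions
    for k in keys:
        counts[hash(k) % num_partitions] += 1
    return counts

# Every row shares the same key: all 100 rows hash into one partition,
# leaving the other three empty -- the skew this PR works around.
skewed = partition_counts([0] * 100, 4)

# A random key per row (analogous to a rand(seed) column) spreads the
# rows across all partitions.
rng = random.Random(1)
balanced = partition_counts([rng.randrange(1_000_000) for _ in range(100)], 4)
```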

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 (Contributor, Author)

CC @trivialfis @wbo4958 Thanks!

@WeichenXu123 force-pushed the fix-repartition branch 2 times, most recently from 0828de4 to 6701cd3 on September 9, 2022 14:36
@WeichenXu123 (Contributor, Author)

CC @wbo4958

```python
# Repartition on a `rand` column to avoid an unbalanced repartition
# result. Directly using `.repartition(N)` might produce some
# empty partitions.
dataset = dataset.repartition(num_workers, rand(1))
```
A Contributor commented on this change:
I still prefer adding a parameter to control whether to randomly shuffle. Since my PR #8245 can detect whether the training data is an empty DMatrix, we can add some log messages about data skew and how to resolve it by enabling the xxx parameter. Does that make sense?
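The opt-in suggested above could be sketched as a simple decision gate (an illustrative sketch only; names like `enable_random_shuffle` and the returned action strings are hypothetical, not the actual xgboost.spark estimator API):

```python
def plan_repartition(num_df_partitions, num_workers, force_repartition,
                     enable_random_shuffle):
    """Decide which repartition action to take before training.

    Illustrative sketch of the opt-in idea discussed in this review,
    not the real estimator logic.
    """
    if num_df_partitions == num_workers and not force_repartition:
        return "none"
    # Opting in to a random shuffle balances partitions at the cost
    # of a full shuffle; otherwise fall back to a plain repartition.
    return "repartition_rand" if enable_random_shuffle else "repartition"
```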

@hcho3 (Collaborator) commented Sep 14, 2022

Retriggering the CI, as we recently transitioned away from Jenkins.

@WeichenXu123 (Contributor, Author)

@wbo4958 Is the PR good to merge (except for the lint errors)?

@trivialfis (Member)

Hi, could you please share the current status of this fix? Is it still necessary after @wbo4958's PR?

@wbo4958 (Contributor) commented Sep 17, 2022

No, we can file a follow-up PR for that. This PR looks good to me.

@WeichenXu123 (Contributor, Author)

@trivialfis @wbo4958 Ready for merging! :)


@wbo4958 (Contributor) commented Sep 22, 2022

+1'ed

@trivialfis trivialfis merged commit ab342af into dmlc:master Sep 22, 2022
Successfully merging this pull request may close these issues.

[pyspark] SparkXGBClassifier failed to train with early_stopping_rounds and validation_indicator_col
4 participants