
[pyspark] Fix xgboost spark estimator dataset repartition issues #8231

Merged: 8 commits into dmlc:master on Sep 22, 2022

Conversation

@WeichenXu123 (Contributor) commented Sep 8, 2022

Signed-off-by: Weichen Xu weichen.xu@databricks.com

Fix xgboost spark estimator dataset repartition issues

  1. Fix the issue of repartition generating empty partitions by repartitioning on a rand(1) column.
  2. When num_workers=1 and force_repartition=True, do not raise an error; still repartition to 1 partition.
  3. When validationIndicatorCol is set, always trigger repartition.

Closes #8221
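The intent of fix 1 can be illustrated with a small pure-Python simulation of hash partitioning (this is not Spark's actual shuffle implementation, and `partition_counts` is a hypothetical helper): rows that all share one partitioning key collapse into a single partition, while a per-row random key, analogous to repartitioning on a `rand(1)` column, spreads rows across all partitions.

```python
import random

def partition_counts(keys, num_partitions):
    """Count how many rows land in each partition under hash partitioning."""
    counts = [0] * num_partitions
    for k in keys:
        counts[hash(k) % num_partitions] += 1
    return counts

# Every row shares the same key: all 100 rows hash into one partition,
# leaving the other three empty -- the skew this PR works around.
skewed = partition_counts([0] * 100, 4)

# A random key per row (analogous to a rand(seed) column) spreads the
# rows across all partitions.
rng = random.Random(1)
balanced = partition_counts([rng.randrange(1_000_000) for _ in range(100)], 4)
```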

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@WeichenXu123 (Contributor, Author)

CC @trivialfis @wbo4958 Thanks!

@WeichenXu123 force-pushed the fix-repartition branch 2 times, most recently from 0828de4 to 6701cd3 on September 9, 2022 14:36
@WeichenXu123 (Contributor, Author)

CC @wbo4958

```python
# Repartition on a `rand` column to avoid an unbalanced repartition
# result. Directly using `.repartition(N)` might produce some
# empty partitions.
dataset = dataset.repartition(num_workers, rand(1))
```
A Contributor commented on this change:
I still prefer adding a parameter to control whether to randomly shuffle. Since my PR #8245 can detect whether the training data is an empty DMatrix, we can add some log messages about data skew and how to resolve it by enabling the xxx parameter. Does that make sense?
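The opt-in suggested above could be sketched as a simple decision gate (an illustrative sketch only; names like `enable_random_shuffle` and the returned action strings are hypothetical, not the actual xgboost.spark estimator API):

```python
def plan_repartition(num_df_partitions, num_workers, force_repartition,
                     enable_random_shuffle):
    """Decide which repartition action to take before training.

    Illustrative sketch of the opt-in idea discussed in this review,
    not the real estimator logic.
    """
    if num_df_partitions == num_workers and not force_repartition:
        return "none"
    # Opting in to a random shuffle balances partitions at the cost
    # of a full shuffle; otherwise fall back to a plain repartition.
    return "repartition_rand" if enable_random_shuffle else "repartition"
```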

@hcho3 (Collaborator) commented Sep 14, 2022

Retriggering the CI, as we recently transitioned away from Jenkins.

@WeichenXu123 (Contributor, Author)

@wbo4958 Is the PR good to merge (except for the lint errors)?

@trivialfis (Member)

Hi, could you please share the current status of this fix? Is it still necessary after @wbo4958's PR?

@wbo4958 (Contributor) commented Sep 17, 2022

No, we can file a follow-up PR for that. This PR looks good to me.

@WeichenXu123 (Contributor, Author)

@trivialfis @wbo4958 Ready for merging! :)


@wbo4958 (Contributor) commented Sep 22, 2022

+1'ed

@trivialfis trivialfis merged commit ab342af into dmlc:master Sep 22, 2022
Successfully merging this pull request may close these issues.

[pyspark] SparkXGBClassifier failed to train with early_stopping_rounds and validation_indicator_col
4 participants