[pyspark] Fix xgboost spark estimator dataset repartition issues #8231
CC @trivialfis @wbo4958 Thanks!
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
CC @wbo4958
python-package/xgboost/spark/core.py
Outdated
# Repartition on `rand` column to avoid repartition
# result unbalance. Directly using `.repartition(N)` might result in some
# empty partitions.
dataset = dataset.repartition(num_workers, rand(1))
I would still prefer to add a parameter that controls whether the data is randomly shuffled. Since my PR #8245 can detect when the training data is an empty DMatrix, we can log a message about the data skew and explain how to resolve it by enabling the xxx parameter. Does that make sense?
Retriggering the CI, as we recently transitioned away from Jenkins.
@wbo4958 Is the PR good to merge (apart from the lint errors)?
Hi, could you please share the current status of this fix? Is it still necessary after @wbo4958's PR?
No, we can file a follow-up PR for it. This PR looks good to me.
@trivialfis @wbo4958 Ready for merging! :)
+1'ed
Signed-off-by: Weichen Xu weichen.xu@databricks.com
Fix xgboost spark estimator dataset repartition issues
Repartition on the `rand(1)` column.
If `validationIndicatorCol` is set, always trigger repartition.
Closes #8221
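
For readers skimming the thread, here is a rough sketch of the two behaviours described in the commit message. It is not the code merged in `python-package/xgboost/spark/core.py`: the helper names `needs_repartition` and `repartition_for_training`, the `force` flag, and the partition-count check are all assumptions made for illustration. Only the `dataset.repartition(num_workers, rand(1))` call is taken from the diff above.

```python
def needs_repartition(current_partitions, num_workers,
                      has_validation_indicator, force=False):
    """Hypothetical decision helper (not part of xgboost).

    Assumption: comparing partition counts is a simplified stand-in for
    whatever check the estimator really performs on the dataset.
    """
    # Per the commit message: if validationIndicatorCol is set,
    # always trigger a repartition.
    if force or has_validation_indicator:
        return True
    # Otherwise repartition only when the layout does not already match
    # the number of workers.
    return current_partitions != num_workers


def repartition_for_training(dataset, num_workers,
                             has_validation_indicator=False, force=False):
    """Hypothetical wrapper around the call shown in the diff."""
    # Imported lazily so the decision helper above works without Spark.
    from pyspark.sql.functions import rand

    if needs_repartition(dataset.rdd.getNumPartitions(), num_workers,
                         has_validation_indicator, force):
        # Repartition on a seeded random column so that rows are spread
        # across partitions by a random value.
        return dataset.repartition(num_workers, rand(1))
    return dataset
```

With a real Spark DataFrame `df`, a call such as `repartition_for_training(df, num_workers=4, has_validation_indicator=True)` would always reshuffle, while the same call without the validation flag would leave a DataFrame that already has four partitions untouched.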