Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-packages] Fix partition related issue #9491

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jinmfeng001
Copy link
Contributor

@jinmfeng001 jinmfeng001 commented Aug 16, 2023

In jvm package, there are some partition related issues.

  1. For ranking case(group column defined), it assumes that the input data is partitioned by group column and the data for the same group is in the same partition as aggByGroupInfo. But the assumptionis not guaranteed. In our real prodution use case, we have to repartition the input dataset by the group column and set the partition number same as the number of worker to bypass the repartition. Otherwise our ranking model perform extremely bad.
  2. When checkpoint is enabled, needDeterministicRepartitioning is set to true. And the data set will be repartitioned base on the attached key. But for the ranking case the repartition key should be the group column, otherwise the data with the same group will be shuffled to different partitions which broken the ranking model assumption. I don't see any reasons to only make xgboost deterministic when checkpoint is enabled, i think we can make xgboost always being deterministic, and partition the data based on the group column when it's defined and on the row hash when it's not defined.

After this fix, it's gauranteed that data with same group are in the same partition, and xgboost is always deterministic no matter checkpoint is enabled or not.

@trivialfis trivialfis changed the title Fix partition related issue [jvm-packages] Fix partition related issue Aug 16, 2023
@jinmfeng001 jinmfeng001 force-pushed the fix_partition_related_issue branch 2 times, most recently from 2106c43 to 51322bf Compare August 16, 2023 09:11
@jinmfeng001 jinmfeng001 force-pushed the fix_partition_related_issue branch 2 times, most recently from a6cde12 to 780cf7d Compare August 16, 2023 12:48
@trivialfis trivialfis added this to In progress in 2.1 Roadmap Aug 17, 2023
2. Make XGBoost always repartition input to num of works.
3. Make XGBoost partition key as group column if group column exists and row hash if no group column exists.
@trivialfis
Copy link
Member

Hi @jinmfeng001 , thank you for working on the ranking implementation (and other JVM package issues). I will look into this after 2.0 is released. Please note that we have revised the LTR implementation in 2.0, which doesn't affect the current status of the JVM package, but it brings a couple to-do items for us to explore and might create conflicts with changes made in this PR:

  • XGBoost might group and sort the samples internally based on query IDs.
  • XGBoost might change the strategy for training LTR models on distributed systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
2.1 Roadmap
In progress
Development

Successfully merging this pull request may close these issues.

None yet

2 participants