[jvm-packages] move the dmatrix building into rabit context #7823

Merged 3 commits into dmlc:master from dmatrix-in-rabit on Apr 22, 2022

wbo4958 (Contributor) commented Apr 20, 2022

For multi-node training with the full GPU pipeline, the DMatrix must be built inside the Rabit context.
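The ordering described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `FakeComm` and `FakeQuantileMatrix` are hypothetical stand-ins (they are not xgboost4j classes), and the assumption is that a quantile-based matrix has to exchange data with other workers while it is being constructed, so the communicator must already be initialised at that point.

```java
import java.util.ArrayList;
import java.util.List;

public class RabitContextSketch {
    // Hypothetical stand-in for a collective communicator such as Rabit.
    static final class FakeComm {
        private boolean active = false;
        final List<String> log = new ArrayList<>();

        void init() { active = true; log.add("init"); }
        void shutdown() { active = false; log.add("shutdown"); }
        boolean isActive() { return active; }
    }

    // Hypothetical stand-in for a matrix whose construction needs
    // cross-worker communication (assumed here for illustration).
    static final class FakeQuantileMatrix {
        FakeQuantileMatrix(FakeComm comm) {
            if (!comm.isActive()) {
                throw new IllegalStateException("communicator is not initialised");
            }
            comm.log.add("build-matrix");
        }
    }

    // One training task: the matrix is built after init() and before shutdown().
    static List<String> trainTask(FakeComm comm) {
        comm.init();
        try {
            new FakeQuantileMatrix(comm); // built inside the communicator context
            comm.log.add("train");
        } finally {
            comm.shutdown();              // always release the communicator
        }
        return comm.log;
    }

    public static void main(String[] args) {
        System.out.println(trainTask(new FakeComm()));
    }
}
```

Building `FakeQuantileMatrix` before `init()` throws in this sketch, which is the kind of ordering problem the move is meant to avoid.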

trivialfis (Member) commented

Stopped the build because it hung.

wbo4958 (Contributor Author) commented Apr 20, 2022

Headache. I can run it on two different bare-metal machines and in one local Docker environment without any issue.

wbo4958 (Contributor Author) commented Apr 22, 2022

@trivialfis, for some reason one test hangs in the CI environment, so I have skipped that test to get this PR merged.

In a follow-up PR, I will add barrier execution mode, which launches all the XGBoost training tasks at the same time, and add a barrier during training to fix the issue of the SparkContext being killed.
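The barrier idea can be illustrated with a small stand-alone sketch. It uses `java.util.concurrent.CyclicBarrier` as a stand-in and is not Spark's barrier execution mode or the follow-up PR's code; `BarrierSketch` and `minReadySeenAfterBarrier` are hypothetical names. Each worker marks itself ready, waits at the barrier, and only then proceeds, so every worker should see all peers ready before "training" starts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierSketch {
    // Returns the smallest "ready" count any worker observed after the barrier.
    static int minReadySeenAfterBarrier(int numWorkers) {
        // The pool must hold all workers at once, otherwise the barrier never trips.
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        CyclicBarrier barrier = new CyclicBarrier(numWorkers);
        AtomicInteger ready = new AtomicInteger();
        List<Future<Integer>> results = new ArrayList<>();
        try {
            for (int i = 0; i < numWorkers; i++) {
                results.add(pool.submit(() -> {
                    ready.incrementAndGet();             // this worker has started
                    barrier.await(30, TimeUnit.SECONDS); // wait for every other worker
                    return ready.get();                  // "training" would begin here
                }));
            }
            int min = Integer.MAX_VALUE;
            for (Future<Integer> f : results) {
                min = Math.min(min, f.get());
            }
            return min;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(minReadySeenAfterBarrier(4));
    }
}
```

The sketch only works because the thread pool is large enough to run every worker simultaneously, which loosely mirrors the requirement that all training tasks be launched at the same time.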

@trivialfis trivialfis merged commit c45665a into dmlc:master Apr 22, 2022
@wbo4958 wbo4958 deleted the dmatrix-in-rabit branch April 22, 2022 22:23
@trivialfis trivialfis mentioned this pull request Apr 25, 2022
7 tasks
trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Apr 29, 2022
This fixes the QuantileDeviceDMatrix in distributed environment.
trivialfis added a commit that referenced this pull request Apr 29, 2022
* [jvm-packages] move the dmatrix building into rabit context (#7823)

This fixes the QuantileDeviceDMatrix in distributed environment.

* [doc] update the jvm tutorial to 1.6.1 [skip ci] (#7834)

* [Breaking][jvm-packages] Use barrier execution mode (#7836)

With the introduction of barrier execution mode, we no longer need to kill the SparkContext when some xgboost tasks fail. Instead, Spark will handle the errors for us. So in this PR, the `killSparkContextOnWorkerFailure` parameter is removed.

* [doc] remove the doc about killing SparkContext [skip ci] (#7840)

* [jvm-package] remove the coalesce in barrier mode (#7846)

* [jvm-packages] Fix model compatibility (#7845)

* Ignore all Java exceptions when looking for Linux musl support (#7844)

Co-authored-by: Bobby Wang <wbo4958@gmail.com>
Co-authored-by: Michael Allman <msa@allman.ms>