
[jvm-packages] bridge the gaps between jvm package and native xgboost #7802

Open
19 of 34 tasks
wbo4958 opened this issue Apr 13, 2022 · 12 comments
Comments

@wbo4958
Contributor

wbo4958 commented Apr 13, 2022

The JVM packages are far behind native XGBoost. I would like to file this issue to track missing features and bugs that should be fixed in the upcoming 2.0.0 release. Please feel free to add more.

New Features

Bugs

@hcho3 hcho3 pinned this issue Apr 13, 2022
@trivialfis trivialfis added this to 2.0 TODO in 2.0 Roadmap via automation Apr 13, 2022
@trivialfis
Member

Related #4793

@trivialfis trivialfis unpinned this issue Apr 25, 2022
@mallman
Contributor

mallman commented May 20, 2022

> [x] XGBoost4j-spark-GPU does not support multi-worker training.

Since this is checked off does this mean xgboost4j-spark-gpu supports multi-worker training? I have not been able to get anything other than 1 worker to work. Is there a particular configuration that needs to be applied to enable multi-worker training?

FYI I'm using XGBoost 1.6.1 and Spark 3.2.1.

@wbo4958
Contributor Author

wbo4958 commented May 23, 2022

@mallman, thanks for testing xgboost4j-spark-gpu. XGBoost 1.6.1 with Spark 3.2.1 is fine for multi-worker training.

Please note that each XGBoost worker requires 1 GPU per process, so if you are trying multi-worker training, please make sure you have multiple GPUs. You should also configure your Spark cluster with GPU support; please refer to https://nvidia.github.io/spark-rapids/Getting-Started/

As for how to submit the XGBoost job, please follow https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_gpu_tutorial.html#submit-the-application.

Please feel free to report back. Thanks very much.
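The submission described in the tutorial linked above might look roughly like the following sketch. The master URL, jar versions, application class, and discovery-script path are all placeholders to adapt to your cluster:

```shell
# Hypothetical spark-submit sketch for multi-GPU distributed training.
# One GPU per executor, one XGBoost worker per GPU.
spark-submit \
  --master spark://your-master:7077 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --jars xgboost4j-gpu_2.12-1.6.1.jar,xgboost4j-spark-gpu_2.12-1.6.1.jar \
  --class com.example.YourXGBoostApp \
  your-app.jar
```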

@wbo4958
Contributor Author

wbo4958 commented May 23, 2022

BTW, @mallman, have you seen an obvious speed-up?

@mallman
Contributor

mallman commented May 23, 2022

Hi @wbo4958. I think there's some ambiguity in my question. Let me clarify.

What I want to do is run distributed training with a single worker per executor, like we can do in CPU mode. I have been able to make it work if I configure my Spark job with spark.task.resource.gpu.amount set to 1. But then I can only run one task per executor at a time. This severely limits data-parallelism, and we are working with a very large training set, ~100,000,000 to ~1,000,000,000 records.

I'm starting to think that what I want is not achievable, at least not with ordinary Spark configuration. I think that maybe what I need is to use Spark's stage-level scheduling, introduced in Spark 3.1. We're using the standalone scheduler, which does not support this capability yet. So we may be stuck unless we switch to YARN or Kubernetes.

So my question is, is it possible to run distributed-mode training in GPU mode without limiting the number of running tasks per executor to 1? Cheers.

@wbo4958
Contributor Author

wbo4958 commented May 24, 2022

@mallman, I see what you mean.

Hmm. If, at any time, there is only one XGBoost application running on your cluster (without any other Spark application), then it's okay to set spark.task.resource.gpu.amount to a fraction. E.g., if your executor has 12 CPU cores and each task uses 1 CPU core, then spark.task.resource.gpu.amount should be set to 1/12 ≈ 0.08.
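Concretely, the fractional setting could be passed like this (the values are illustrative for a 12-core executor sharing one GPU; adjust them to your cluster):

```shell
# Illustrative: 12 concurrent 1-core tasks share one GPU per executor,
# so each task claims roughly 1/12 of the GPU.
spark-submit \
  --conf spark.executor.cores=12 \
  --conf spark.task.cpus=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.08 \
  ...
```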

@mallman
Contributor

mallman commented May 25, 2022

Hi @wbo4958. If I do that, all of the xgboost tasks run on a single executor, but no progress is made. I don't get an error either. It just waits.

@wbo4958
Contributor Author

wbo4958 commented May 26, 2022

@mallman, could you file a separate issue describing your problem, including your environment, scripts, and so on?

@mallman
Contributor

mallman commented May 27, 2022

@wbo4958 I'm sorry, but I don't know when I'll return to this effort. But basically the question is whether one can run distributed xgboost with gpus without sacrificing task-parallelism in non-xgboost stages.

@wbo4958
Contributor Author

wbo4958 commented May 29, 2022

> @wbo4958 I'm sorry, but I don't know when I'll return to this effort. But basically the question is whether one can run distributed xgboost with gpus without sacrificing task-parallelism in non-xgboost stages.

The answer is yes, as described in #7802 (comment). If you can't make it work, please file an issue with detailed information so we can figure out why it doesn't run successfully.

@wbo4958 wbo4958 changed the title [jvm-packages] Make up the gaps between jvm package and native xgboost [jvm-packages] bridge the gaps between jvm package and native xgboost Jun 8, 2022
@trivialfis trivialfis moved this from 2.0 TODO to Need prioritize in 2.0 Roadmap Dec 17, 2022
@shadyelgewily-slimstock

shadyelgewily-slimstock commented Jan 26, 2023

We have a strong appetite for categorical feature support in the JVM package and are willing to contribute, but it would help to get a more granular overview of what still needs to happen and which components we can contribute to in order to get this feature in. @wbo4958, any chance that we could extend the list of action points to clarify what is done and what still needs to happen? "Support categorical data in jvm" is a bit too vaguely defined for me, as a new contributor, to see where I can help.

@wbo4958
Contributor Author

wbo4958 commented Jan 30, 2023

Hi @shadyelgewily-slimstock, according to #8727 (comment), it seems you'd like to use the Java APIs to handle the categorical data instead of Spark? If so, I think the current xgboost4j package already covers your requirement; please see https://github.com/dmlc/xgboost/pull/7966/files#diff-303feb16c30765909c132d10a2a38788c0a5e6cce038eed115e58322c0016f2fR268-R270 and https://github.com/dmlc/xgboost/pull/7966/files#diff-303feb16c30765909c132d10a2a38788c0a5e6cce038eed115e58322c0016f2fR286-R288.

You can refer to this test https://github.com/dmlc/xgboost/pull/7966/files#diff-350a33aa9a66e2d51e745c5dc6a190113d2f0a2853a5974878686a30a2b0e47cR408-R430 for usage. Support for categorical data in xgboost4j-spark has not been implemented yet; you're welcome to contribute it. Thanks.
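A minimal sketch of what using categorical data from plain xgboost4j might look like. The encoding helper below is a hypothetical illustration (not part of xgboost4j) of turning category strings into numeric codes; the DMatrix calls from PR #7966 are shown only as comments, since they require the xgboost4j dependency, and their exact signatures should be checked against the linked test:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CategoricalEncoding {
    // Hypothetical helper: map category strings to stable float codes,
    // since XGBoost expects categorical values encoded as non-negative numbers.
    // The 'codes' map accumulates the category -> code assignment in
    // first-seen order, so the same map can be reused across columns/batches.
    static float[] encodeColumn(String[] values, Map<String, Integer> codes) {
        float[] out = new float[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = codes.computeIfAbsent(values[i], k -> codes.size());
            out[i] = code;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        float[] encoded = encodeColumn(new String[]{"red", "blue", "red"}, codes);
        // encoded == {0.0f, 1.0f, 0.0f}

        // Sketch of the xgboost4j usage (requires the xgboost4j dependency;
        // see the test linked above for the authoritative API):
        // DMatrix dtrain = new DMatrix(data, nrow, ncol, Float.NaN);
        // dtrain.setFeatureTypes(new String[]{"c", "q"}); // "c" = categorical
        // Booster booster = XGBoost.train(dtrain, params, rounds, watches, null, null);
        System.out.println(encoded.length);
    }
}
```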
