
XGBoost4j-spark CrossValidation train FAILED on multi-GPU environment: Multiple processes running on same CUDA device is not supported! #10200

Open
NvTimLiu opened this issue Apr 17, 2024 · 1 comment


Running the latest 2.1.0-SNAPSHOT XGBoost4j-spark CrossValidation training together with the Spark plugin rapids-4-spark 24.06.0-SNAPSHOT on a multi-GPU environment,

the training failed with: Multiple processes running on same CUDA device is not supported!

ENV:

  • 2.1.0-SNAPSHOT XGBoost4j-spark
  • rapids-4-spark 24.06.0-SNAPSHOT
  • 4 GPUs with 2 workers (2×2 GPUs) on a Spark standalone cluster
  • App: Mortgage JVM CrossValidation train.
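For context, a minimal sketch of the kind of `spark-submit` GPU settings assumed for a standalone cluster with 2 workers × 2 GPUs (the host name, discovery script path, and amounts here are illustrative assumptions, not the exact configuration used in this run):

```shell
# Hypothetical settings: one task per GPU is the expected layout.
# If multiple XGBoost tasks end up sharing a CUDA device, the NCCL
# communicator check shown below fails.
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  ...
```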

Detailed logs attached:

driver.log

executor.log

/workspace/src/collective/nccl_device_communicator.cu:45: Check failed: n_uniques == world_size_ (2 vs. 4) : Multiple processes within communication group running on same CUDA device is not supported. 63ecb99c4c7bb16d1cf28df4f220cdc9
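The failed check asserts that the number of distinct CUDA device ordinals across the communication group equals the world size, i.e. every rank must sit on its own GPU. A minimal Java sketch of that invariant (hypothetical illustration, not XGBoost's actual code; the `(2 vs. 4)` in the log corresponds to 4 ranks landing on only 2 distinct devices):

```java
import java.util.HashSet;
import java.util.Set;

public class DeviceCheck {
    // Mirrors the n_uniques == world_size_ check: true only when every
    // rank in the group maps to a distinct CUDA device ordinal.
    static boolean devicesAreUnique(int[] deviceOrdinals, int worldSize) {
        Set<Integer> uniques = new HashSet<>();
        for (int d : deviceOrdinals) {
            uniques.add(d);
        }
        return uniques.size() == worldSize;
    }

    public static void main(String[] args) {
        // 4 ranks but only 2 distinct devices (as in the log: 2 vs. 4) -> fails
        System.out.println(devicesAreUnique(new int[]{0, 1, 0, 1}, 4)); // false
        // 4 ranks on 4 distinct devices -> passes
        System.out.println(devicesAreUnique(new int[]{0, 1, 2, 3}, 4)); // true
    }
}
```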

Stack trace:
(0) /raid/tmp/libxgboost4j563501428309033116.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f54c71103ae]
(1) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::collective::NcclDeviceCommunicator::NcclDeviceCommunicator(int, bool, xgboost::StringView)+0x7ba) [0x7f54c78eafea]
(2) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::collective::Communicator::GetDevice(int)+0xf1) [0x7f54c78e58d1]
(3) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::common::SketchContainer::AllReduce(xgboost::Context const*, bool)+0x3cb) [0x7f54c79530cb]
(4) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::common::SketchContainer::MakeCuts(xgboost::Context const*, xgboost::common::HistogramCuts*, bool)+0xc1) [0x7f54c7953c01]
(5) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::data::IterativeDMatrix::InitFromCUDA(xgboost::Context const*, xgboost::BatchParam const&, void*, float, std::shared_ptr<xgboost::DMatrix>)+0x1d9b) [0x7f54c79eb64b]
(6) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int)+0x584) [0x7f54c7501164]
(7) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int)+0x77) [0x7f54c74ab7a7]
(8) /raid/tmp/libxgboost4j563501428309033116.so(XGQuantileDMatrixCreateFromCallback+0x1c8) [0x7f54c7142e28]


        at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
        at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:26)
        at ml.dmlc.xgboost4j.scala.QuantileDMatrix.<init>(QuantileDMatrix.scala:36)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildDMatrix(GpuPreXGBoost.scala:552)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildWatches$1(GpuPreXGBoost.scala:507)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuUtils$.time(GpuUtils.scala:140)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildWatches(GpuPreXGBoost.scala:507)
        at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildRDDWatches$4(GpuPreXGBoost.scala:484)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildWatchesAndCheck(XGBoost.scala:409)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:440)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:540)
        at scala.Option.map(Option.scala:230)
        at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:539)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
        at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

NvTimLiu (Author) commented:

@wbo4958 @trivialfis
