/workspace/src/collective/nccl_device_communicator.cu:45: Check failed: n_uniques == world_size_ (2 vs. 4) : Multiple processes within communication group running on same CUDA device is not supported. 63ecb99c4c7bb16d1cf28df4f220cdc9
Stack trace:
(0) /raid/tmp/libxgboost4j563501428309033116.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f54c71103ae]
(1) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::collective::NcclDeviceCommunicator::NcclDeviceCommunicator(int, bool, xgboost::StringView)+0x7ba) [0x7f54c78eafea]
(2) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::collective::Communicator::GetDevice(int)+0xf1) [0x7f54c78e58d1]
(3) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::common::SketchContainer::AllReduce(xgboost::Context const*, bool)+0x3cb) [0x7f54c79530cb]
(4) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::common::SketchContainer::MakeCuts(xgboost::Context const*, xgboost::common::HistogramCuts*, bool)+0xc1) [0x7f54c7953c01]
(5) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::data::IterativeDMatrix::InitFromCUDA(xgboost::Context const*, xgboost::BatchParam const&, void*, float, std::shared_ptr<xgboost::DMatrix>)+0x1d9b) [0x7f54c79eb64b]
(6) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::data::IterativeDMatrix::IterativeDMatrix(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int)+0x584) [0x7f54c7501164]
(7) /raid/tmp/libxgboost4j563501428309033116.so(xgboost::DMatrix* xgboost::DMatrix::Create<void*, void*, void (void*), int (void*)>(void*, void*, std::shared_ptr<xgboost::DMatrix>, void (*)(void*), int (*)(void*), float, int, int)+0x77) [0x7f54c74ab7a7]
(8) /raid/tmp/libxgboost4j563501428309033116.so(XGQuantileDMatrixCreateFromCallback+0x1c8) [0x7f54c7142e28]
at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
at ml.dmlc.xgboost4j.java.QuantileDMatrix.<init>(QuantileDMatrix.java:26)
at ml.dmlc.xgboost4j.scala.QuantileDMatrix.<init>(QuantileDMatrix.scala:36)
at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildDMatrix(GpuPreXGBoost.scala:552)
at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildWatches$1(GpuPreXGBoost.scala:507)
at ml.dmlc.xgboost4j.scala.rapids.spark.GpuUtils$.time(GpuUtils.scala:140)
at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.buildWatches(GpuPreXGBoost.scala:507)
at ml.dmlc.xgboost4j.scala.rapids.spark.GpuPreXGBoost$.$anonfun$buildRDDWatches$4(GpuPreXGBoost.scala:484)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildWatchesAndCheck(XGBoost.scala:409)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.buildDistributedBooster(XGBoost.scala:440)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$3(XGBoost.scala:540)
at scala.Option.map(Option.scala:230)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.$anonfun$trainDistributed$2(XGBoost.scala:539)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2(RDDBarrier.scala:51)
at org.apache.spark.rdd.RDDBarrier.$anonfun$mapPartitions$2$adapted(RDDBarrier.scala:51)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
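For context on what the failing check means: the assertion `n_uniques == world_size_ (2 vs. 4)` says the NCCL communicator saw 4 ranks in the group but only 2 distinct CUDA devices, i.e. two ranks shared each GPU. The following is a hedged sketch (not the actual XGBoost source; `DeviceCheck` and `devicesAreUnique` are hypothetical names) of the kind of uniqueness test that fires here:

```java
import java.util.Arrays;

public class DeviceCheck {
    // Sketch of the check: after all-gathering each rank's CUDA device
    // ordinal, the communicator requires one distinct device per rank.
    static boolean devicesAreUnique(int[] deviceOrdinals, int worldSize) {
        long nUniques = Arrays.stream(deviceOrdinals).distinct().count();
        return nUniques == worldSize;
    }

    public static void main(String[] args) {
        // 4 ranks on only 2 distinct GPUs -> fails, matching "2 vs. 4" above.
        System.out.println(devicesAreUnique(new int[]{0, 0, 1, 1}, 4)); // false
        // 4 ranks, each on its own GPU -> passes.
        System.out.println(devicesAreUnique(new int[]{0, 1, 2, 3}, 4)); // true
    }
}
```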
Running the latest 2.1.0-SNAPSHOT XGBoost4J-Spark CrossValidation training together with the Spark plugin rapids-4-spark 24.06.0-SNAPSHOT in a multi-GPU environment,
training failed with: Multiple processes running on same CUDA device is not supported!
ENV:
Detailed log attached :
driver.log
executor.log
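Not part of the original report, but a common trigger for this error is scheduling more concurrent Spark tasks per executor than there are GPUs, so several barrier tasks land on the same CUDA device. A sketch of `spark-submit` settings that pin one task to one GPU (the property names are standard Spark/RAPIDS configs; the values and the application jar path are illustrative only):

```shell
# Illustrative only: request exactly one GPU per executor and per task,
# so each XGBoost barrier task gets its own CUDA device.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  my-xgboost-app.jar   # hypothetical application jar
```

With this shape of configuration, the number of distinct CUDA devices should match the communicator's world size, which is what the failed check above requires.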