
ERROR ml.dmlc.xgboost4j.java.RabitTracker:[AutoML] Uncaught exception thrown by worker: java.lang.InterruptedException: null #10082

Open
Bishop-Cui opened this issue Feb 29, 2024 · 1 comment

I was running XGBoost training on my server with Spark in local mode.
I read some similar closed issues, but in those cases the tracker conf was empty, whereas my ERROR shows the tracker started with a fully populated conf. We have hit this problem on Windows for a long time; the crude workaround is to kill the Python process or reboot the machine. This is the first time we have seen it on Linux. A possible cause is that we changed Scala from 2.1.1 to 2.1.2. The xgboost4j version we are using is 1.1.2.
The log is below; any help tracking this down would be appreciated.

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.90.50.89, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
24/02/29 11:07:02 [ T548] ERROR ml.dmlc.xgboost4j.java.RabitTracker:[AutoML] Uncaught exception thrown by worker:
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1000) ~[?:1.8.0_302]
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1306) ~[?:1.8.0_302]
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:242) ~[scala-library-2.12.10.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:258) ~[scala-library-2.12.10.jar:?]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187) ~[scala-library-2.12.10.jar:?]
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:334) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:859) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$1(RDD.scala:1020) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:1018) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anon$1.run(XGBoost.scala:565) ~[xgboost4j-spark_2.12-1.1.2.jar:?]
24/02/29 11:07:02 [ T1] ERROR XGBoostSpark:[AutoML] the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:697) ~[xgboost4j-spark_2.12-1.1.2.jar:?]
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:573) ~[xgboost4j-spark_2.12-1.1.2.jar:?]
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191) ~[xgboost4j-spark_2.12-1.1.2.jar:?]
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40) ~[xgboost4j-spark_2.12-1.1.2.jar:?]
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.Predictor.fit(Predictor.scala:115) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at scala.collection.Iterator.foreach(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
at scala.collection.Iterator.foreach$(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) ~[spark-mllib_2.12-3.1.1.jar:3.1.1]
at scala.util.Try$.apply(Try.scala:213) [scala-library-2.12.10.jar:?]
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) [spark-mllib_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133) [spark-mllib_2.12-3.1.1.jar:3.1.1]
at com.huawei.tech.ueba.tree.TreeModelTrainAndPredict$.trainModel(TreeModelTrainAndPredict.scala:630) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
at com.huawei.tech.ueba.tree.TreeModelTrainAndPredict$.trainAndPredict(TreeModelTrainAndPredict.scala:122) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
at com.huawei.tech.ueba.fastpoc.model.TreeModelEst.fitModel(TreeModelEst.scala:69) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
at com.huawei.tech.ueba.rca.explainchange.ExplainUtils.ModelAndShap(ExplainUtils.scala:216) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.trainProcess(PipelineDemo.scala:40) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.process(PipelineDemo.scala:132) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo$.main(PipelineDemo.scala:161) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.main(PipelineDemo.scala) [smartDecision-1.0.T2-SNAPSHOT.jar:?]
24/02/29 11:07:02 [ T72] ERROR org.apache.spark.scheduler.TaskSchedulerImpl:[AutoML] Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$3@e9b91bc rejected from java.util.concurrent.ThreadPoolExecutor@504cbd30[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4647]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[?:1.8.0_302]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_302]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_302]
at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) ~[spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) [spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) [spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) [spark-core_2.12-3.1.1.jar:3.1.1]
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) [spark-core_2.12-3.1.1.jar:3.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_302]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_302]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
24/02/29 11:07:02 [ T1] ERROR org.apache.spark.ml.util.Instrumentation:[AutoML] ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:697)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:573)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
at com.huawei.tech.ueba.tree.TreeModelTrainAndPredict$.trainModel(TreeModelTrainAndPredict.scala:630)
at com.huawei.tech.ueba.tree.TreeModelTrainAndPredict$.trainAndPredict(TreeModelTrainAndPredict.scala:122)
at com.huawei.tech.ueba.fastpoc.model.TreeModelEst.fitModel(TreeModelEst.scala:69)
at com.huawei.tech.ueba.rca.explainchange.ExplainUtils.ModelAndShap(ExplainUtils.scala:216)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.trainProcess(PipelineDemo.scala:40)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.process(PipelineDemo.scala:132)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo$.main(PipelineDemo.scala:161)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.main(PipelineDemo.scala)

Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:697)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:573)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:151)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
at org.apache.spark.ml.Pipeline.$anonfun$fit$5(Pipeline.scala:151)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$4(Pipeline.scala:151)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at org.apache.spark.ml.Pipeline.$anonfun$fit$2(Pipeline.scala:147)
at org.apache.spark.ml.MLEvents.withFitEvent(events.scala:130)
at org.apache.spark.ml.MLEvents.withFitEvent$(events.scala:123)
at org.apache.spark.ml.util.Instrumentation.withFitEvent(Instrumentation.scala:42)
at org.apache.spark.ml.Pipeline.$anonfun$fit$1(Pipeline.scala:133)
at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
at com.huawei.tech.ueba.tree.TreeModelTrainAndPredict$.trainModel(TreeModelTrainAndPredict.scala:630)
at com.huawei.tech.ueba.tree.TreeModelTrainAndPredict$.trainAndPredict(TreeModelTrainAndPredict.scala:122)
at com.huawei.tech.ueba.fastpoc.model.TreeModelEst.fitModel(TreeModelEst.scala:69)
at com.huawei.tech.ueba.rca.explainchange.ExplainUtils.ModelAndShap(ExplainUtils.scala:216)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.trainProcess(PipelineDemo.scala:40)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.process(PipelineDemo.scala:132)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo$.main(PipelineDemo.scala:161)
at com.huawei.tech.ueba.rca.explainchange.PipelineDemo.main(PipelineDemo.scala)
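For reference, in xgboost4j-spark 1.x the Rabit tracker implementation can be selected through the `tracker_conf` training parameter. The sketch below is an untested assumption based on the 1.1.x-era API (the `TrackerConf(workerConnectionTimeout, trackerImpl)` signature and the `"scala"` implementation name come from that version's docs); switching to the JVM tracker removes the dependency on the local Python environment entirely:

```scala
import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassifier}

// Use the JVM-based Rabit tracker instead of the default Python one.
// TrackerConf(workerConnectionTimeout, trackerImpl); 0L means no timeout.
val xgbParams = Map(
  "objective"    -> "binary:logistic",
  "num_round"    -> 100,
  "num_workers"  -> 1,
  "tracker_conf" -> TrackerConf(0L, "scala")
)

val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
```

If the training then succeeds, the failure is in the Python tracker process rather than in the Spark job itself.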

@Bishop-Cui (Author)

An update: I changed the environment from Python 3.10 to Python 3.8, and it works now. I would still like to understand the root cause of this error so we can avoid it for good; any ideas would be appreciated.
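One plausible explanation (an assumption, not confirmed in this thread): the Python-based Rabit tracker script bundled with older xgboost4j releases called `Thread.isAlive()`, a camelCase alias that was deprecated and then removed in Python 3.9. On Python 3.9+ the tracker process dies with an `AttributeError`, and the JVM side only ever sees the worker's `InterruptedException`. That would match Python 3.8 working while 3.10 fails. A quick way to check whether your interpreter still has the alias:

```python
import sys
import threading

t = threading.Thread(target=lambda: None)
t.start()
t.join()

# is_alive() is the supported spelling on every Python 3 version.
print(hasattr(t, "is_alive"))  # True
# The camelCase alias was removed in Python 3.9, which would break
# any old tracker script still calling t.isAlive().
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"isAlive present = {hasattr(t, 'isAlive')}")
```

If that is the cause, upgrading to a newer xgboost4j release (whose tracker uses `is_alive`) should also fix it without pinning Python 3.8.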

@Bishop-Cui Bishop-Cui changed the title RROR ml.dmlc.xgboost4j.java.RabitTracker:[AutoML] Uncaught exception thrown by worker: java.lang.InterruptedException: null ERROR ml.dmlc.xgboost4j.java.RabitTracker:[AutoML] Uncaught exception thrown by worker: java.lang.InterruptedException: null Mar 20, 2024