
[JVM Packages] ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training #5661

Closed
xulunfan opened this issue May 13, 2020 · 4 comments

Comments

xulunfan commented May 13, 2020

My environment is Spark 2.4.0 with xgboost4j-spark_2.11 1.0.0. When I run a pipeline containing a VectorAssembler and an ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier to fit a Dataset, I get an error like this:

```
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.90.128.79, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}

20/05/12 12:06:02 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: TaskKilled(another attempt succeeded,Vector(AccumulableInfo(73,None,Some(332),None,false,true,None), AccumulableInfo(75,None,Some(0),None,false,true,None)),Vector(LongAccumulator(id: 73, name: Some(internal.metrics.executorRunTime), value: 332), LongAccumulator(id: 75, name: Some(internal.metrics.resultSize), value: 0))), stopping SparkContext

20/05/12 12:06:02 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:935)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:933)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:933)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2139)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2052)
    at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1963)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1962)
    at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply$mcV$sp(SparkParallelismTracker.scala:119)
    at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
    at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.TaskFailedListener$$anon$1.run(SparkParallelismTracker.scala:118)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:740)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2075)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2115)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2140)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)

20/05/12 12:06:07 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:697)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:572)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:190)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
    at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
```

Can anyone help me? Thank you so much.

@trivialfis trivialfis changed the title ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training [JVM Packages] ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training May 13, 2020

anon-wt commented Jun 9, 2020

Has this been solved? I ran into the same problem.

@Williams-Hao

me too.

hcho3 (Collaborator) commented Sep 8, 2020

#6019 added an option to avoid killing the SparkContext: set kill_spark_context_on_worker_failure to false.
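
For reference, a minimal sketch of how that option could be passed, assuming a xgboost4j-spark release that includes #6019 (the other keys in the map are ordinary illustrative settings, not part of the fix):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParams = Map(
  "objective" -> "binary:logistic",
  "num_round" -> 100,
  "num_workers" -> 2,
  // false = keep the SparkContext alive when a training task fails,
  // instead of the default behavior of shutting it down
  "kill_spark_context_on_worker_failure" -> false
)

val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
```

Note that this only keeps the SparkContext alive after a worker failure; it does not address the underlying cause of the failed training task.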

@hcho3 hcho3 closed this as completed Sep 8, 2020
@mangolzy

> #6019 added an option to avoid killing the SparkContext: set kill_spark_context_on_worker_failure to false.

Where do I set this parameter? When initializing the XGBoostClassifier, or somewhere else?
I can reproduce this problem reliably: it fails every time on my full dataset, but succeeds with a smaller subset.
Versions:
spark 2.4
xgboost4j 0.90
jdk 1.8

Thanks for the advice!
