
[JVM Packages] ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training #5661

Closed
xulunfan opened this issue May 13, 2020 · 4 comments

Comments

xulunfan commented May 13, 2020

My environment is Spark 2.4.0 with xgboost4j-spark_2.11 1.0.0. When I run a pipeline containing a VectorAssembler and an ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier to fit a Dataset, I get an error like this:

```
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.90.128.79, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}

20/05/12 12:06:02 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: TaskKilled(another attempt succeeded,Vector(AccumulableInfo(73,None,Some(332),None,false,true,None), AccumulableInfo(75,None,Some(0),None,false,true,None)),Vector(LongAccumulator(id: 73, name: Some(internal.metrics.executorRunTime), value: 332), LongAccumulator(id: 75, name: Some(internal.metrics.resultSize), value: 0))), stopping SparkContext

20/05/12 12:06:02 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:935)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:933)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:933)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2139)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2052)
    at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1963)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1962)
    at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply$mcV$sp(SparkParallelismTracker.scala:119)
    at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
    at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:119)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.TaskFailedListener$$anon$1.run(SparkParallelismTracker.scala:118)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:740)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2075)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2115)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2140)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)

20/05/12 12:06:07 ERROR XGBoostSpark: the job was aborted due to
ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:697)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:572)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:190)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
    at org.apache.spark.ml.Predictor.fit(Predictor.scala:82)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
    at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
```

Can anyone help me? Thank you so much.

@trivialfis trivialfis changed the title ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training [JVM Packages] ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training May 13, 2020

anon-wt commented Jun 9, 2020

Has this been solved? I ran into the same problem.

@Williams-Hao

me too.

hcho3 (Collaborator) commented Sep 8, 2020

#6019 added an option to avoid killing the SparkContext: set kill_spark_context_on_worker_failure to false.
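
For reference, a minimal sketch of how that option could be passed, assuming a xgboost4j-spark release that includes #6019 (the other keys in the map are ordinary illustrative settings, not part of the fix):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParams = Map(
  "objective" -> "binary:logistic",
  "num_round" -> 100,
  "num_workers" -> 2,
  // false = keep the SparkContext alive when a training task fails,
  // instead of the default behavior of shutting it down
  "kill_spark_context_on_worker_failure" -> false
)

val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
```

Note that this only keeps the SparkContext alive after a worker failure; it does not address the underlying cause of the failed training task.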

@hcho3 hcho3 closed this as completed Sep 8, 2020
@mangolzy

> #6019 added an option to avoid killing the SparkContext: set kill_spark_context_on_worker_failure to false.

Where do I set this parameter? When initializing the XGBoostClassifier, or somewhere else?
I can reproduce this problem reliably: it fails every time on my full dataset, but succeeds with a smaller subset.
Versions:
spark 2.4
xgboost4j 0.90
jdk 1.8

Thanks for the advice!
