
XGBoost-Spark killing SparkContext on Task Failure #4826

Closed
lnmohankumar opened this issue Sep 3, 2019 · 12 comments

@lnmohankumar

Hi Team,

While running machine learning training jobs with XGBoost on the cluster, the SparkContext always gets shut down after any training task fails with an exception. As a result, we have to restart the cluster every time to bring it back to a normal state.

After looking for the root cause, we found the code that causes the SparkContext to close. I am not sure why the SparkContext has to shut down on any task failure; this also ends the other model-training jobs, which is not desirable.

https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/SparkParallelismTracker.scala#L127

The above code was rolled out in the 0.82 and 0.9 versions. Is it possible to fix this, or is there a reason for this change in the new versions?
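
To illustrate the behavior, the logic behind the linked line boils down to a SparkListener that stops the SparkContext as soon as any task fails. A rough sketch of that pattern (illustrative only, not a copy of the linked file; the class name is a placeholder):

```scala
import org.apache.spark.{SparkContext, TaskFailedReason}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Illustrative sketch: a listener that shuts down the whole SparkContext
// as soon as any task reports a failure.
class KillContextOnTaskFailure(sc: SparkContext) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.reason match {
      case reason: TaskFailedReason =>
        println(s"Task failed (${reason.toErrorString}); stopping SparkContext")
        sc.stop()   // this is what takes down every other job in the application
      case _ => ()  // task finished successfully, nothing to do
    }
  }
}
```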

@hanyucui

hanyucui commented Sep 3, 2019

To add to what @lnmohankumar said, I understand there are times when the entire Spark application should exit when XGBoost fails, but there are also many cases where a user would like Spark to stay alive. For example, in a Jupyter notebook, users might play with different scenarios and want to continue using Spark even if some XGBoost tasks fail. I can also imagine users wanting other models in the code to continue training even if XGBoost models fail. I think there should be a way to enable/disable this behavior in user code.

@CodingCat
Member

The current version of xgb does not behave normally when there is a failed task, i.e. the application would hang forever in that case.

This is the reason we have to kill the entire application in the case of a failure. @cq is working on fixing the fault-recovery strategy in xgb.

@CodingCat
Member

Additionally, training multiple models in parallel is undefined behavior in xgb; rabit has some problems fully supporting it.

@hanyucui

hanyucui commented Sep 3, 2019

@CodingCat Thanks for your response; it makes a lot of sense. However, when xgboost fails, a user might still want to train models from other frameworks, say scikit-learn or TensorFlow, which are independent of xgboost. Any suggestion on how to do that with the current version of xgboost?

@CodingCat
Member

While I didn't try it myself, I think forking a new process to run spark-submit should work.
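
A minimal sketch of that workaround, assuming the XGBoost training is packaged as its own application jar (the main class and jar path below are placeholders):

```scala
import scala.sys.process._

// Run the XGBoost training as a separate spark-submit process, so that even if
// that application kills its own SparkContext, the parent process is unaffected.
val cmd = Seq(
  "spark-submit",
  "--class", "com.example.TrainXgbModel",   // placeholder main class
  "--master", "yarn",
  "/path/to/xgb-training-job.jar"           // placeholder jar path
)
val exitCode = cmd.!   // blocks until the child process exits, returns its exit code
if (exitCode != 0) {
  println(s"XGBoost training failed with exit code $exitCode; parent Spark app keeps running")
}
```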

@hanyucui

hanyucui commented Sep 3, 2019

Thanks, @CodingCat. This is essentially what we are doing now. Let me clarify one last thing. Is it true that, when there is a failed task, xgboost would just hang and subsequent code in the main process will not be able to run? If that's the case, I agree we can only wait for the fault recovery fix @cq is working on.

@CodingCat
Member

CodingCat commented Sep 3, 2019 via email

@hanyucui

hanyucui commented Sep 4, 2019

Thanks, @CodingCat. Curious if there is an ETA for the fix.

@a-whitej

Met the same issue. Any update on how to fix it?

@a-whitej

Failure during GridSearch, with this error message:

2020-01-16 03:15:55,167 [ dispatcher-event-loop-2:47767420 ] - [ ERROR ] Lost executor 43 on ip-.cn-northwest-1.compute.internal: Container marked as failed: container_1577262625342_113301_01_000064 on host: ip-10-84-31-201.cn-northwest-1.compute.internal. Exit status: -100. Diagnostics: Container released on a lost node
2020-01-16 03:15:55,167 [ dispatcher-event-loop-7:47767420 ] - [ WARN ] Requesting driver to remove executor 43 for reason Container marked as failed: container_1577262625342_113301_01_000064 on host: ip--northwest-1.compute.internal. Exit status: -100. Diagnostics: Container released on a lost node
2020-01-16 03:15:55,167 [ dispatcher-event-loop-3:47767420 ] - [ WARN ] No more replicas available for rdd_461_1045 !
2020-01-16 03:15:55,167 [ spark-listener-group-executorManagement:47767420 ] - [ INFO ] Existing executor 43 has been removed (new total is 62)
2020-01-16 03:15:55,167 [ dispatcher-event-loop-2:47767420 ] - [ INFO ] Removal of executor 43 requested
2020-01-16 03:15:55,167 [ dispatcher-event-loop-2:47767420 ] - [ INFO ] Asked to remove non-existent executor 43
2020-01-16 03:15:55,167 [ dispatcher-event-loop-3:47767420 ] - [ WARN ] No more replicas available for rdd_461_2177 !

@hcho3
Collaborator

hcho3 commented Sep 9, 2020

Fixed in #6019. #6097 documents the behavior of the new parameter kill_spark_context_on_worker_failure.
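
A minimal sketch of using that parameter, assuming it is passed through the usual params map of XGBoostClassifier (see #6097 for the authoritative documentation):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Keep the SparkContext alive even if XGBoost worker tasks fail,
// so other jobs in the same application can continue running.
val xgbParams = Map(
  "objective" -> "binary:logistic",
  "num_round" -> 100,
  "num_workers" -> 4,
  "kill_spark_context_on_worker_failure" -> false
)
val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")
```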

@hcho3 hcho3 closed this as completed Sep 9, 2020
@FantDing

FantDing commented Jan 8, 2021

Yes

On Tue, Sep 3, 2019 at 3:29 PM Hanyu Cui wrote: Thanks, @CodingCat. This is essentially what we are doing now. Let me clarify one last thing. Is it true that, when there is a failed task, xgboost would just hang and subsequent code in the main process will not be able to run? If that's the case, I agree we can only wait for the fault recovery fix @cq is working on.

@CodingCat But I found that when I killed one executor that an xgb task was running on, xgb could still train normally; xgboost did not hang. I use code that just cancels the job rather than killing the SparkContext.
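
A minimal sketch of that approach, assuming the training is tagged with a job group so only that group is cancelled on failure (the group name and the placement of the failure handler are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("xgb-job-group-demo")
  .master("local[*]")   // local master for illustration only
  .getOrCreate()
val sc = spark.sparkContext

// Tag the XGBoost training with a job group so it can be cancelled in isolation.
sc.setJobGroup("xgboost-training", "XGBoost model training", interruptOnCancel = true)
try {
  // launch the XGBoost training here, e.g. classifier.fit(trainingDf)
} finally {
  sc.clearJobGroup()
}

// In a failure handler elsewhere, cancel only that group instead of stopping the context:
sc.cancelJobGroup("xgboost-training")   // other jobs keep running; sc stays alive
```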
