Remove stop process. #143

trivialfis · 2020-08-04T05:00:34Z

No description provided.

trivialfis · 2020-08-04T05:01:15Z

@chenqin

trivialfis · 2020-08-04T05:01:25Z

@hcho3

CodingCat · 2020-08-04T05:05:25Z

Why do we want to do this? Will this break any XGBoost JVM unit tests?

trivialfis · 2020-08-04T05:16:21Z

@CodingCat Not sure, is there any test that's expecting rabit to bring down all the workers in the process? I will submit a PR on XGBoost to run the full test instead.

Also, you proposed that we might subsume rabit into XGBoost and I refused to do so before. I'm starting to think that you are right. What do you think of it now?

trivialfis · 2020-08-04T05:22:15Z

> Why do we want to do this?

I'm working on dmlc/xgboost#4826; it's still in early planning. I need to clean up the rabit codebase first.

chenqin · 2020-08-04T13:46:39Z

I think exiting on error was the default implementation until we realized it was causing trouble on the JVM side, so we added the option to throw an exception instead, controlled by this flag. Please test whether this breaks any unit tests, but feature-wise it seems okay AFAIK.
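
The two behaviors described here can be sketched roughly as follows. This is only an illustration, not rabit's actual code: `report_error`, `WorkerError`, `run_task`, and the `stop_process` parameter are hypothetical names chosen for the example.

```python
import sys


class WorkerError(RuntimeError):
    """Hypothetical exception raised instead of killing the process."""


def report_error(message, stop_process):
    """Handle a fatal worker error in one of two ways.

    stop_process=True  -> exit the whole process (the older default).
    stop_process=False -> raise, so the host (e.g. a JVM task layer) can clean up.
    """
    if stop_process:
        sys.exit(f"fatal: {message}")
    raise WorkerError(message)


def run_task(stop_process):
    """Hypothetical task wrapper that can only recover from a raised error."""
    try:
        report_error("allreduce failed", stop_process)
    except WorkerError as err:
        return f"recovered: {err}"
```

With `stop_process=False` the caller gets a chance to handle the failure; with `stop_process=True` the `SystemExit` bypasses the `except WorkerError` handler and takes the process down.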

trivialfis · 2020-08-04T14:07:47Z

@chenqin I revisited your doc around bootstrapping cache recently. Do you have suggestions on handling duplicated allreduce function calls?

chenqin · 2020-08-04T14:23:05Z

> @chenqin I revisited your doc around bootstrapping cache recently. Do you have suggestions on handling duplicated allreduce function calls?

Sorry about the late reply. Can you share the new approaches? Regarding duplicated allreduce function calls, I think the unresolved issue is a caller that goes through an encapsulated method, where we may not be able to generate a unique caller footprint. That makes it hard for the bootstrap cache to match the historical cache. I used to have some thoughts on how to fix this:

1. Fall back to the seq no approach as it used to be, and apply it only to the phase before init(). I think I used to observe an issue with the fast hist implementation back then; not sure whether that is still an issue.
2. Fall back to the seq no approach only if we observe a duplicated footprint (likely due to an encapsulated helper function), and merge the bootstrap phase into the rest of the cache.
3. Remove the bootstrap phase and stop supporting single point failure recovery.
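
To make the footprint problem concrete, here is a toy sketch of the second fallback listed above. It assumes the footprint is a (file, line, function) tuple; `BootstrapCache` and every name in it are hypothetical and do not reflect rabit's real data structures.

```python
class BootstrapCache:
    """Toy cache of allreduce results keyed by a caller footprint."""

    def __init__(self):
        self._by_footprint = {}
        self._by_seq = []
        self.duplicated = False

    def store(self, footprint, result):
        # Always record call order so a sequence-number lookup stays possible.
        self._by_seq.append(result)
        if footprint in self._by_footprint:
            # Same footprint seen twice, e.g. two calls routed through one helper.
            self.duplicated = True
        else:
            self._by_footprint[footprint] = result

    def lookup(self, footprint, seq_no):
        # Use the sequence number only once a duplicated footprint was observed.
        if self.duplicated:
            return self._by_seq[seq_no]
        return self._by_footprint[footprint]


# Both calls come from the same wrapper, so their footprints are identical.
helper = ("helper.cc", 10, "SyncHelper")
cache = BootstrapCache()
cache.store(helper, "column sums")
cache.store(helper, "row counts")
```

Here the footprint alone cannot distinguish the two results, so `cache.lookup(helper, 1)` has to rely on the sequence number to return the second one.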

trivialfis · 2020-08-04T14:38:06Z

So far I want to remove single point recovery. I thought about falling back to the seq number, since the calls should be strictly ordered and deterministic. But with async programming there might be endless edge cases that violate the ordering, for example when multiple DMatrix objects are constructed. Feel free to correct me if I'm wrong.

My expectation is that XGBoost can fail gracefully on both Spark and Dask.
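
A small sketch of the ordering concern, assuming two DMatrix constructions run concurrently and the scheduler interleaves them differently on each worker. The function and the call names are made up for illustration.

```python
def assign_seq_numbers(call_order):
    """Number allreduce calls in the order a worker happens to issue them."""
    return {seq: name for seq, name in enumerate(call_order)}


# Hypothetical interleavings: each worker issues the same two calls,
# but in a different order.
worker_a = assign_seq_numbers(["dmatrix_train", "dmatrix_valid"])
worker_b = assign_seq_numbers(["dmatrix_valid", "dmatrix_train"])

# Sequence numbers that refer to different calls on the two workers.
mismatched = [seq for seq in worker_a if worker_a[seq] != worker_b[seq]]
```

In this setup every sequence number ends up in `mismatched`, so a cache keyed only by seq number would pair unrelated calls across workers.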

chenqin · 2020-08-04T15:06:56Z

> So far I want to remove single point recovery. I thought about falling back to the seq number, since the calls should be strictly ordered and deterministic. But with async programming there might be endless edge cases that violate the ordering, for example when multiple DMatrix objects are constructed. Feel free to correct me if I'm wrong.
>
> My expectation is that XGBoost can fail gracefully on both Spark and Dask.

Sounds good. If we know for sure that the compute layer can't offer single point recovery, we should consider removing all caches and making the implementation much simpler.

trivialfis · 2020-08-04T16:27:45Z

@chenqin Thanks for the replies, let me think about it.

CodingCat · 2020-08-04T17:01:53Z

> @CodingCat Not sure, is there any test that's expecting rabit to bring down all the workers in the process? I will submit a PR on XGBoost to run the full test instead.

I think one or more tests rely on this flag to handle missing values, invalid metrics, or something similar in the task layer, so that the JVM process is not shut down and SparkContextKiller can handle the follow-up action.

If you remove this, those tests may not be runnable.

> Also, you proposed that we might subsume rabit into XGBoost and I refused to do so before. I'm starting to think that you are right. What do you think of it now?

I think it makes sense to move rabit under XGB, as I do not think anyone else is relying on this module, and XGB has a tight dependency on Rabit.

trivialfis · 2020-08-05T02:08:18Z

Can we merge this now that all tests on XGBoost are passing?

Remove stop process.

c61f69d

trivialfis requested a review from CodingCat August 4, 2020 05:01

trivialfis mentioned this pull request Aug 4, 2020

Rabit update. dmlc/xgboost#5978

Merged

CodingCat merged commit 4acdd7c into dmlc:master Aug 5, 2020

trivialfis deleted the remove-stop-process branch August 5, 2020 18:23
