Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove stop process. #143

Merged
merged 1 commit into from Aug 5, 2020
Merged

Conversation

trivialfis
Copy link
Member

No description provided.

@trivialfis
Copy link
Member Author

@chenqin

@trivialfis
Copy link
Member Author

@hcho3

@CodingCat
Copy link
Member

Why we want to do this? Will this break some xgb jvm unit test?

@trivialfis
Copy link
Member Author

trivialfis commented Aug 4, 2020

@CodingCat Not sure, is there any test that's expecting rabit to bring down the all workers in the process? I will submit a PR on XGBoost to run the full test instead.

Also you proposed that we might subsume rabit into XGBoost and I refused to do so in before. I'm starting to think that you are right, what do you think of it now?

@trivialfis
Copy link
Member Author

trivialfis commented Aug 4, 2020

Why we want to do this?

I'm working on dmlc/xgboost#4826, it's still on early planning. I need to cleanup the rabit codebase first.

@chenqin
Copy link
Contributor

chenqin commented Aug 4, 2020

I think error on exit were default implementation until we realized this is causing trouble in jvm side, we added throw exception with this flag. Please test if it will break any unit test but feature wise seems okay ASFAIK.

@trivialfis
Copy link
Member Author

trivialfis commented Aug 4, 2020

@chenqin I revisited your doc around bootstrapping cache recently. Do you have suggestions on handling duplicated allreduce function calls?

@chenqin
Copy link
Contributor

chenqin commented Aug 4, 2020

@chenqin I revisited your doc around bootstrapping cache recently. Do you have suggestions on handling duplicated allreduce function calls?

Sorry about late reply, can you share new approaches. Regarding to duplicated allreduce function calls, I think the issue unresolved is if caller try to pass in from a encapsulated method where we may not be able to generate a unique caller footprint. It will leave bootstrap cache hard to match historical cache. I used to have some thoughts to fix this..

  1. fallback to seq no approach as it used to be and just apply to before init() phase. I think I used to observe issue with fast hist implementation back then not sure if that were no longer issue.
  2. fallback to seq no approach only if we observe duplicated footprint (this is likely due to a encapsulated helper function) and merge bootstrap phase into rest of cache.
  3. remove bootstrap phase, stop support single point failure recover.

@trivialfis
Copy link
Member Author

trivialfis commented Aug 4, 2020

So far I want to remove single point recovery. I tried to think about falling back to using seq number as the calls should have strict ordering and deterministic. But with async programming, there might be infinite edge cases and violates the ordering for multiple DMatrix construction. Feel free to correct me if I'm wrong.

My expectation is XGBoost can fail gracefully on both spark and dask.

@chenqin
Copy link
Contributor

chenqin commented Aug 4, 2020

So far I want to remove single point recovery. I tried to think about falling back to using seq number as the calls should have strict ordering and deterministic. But with async programming, there might be infinite edge cases and violates the ordering for multiple DMatrix construction. Feel free to correct me if I'm wrong.

My expectation is XGBoost can fail gracefully on both spark and dask.

sounds good, if we know for sure compute layer can't offer single point recovery, we should consider remove all caches and make implementation much simple

@trivialfis
Copy link
Member Author

@chenqin Thanks for the replies, let me think about it.

@CodingCat
Copy link
Member

@CodingCat Not sure, is there any test that's expecting rabit to bring down the all workers in the process? I will submit a PR on XGBoost to run the full test instead.

I think there are one or more tests relying on this tag to handle missing value or invalid metrics or something in task layer to avoid shutting down JVM process and let SparkContextKiller to handle the post action

if you remove this, the test may not be runnable

Also you proposed that we might subsume rabit into XGBoost and I refused to do so in before. I'm starting to think that you are right, what do you think of it now?

I think it makes sense to move rabit under XGB, as I do not think anyone else are relying on this module and XGB has a tight dependency with Rabit

@trivialfis
Copy link
Member Author

Can we merge this now that all tests on XGBoost are passing?

@CodingCat CodingCat merged commit 4acdd7c into dmlc:master Aug 5, 2020
@trivialfis trivialfis deleted the remove-stop-process branch August 5, 2020 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants