New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix barrier seg fault and added test to mix it with multiple collectives #3313
Conversation
Signed-off-by: TJ <tix@uber.com>
e1405db
to
5efb865
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this! Just a couple of questions.
@@ -893,6 +893,10 @@ void Controller::FuseResponses(std::deque<Response>& responses, | |||
while (!responses.empty()) { | |||
|
|||
auto& new_response = responses.front(); | |||
if (new_response.response_type() == Response::ResponseType::BARRIER || | |||
new_response.response_type() == Response::ResponseType::JOIN) { | |||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be safer to have a continue
here, rather than a break
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think using break is safer here than continue since using continue will keep this fusion logic going. I think once we see a barrier response, it means we have reached an end of a control block, so we don't want to fuse the responses after the barrier(if there's any). Let me know if this makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, and this applies to join as well.
This loop is specific to allgather. Would it make sense to break the fusion loop in the same way for allreduce and adasum? (Even if it's not necessary there to shield us from accessing invalid pointers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for determining output size of allreduce operations are fairly straightforward, they are directly using the tensor_sizes field in the response object which is safe. Allgather is a special case since we need to inspect each dimensionality, so it needs a reference to the tensor itself.
We need to re-visit this logic once we support fusion for other ops.
Unit Test Results (with flaky tests) 930 files ± 0 930 suites ±0 9h 45m 28s ⏱️ + 17m 0s For more details on these failures, see this check. Results for commit cac7df3. ± Comparison against base commit df18797. ♻️ This comment has been updated with latest results. |
Yay, this fixes the macOS issues in #3301. |
Signed-off-by: TJ <tix@uber.com>
Head test failures seem to be related to this |
Yeah, the head issues should be fine to ignore here. @maxhgerlach are you happy to approve? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for going over my questions.
I've added an extra comment/question, but it's not urgent.
@@ -893,6 +893,10 @@ void Controller::FuseResponses(std::deque<Response>& responses, | |||
while (!responses.empty()) { | |||
|
|||
auto& new_response = responses.front(); | |||
if (new_response.response_type() == Response::ResponseType::BARRIER || | |||
new_response.response_type() == Response::ResponseType::JOIN) { | |||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, and this applies to join as well.
This loop is specific to allgather. Would it make sense to break the fusion loop in the same way for allreduce and adasum? (Even if it's not necessary there to shield us from accessing invalid pointers)
…ves (horovod#3313) Signed-off-by: TJ <tix@uber.com>
…llowing horovod#3300, horovod#3313 Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Checklist before submitting
Description
This is to fix the possible seg fault that happens when mixing barrier op with allgather. In TotalByteSizeOfAllgatherOutput function, it calculates the output size by accessing the tensor pointer of the response entry object, for ops like barrier or join whose tensor pointer is empty, horovod will seg fault.
Since the function only needs to be called when fusion is enabled, we simply skip the fusion calculation when op is join or barrier.
Added a new test for barrier that runs with multiple collectives to validate this change.
Fixes # (issue).
3308
Review process to land