fix barrier seg fault and added test to mix it with multiple collectives #3313

Merged
merged 2 commits into from Dec 14, 2021

Conversation

Tixxx
Collaborator

@Tixxx Tixxx commented Dec 13, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This fixes a possible segmentation fault that occurs when the barrier op is mixed with allgather. The TotalByteSizeOfAllgatherOutput function calculates the output size by accessing the tensor pointer of the response entry object. For ops such as barrier or join, whose tensor pointer is empty, Horovod segfaults.
Since the function only needs to be called when fusion is enabled, we simply skip the fusion calculation when the op is join or barrier.
Added a new test for barrier that runs it together with multiple collectives to validate this change.
Fixes #3308.
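
To illustrate the failure mode described above, here is a minimal, self-contained sketch. The types `Tensor` and `Entry` and the function `ElementsContributed` are hypothetical stand-ins, not Horovod's actual classes; the only assumption carried over from the description is that barrier and join entries hold an empty tensor pointer.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Horovod's real types.
enum class ResponseType { ALLGATHER, BARRIER, JOIN };

struct Tensor {
  std::vector<int64_t> shape;
};

struct Entry {
  ResponseType type;
  std::shared_ptr<Tensor> tensor;  // empty for BARRIER and JOIN
};

// True for ops that carry no tensor and must be kept out of size calculations.
bool CarriesNoTensor(ResponseType type) {
  return type == ResponseType::BARRIER || type == ResponseType::JOIN;
}

// Number of elements the entry contributes to a fused buffer.
// Returns 0 for tensor-less ops instead of dereferencing a null pointer.
int64_t ElementsContributed(const Entry& entry) {
  if (CarriesNoTensor(entry.type) || entry.tensor == nullptr) {
    return 0;  // Reading entry.tensor->shape here would be undefined behavior.
  }
  int64_t count = 1;
  for (int64_t dim : entry.tensor->shape) {
    count *= dim;
  }
  return count;
}
```

The guard runs before any access to the tensor, which is the same ordering the fix relies on: the op type is checked first, and the size calculation is never reached for barrier or join.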

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@Tixxx Tixxx linked an issue Dec 13, 2021 that may be closed by this pull request
Collaborator

@maxhgerlach maxhgerlach left a comment

Thanks for looking into this! Just a couple of questions.

@@ -893,6 +893,10 @@ void Controller::FuseResponses(std::deque<Response>& responses,
        while (!responses.empty()) {

          auto& new_response = responses.front();
          if (new_response.response_type() == Response::ResponseType::BARRIER ||
              new_response.response_type() == Response::ResponseType::JOIN) {
            break;
Collaborator

Would it be safer to have a continue here, rather than a break?

Collaborator Author

@Tixxx Tixxx Dec 13, 2021

Actually, I think break is safer here than continue, since continue would keep the fusion logic going. Once we see a barrier response, we have reached the end of a control block, so we don't want to fuse any responses that come after the barrier. Let me know if this makes sense.
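
A rough sketch of the behavior argued for here, using a simplified queue. `Response` and `FuseUntilControlOp` are hypothetical names for illustration, and "fusing" is reduced to popping and counting; this does not reproduce the real `FuseResponses` loop.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical, simplified model; not Horovod's actual Response class.
enum class ResponseType { ALLGATHER, BARRIER, JOIN };

struct Response {
  ResponseType type;
};

// Fuses responses from the front of the queue and returns how many were fused.
// Stops at the first BARRIER or JOIN, leaving it and everything behind it queued.
std::size_t FuseUntilControlOp(std::deque<Response>& responses) {
  std::size_t fused = 0;
  while (!responses.empty()) {
    const Response& next = responses.front();
    if (next.type == ResponseType::BARRIER || next.type == ResponseType::JOIN) {
      // `break` ends fusion at the control op. In this sketch a bare
      // `continue` would re-read the same front element forever, because
      // nothing has been popped at this point.
      break;
    }
    responses.pop_front();  // "Fuse" the response by consuming it.
    ++fused;
  }
  return fused;
}
```

With a queue of allgather, allgather, barrier, allgather, the function fuses the first two responses and leaves the barrier and the trailing allgather in place, so nothing is fused across the barrier.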

Collaborator

Good point, and this applies to join as well.

This loop is specific to allgather. Would it make sense to break the fusion loop in the same way for allreduce and adasum? (Even if it's not necessary there to shield us from accessing invalid pointers)

Collaborator Author

The logic for determining the output size of allreduce operations is fairly straightforward: it directly uses the tensor_sizes field in the response object, which is safe. Allgather is a special case, since we need to inspect each dimension, so it needs a reference to the tensor itself.
We need to revisit this logic once we support fusion for other ops.
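
The contrast between the two size calculations can be sketched roughly as follows. Both functions are simplified illustrations with hypothetical names and count elements rather than bytes; the assumption is that the allgather sizes are per-rank first-dimension lengths, while the remaining dimensions come from the tensor's shape.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Allreduce-style: the total is just the sum of sizes already stored in the
// response, so no tensor needs to be touched.
int64_t TotalElementsFromSizes(const std::vector<int64_t>& tensor_sizes) {
  return std::accumulate(tensor_sizes.begin(), tensor_sizes.end(), int64_t{0});
}

// Allgather-style: ranks may differ in the first dimension, so the output has
// (sum of first dimensions) rows, each with the product of the remaining
// dimensions. The remaining dimensions require the tensor's shape.
int64_t AllgatherOutputElements(const std::vector<int64_t>& first_dim_per_rank,
                                const std::vector<int64_t>& shape) {
  int64_t rows = std::accumulate(first_dim_per_rank.begin(),
                                 first_dim_per_rank.end(), int64_t{0});
  int64_t row_elements = 1;
  for (std::size_t i = 1; i < shape.size(); ++i) {
    row_elements *= shape[i];
  }
  return rows * row_elements;
}
```

The second function is the one that needs a valid shape, which is why only the allgather path is exposed to an empty tensor pointer in this model.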

@github-actions

github-actions bot commented Dec 13, 2021

Unit Test Results

     786 files  +  28       786 suites  +28   8h 9m 43s ⏱️ - 15m 33s
     717 tests +    1       668 ✔️ +    1       49 💤 ±    0  0 ±0 
17 022 runs  +700  11 908 ✔️ +442  5 114 💤 +258  0 ±0 

Results for commit cac7df3. ± Comparison against base commit df18797.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Dec 13, 2021

Unit Test Results (with flaky tests)

     930 files  ±    0       930 suites  ±0   9h 45m 28s ⏱️ + 17m 0s
     717 tests +    1       667 ✔️ +  3       49 💤 ±  0  1  - 2 
20 228 runs   - 106  13 862 ✔️  - 85  6 362 💤  - 20  4  - 1 

For more details on these failures, see this check.

Results for commit cac7df3. ± Comparison against base commit df18797.

♻️ This comment has been updated with latest results.

@EnricoMi
Collaborator

Yay, this fixes the macOS issues in #3301.

Signed-off-by: TJ <tix@uber.com>
@Tixxx Tixxx requested a review from tgaddair December 14, 2021 02:16
@Tixxx
Collaborator Author

Tixxx commented Dec 14, 2021

Head test failures seem to be related to this

@EnricoMi
Collaborator

Yeah, the head issues should be fine to ignore here. @maxhgerlach are you happy to approve?

Collaborator

@maxhgerlach maxhgerlach left a comment

LGTM! Thanks for going over my questions.

I've added an extra comment/question, but it's not urgent.


@maxhgerlach maxhgerlach merged commit 7bb5bde into master Dec 14, 2021
tkhanna1996 pushed a commit to tkhanna1996/horovod that referenced this pull request Dec 16, 2021
maxhgerlach added a commit to maxhgerlach/horovod that referenced this pull request Dec 17, 2021
…llowing horovod#3300, horovod#3313

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Development

Successfully merging this pull request may close these issues.

Segmentation fault with hvd.barrier
3 participants