
Implement more asynchronous dependency handling on GPU #2963

Merged
merged 3 commits into horovod:master on Jun 22, 2021

Conversation

romerojosh (Collaborator)

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This PR introduces changes to Horovod to enable more asynchronous data dependency handling on GPU, for both input data into Horovod and output data back to the framework.

In Horovod prior to this PR, input data dependencies for the GPU are enforced via CPU synchronization on ReadyEvents (i.e., cudaEvent objects) before scheduling any GPU collective operations. While this does enforce the dependency, it stalls the Horovod background thread unnecessarily: instead of moving on to coordinate/schedule additional collective operations, the thread waits for the data to be ready on the first operation scheduled in the sequence. To free up the background thread, the CPU synchronization is replaced with calls to cudaStreamWaitEvent, which allow the background thread to immediately schedule dependent GPU operations, offloading the data dependency handling to the CUDA driver.
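To make the change concrete, here is a minimal sketch (hypothetical function names; not Horovod's actual code) contrasting the two approaches, assuming a ready event recorded on the framework's stream and Horovod's internal CUDA stream:

```cpp
#include <cuda_runtime.h>

// Before: the background thread blocks on the CPU until the input
// tensor is ready, stalling coordination of further operations.
void WaitForInputBlocking(cudaEvent_t ready_event) {
  cudaEventSynchronize(ready_event);
  // ... schedule collective kernels on Horovod's stream ...
}

// After: enqueue the dependency on the stream and return immediately;
// the CUDA driver holds back subsequent work on horovod_stream until
// ready_event has fired, so the background thread can move on.
void WaitForInputAsync(cudaEvent_t ready_event, cudaStream_t horovod_stream) {
  cudaStreamWaitEvent(horovod_stream, ready_event, 0);
  // ... schedule collective kernels on horovod_stream ...
}
```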

Horovod also enforces output data dependencies via CPU synchronization. Currently, Horovod spawns a completion thread that waits on a cudaEvent (for this discussion, I'll refer to this as the completion event) recorded after Horovod GPU operations are scheduled, which ensures that any completion callbacks to the framework are not executed until the Horovod GPU operations have completed. However, this introduces latency on the framework ops dependent on the Horovod results: the framework cannot schedule any of those ops until the Horovod operations have completed, exposing the full cost of the callback/setup/launch of the downstream operation. Similar to the input data dependency, this PR introduces a mechanism for asynchronous dependency handling of the output data via cudaEvents and cudaStreamWaitEvent. Instead of having the Horovod completion thread wait on the completion event, we can pass the event through to the framework via the callback and handle it on the framework side (a sketch follows the list below):

  • For TF/PyT, the completion event is used with a cudaStreamWaitEvent call to place the dependency on the framework GPU stream.
  • For MXNet, the dependency is still enforced via a CPU sync with cudaEventSynchronize on the completion event. This will be updated once an upstream MXNet feature that can better take advantage of this mechanism (Add async GPU dependency Engine, apache/mxnet#20331) is merged; I will post a second WIP PR to illustrate this soon.
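A minimal sketch (hypothetical names; not Horovod's actual interfaces) of this flow, assuming Horovod records the completion event on its own stream and passes it through the framework callback (event cleanup omitted for brevity):

```cpp
#include <cuda_runtime.h>
#include <functional>

// Horovod side: after scheduling the collective, record a completion
// event on Horovod's stream and hand it to the framework callback
// instead of blocking a completion thread on it.
void FinalizeGpuOp(cudaStream_t horovod_stream,
                   const std::function<void(cudaEvent_t)>& callback) {
  cudaEvent_t completion_event;
  cudaEventCreateWithFlags(&completion_event, cudaEventDisableTiming);
  cudaEventRecord(completion_event, horovod_stream);
  callback(completion_event);  // no cudaEventSynchronize here
}

// Framework side (TF/PyT case): downstream ops queued on framework_stream
// wait for the collective on-device; the CPU never blocks.
void OnHorovodComplete(cudaEvent_t completion_event,
                       cudaStream_t framework_stream) {
  cudaStreamWaitEvent(framework_stream, completion_event, 0);
}

// MXNet case (until apache/mxnet#20331 lands): fall back to a CPU sync.
void OnHorovodCompleteMXNet(cudaEvent_t completion_event) {
  cudaEventSynchronize(completion_event);
}
```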

Because the output data dependency handling involves more complex interactions with the framework, it is currently disabled by default and enabled via a new env var, HOROVOD_ENABLE_ASYNC_COMPLETION; this default can be changed based on feedback. Additionally, when the timeline is enabled, the input and output dependencies revert to being handled via CPU synchronization in order to provide correct timings.

Some other related, but more minor changes in this PR:

  • ReadyEvents replaced with a list of ReadyEvent objects. This enables an input tensor to have multiple upstream dependencies (required by the upcoming MXNet PR).
  • In the non-async-completion path, spin waiting on cudaEventQuery in WaitForEvents is replaced with calls to cudaEventSynchronize in non-elastic cases to reduce the number of CUDA API calls (see the sketch after this list).
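A minimal sketch (assumed names; not Horovod's actual code) contrasting the two waiting strategies from the last bullet:

```cpp
#include <cuda_runtime.h>

// Elastic case: keep polling so the loop can also check for shutdown
// or error state between queries.
void SpinWait(cudaEvent_t event) {
  while (cudaEventQuery(event) == cudaErrorNotReady) {
    // ... check for shutdown/errors, then poll again ...
  }
}

// Non-elastic case: a single blocking call replaces the polling loop,
// issuing one CUDA API call instead of many.
void BlockingWait(cudaEvent_t event) {
  cudaEventSynchronize(event);
}
```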

…y handle data dependencies of GPU operations.

Return cudaEvent on GPU op completion to framework to enable asynchronous callbacks (enabled with HOROVOD_ENABLE_ASYNC_COMPLETION).

Signed-off-by: Josh Romero <joshr@nvidia.com>
github-actions bot commented Jun 9, 2021

Unit Test Results

783 files ±0 · 783 suites ±0 · 6h 1m 24s ⏱️ ±0s
601 tests ±0: 566 ✔️ ±0 · 35 💤 ±0 · 0 ❌ ±0
16 311 runs ±0: 12 295 ✔️ ±0 · 4 016 💤 ±0 · 0 ❌ ±0

Results for commit 3514f78. ± Comparison against base commit 3514f78.


@tgaddair (Collaborator) left a comment:
LGTM! Just a small question about WaitForData.

@@ -195,6 +199,9 @@ Status MPI_GPUAlltoall::Execute(std::vector<TensorTableEntry>& entries, const Re
   assert(entries.size() == 1);
 
   gpu_op_context_.InitGPU(entries);
 
+  WaitForData(entries);
@tgaddair (Collaborator):

Is this InitGPU call why WaitForData needs to be in the Execute function instead of sitting before it? Otherwise it seems we could reduce some duplication by calling it beforehand.

@romerojosh (Collaborator, Author):

Yes, the InitGPU call is what lazily creates Horovod's internal CUDA stream which is required for the WaitForData variant that uses cudaStreamWaitEvent. I do also dislike the duplication but wasn't sure if there were better options. If you have any suggestions, I'm all ears.

@tgaddair (Collaborator):

Thanks for clarifying, I think that's fine then. Let's land it!

@tgaddair merged commit 3514f78 into horovod:master on Jun 22, 2021
maxhgerlach added a commit to maxhgerlach/horovod that referenced this pull request Jun 23, 2021
…in hvd.init()

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
maxhgerlach added a commit to maxhgerlach/horovod that referenced this pull request Jul 6, 2021
…in hvd.init()

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
maxhgerlach added a commit to maxhgerlach/horovod that referenced this pull request Jul 7, 2021
…in hvd.init()

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>