Implement more asynchronous dependency handling on GPU #2963
Conversation
…y handle data dependencies of GPU operations. Return cudaEvent on GPU op completion to framework to enable asynchronous callbacks (enabled with HOROVOD_ENABLE_ASYNC_COMPLETION). Signed-off-by: Josh Romero <joshr@nvidia.com>
LGTM! Just a small question about `WaitForData`.
```cpp
@@ -195,6 +199,9 @@ Status MPI_GPUAlltoall::Execute(std::vector<TensorTableEntry>& entries, const Re
  assert(entries.size() == 1);

  gpu_op_context_.InitGPU(entries);

  WaitForData(entries);
```
Is this `InitGPU` call why `WaitForData` needs to be in the `Execute` function instead of sitting before it? Otherwise it seems we could reduce some duplication by calling it beforehand.
Yes, the `InitGPU` call is what lazily creates Horovod's internal CUDA stream, which is required for the `WaitForData` variant that uses `cudaStreamWaitEvent`. I also dislike the duplication but wasn't sure if there were better options. If you have any suggestions, I'm all ears.
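To make the ordering constraint concrete, here is a minimal sketch of the pattern (a hypothetical fragment, not Horovod's actual code; names like `horovod_stream` and `ready_event` are invented): `cudaStreamWaitEvent` can only be enqueued on a stream that already exists, which is why the `InitGPU` call must come first.

```cuda
// Illustrative sketch only -- not Horovod's actual code.
cudaStream_t horovod_stream;           // created lazily (InitGPU analog)
cudaStreamCreate(&horovod_stream);

// ready_event was recorded on the framework's stream when the input
// tensor became valid. Enqueue the dependency instead of blocking the CPU:
cudaStreamWaitEvent(horovod_stream, ready_event, 0);

// Kernels/collectives launched on horovod_stream after this point run
// only once ready_event has fired; the background thread moves on.
```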
Thanks for clarifying, I think that's fine then. Let's land it!
…in hvd.init() Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Checklist before submitting
Description
This PR introduces changes to Horovod to enable more asynchronous data dependency handling on GPU, for both input data into Horovod and output data back to the framework.
In Horovod prior to this PR, input data dependencies for the GPU are enforced via CPU synchronization on `ReadyEvent`s (i.e. `cudaEvent`s) before scheduling any GPU collective operations. While this does enforce the dependency, it stalls the Horovod background thread unnecessarily: instead of moving on to coordinate/schedule additional collective operations, the thread waits for the data to be ready on the first operation scheduled in the sequence. To free up the background thread, the CPU synchronization is replaced with calls to `cudaStreamWaitEvent`, which allows the background thread to immediately schedule dependent GPU operations, offloading the data dependency handling to the CUDA driver.

Horovod also enforces output data dependencies via CPU synchronization. Currently, Horovod spawns a completion thread that waits on a `cudaEvent` (for this discussion, I'll refer to this as the completion event) recorded after Horovod GPU operations are scheduled, which ensures that any completion callbacks to the framework are not executed until the Horovod GPU operations have completed. However, this introduces latency on the framework ops that depend on the Horovod results, because this implementation makes it impossible for the framework to schedule any of those ops until the Horovod operations have completed, exposing the full cost of the callback/setup/launch of the downstream operation. Similar to the input data dependency, this PR introduces a mechanism for asynchronous dependency handling of the output data via `cudaEvent`s and `cudaStreamWaitEvent`. Instead of having the Horovod completion thread wait on the completion event, we can pass the event through to the framework via the callback and handle it on the framework side:

- a `cudaStreamWaitEvent` call to place the dependency on the framework GPU stream, or
- `cudaEventSynchronize` on the completion event. This will be updated when an upstream MXNet feature (Add async GPU dependency Engine, apache/mxnet#20331) that can better take advantage of this mechanism is merged. I will be posting a second WIP PR to illustrate this soon.

The output data dependency handling, due to more complex interactions with the framework, is currently disabled by default and enabled via a new env var `HOROVOD_ENABLE_ASYNC_COMPLETION`. This can be changed based on feedback. Additionally, when the timeline is enabled, the input and output dependencies revert to being handled via CPU synchronization in order to provide correct timings.

Some other related, but more minor, changes in this PR:
- `ReadyEvent` replaced with a list of `ReadyEvent`s, to enable an input tensor having multiple upstream dependencies (required in the upcoming MXNet PR).
- Replaced use of `cudaEventQuery` in `WaitForEvents` with calls to `cudaEventSynchronize` in non-elastic cases, to reduce the number of CUDA API calls.
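The output-side handoff described above can be sketched as follows (a hypothetical fragment, not the PR's actual code; the stream and event names are invented, and the real logic lives in Horovod's GPU op context and the framework glue):

```cuda
// Illustrative sketch only -- not the PR's actual code.
cudaEvent_t completion_event;
cudaEventCreateWithFlags(&completion_event, cudaEventDisableTiming);

// Horovod side: record the event after the collective is enqueued, then
// hand it to the framework through the completion callback instead of
// blocking the completion thread on it.
cudaEventRecord(completion_event, horovod_stream);

// Framework side, asynchronous path (HOROVOD_ENABLE_ASYNC_COMPLETION):
// downstream ops on framework_stream are ordered after Horovod's GPU
// work with no CPU-side wait.
cudaStreamWaitEvent(framework_stream, completion_event, 0);

// Framework side, fallback path: a blocking CPU wait, equivalent to the
// pre-PR behavior, which exposes the callback/setup/launch latency of
// the downstream operation.
cudaEventSynchronize(completion_event);
```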