Implement more asynchronous dependency handling on GPU #2963
Conversation
…y handle data dependencies of GPU operations. Return cudaEvent on GPU op completion to framework to enable asynchronous callbacks (enabled with HOROVOD_ENABLE_ASYNC_COMPLETION). Signed-off-by: Josh Romero <joshr@nvidia.com>
LGTM! Just a small question about `WaitForData`.
```cpp
@@ -195,6 +199,9 @@ Status MPI_GPUAlltoall::Execute(std::vector<TensorTableEntry>& entries, const Re
  assert(entries.size() == 1);

  gpu_op_context_.InitGPU(entries);

  WaitForData(entries);
```
Is this `InitGPU` call why `WaitForData` needs to be in the `Execute` function instead of sitting before it? Otherwise it seems we could reduce some duplication by calling it beforehand.
Yes, the `InitGPU` call is what lazily creates Horovod's internal CUDA stream, which is required for the `WaitForData` variant that uses `cudaStreamWaitEvent`. I also dislike the duplication but wasn't sure if there were better options. If you have any suggestions, I'm all ears.
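To make the ordering constraint concrete, here is a minimal sketch of the pattern (a hypothetical fragment, not Horovod's actual code; names like `horovod_stream` and `ready_event` are invented): `cudaStreamWaitEvent` can only be enqueued on a stream that already exists, which is why the `InitGPU` call must come first.

```cuda
// Illustrative sketch only -- not Horovod's actual code.
cudaStream_t horovod_stream;           // created lazily (InitGPU analog)
cudaStreamCreate(&horovod_stream);

// ready_event was recorded on the framework's stream when the input
// tensor became valid. Enqueue the dependency instead of blocking the CPU:
cudaStreamWaitEvent(horovod_stream, ready_event, 0);

// Kernels/collectives launched on horovod_stream after this point run
// only once ready_event has fired; the background thread moves on.
```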
Thanks for clarifying, I think that's fine then. Let's land it!
…in hvd.init() Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Checklist before submitting
Description
This PR introduces changes to Horovod to enable more asynchronous data dependency handling on GPU, for both input data into Horovod and output data back to the framework.
In Horovod prior to this PR, input data dependencies for the GPU are enforced via CPU synchronization on `ReadyEvent`s (i.e. `cudaEvent`s) before scheduling any GPU collective operations. While this does enforce the dependency, it stalls the Horovod background thread unnecessarily: instead of moving on to coordinate/schedule additional collective operations, the thread waits for the data to be ready on the first operation scheduled in the sequence. To free up the background thread, the CPU synchronization is replaced with calls to `cudaStreamWaitEvent`, which allows the background thread to immediately schedule dependent GPU operations, offloading the data dependency handling to the CUDA driver.

Horovod also enforces output data dependencies via CPU synchronization. Currently, Horovod spawns a completion thread that waits on a `cudaEvent` (for this discussion, I'll refer to this as the completion event) recorded after Horovod GPU operations are scheduled, which ensures that any completion callbacks to the framework are not executed until the Horovod GPU operations have completed. However, this introduces latency on the framework ops that depend on the Horovod results, because this implementation makes it impossible for the framework to schedule any of those ops until the Horovod operations have completed, exposing the full cost of the callback/setup/launch of the downstream operation. Similar to the input data dependency, this PR introduces a mechanism for asynchronous dependency handling of the output data via `cudaEvent`s and `cudaStreamWaitEvent`. Instead of having the Horovod completion thread wait on the completion event, we can pass the event through to the framework via the callback and handle it on the framework side:

- a `cudaStreamWaitEvent` call to place the dependency on the framework GPU stream, or
- `cudaEventSynchronize` on the completion event. This will be updated when an upstream MXNet feature (Add async GPU dependency Engine, apache/mxnet#20331) that can better take advantage of this mechanism is merged. I will be posting a second WIP PR to illustrate this soon.

The output data dependency handling, due to more complex interactions with the framework, is currently disabled by default and enabled via a new env var `HOROVOD_ENABLE_ASYNC_COMPLETION`. This can be changed based on feedback. Additionally, when the timeline is enabled, the input and output dependencies revert to being handled via CPU synchronization in order to provide correct timings.

Some other related, but more minor, changes in this PR:
- `ReadyEvent` replaced with a list of `ReadyEvent`s, to enable an input tensor having multiple upstream dependencies (required in the upcoming MXNet PR).
- Replaced use of `cudaEventQuery` in `WaitForEvents` with calls to `cudaEventSynchronize` in non-elastic cases, to reduce the number of CUDA API calls.
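The output-side handoff described above can be sketched as follows (a hypothetical fragment, not the PR's actual code; the stream and event names are invented, and the real logic lives in Horovod's GPU op context and the framework glue):

```cuda
// Illustrative sketch only -- not the PR's actual code.
cudaEvent_t completion_event;
cudaEventCreateWithFlags(&completion_event, cudaEventDisableTiming);

// Horovod side: record the event after the collective is enqueued, then
// hand it to the framework through the completion callback instead of
// blocking the completion thread on it.
cudaEventRecord(completion_event, horovod_stream);

// Framework side, asynchronous path (HOROVOD_ENABLE_ASYNC_COMPLETION):
// downstream ops on framework_stream are ordered after Horovod's GPU
// work with no CPU-side wait.
cudaStreamWaitEvent(framework_stream, completion_event, 0);

// Framework side, fallback path: a blocking CPU wait, equivalent to the
// pre-PR behavior, which exposes the callback/setup/launch latency of
// the downstream operation.
cudaEventSynchronize(completion_event);
```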