CUDA Graphs #14

Open
ax3l opened this issue Aug 5, 2019 · 7 comments

ax3l (Member) commented Aug 5, 2019

From the latest information on CUDA Graphs, the following rules of thumb apply:

  • Always use CUDA Graphs to launch kernels: it will always be at least as fast as launching the kernels without a task graph, even if you only schedule one or a few kernels.
  • When starting an app with a new GPU context, allocate and free a large task graph first (e.g. the largest you expect to occur at runtime); this reduces initial latencies.
  • When tracing the performance of graphs, they are internally batched again in groups of ~8 tasks. One does not need to optimize for that; it just becomes visible in the traces.

Official Docs: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
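
For illustration, a minimal sketch of launching even a single kernel through a graph via stream capture (the kernel, launch sizes, and function name are placeholders; the CUDA 10/11 `cudaGraphInstantiate` signature is assumed):

```cpp
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

// Launch a (possibly single-kernel) piece of work through a CUDA graph.
// Per the first rule of thumb this should be at least as fast as a plain
// kernel launch; per the second rule, doing this once with a large graph
// right after context creation warms up the graph machinery and reduces
// initial latencies.
void launchViaGraph(cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // record the kernel launch into a graph instead of executing it directly
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dummyKernel<<<1, 1, 0, stream>>>();
    cudaStreamEndCapture(stream, &graph);

    // instantiate and launch the recorded graph (CUDA 10/11 signature)
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
```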

michaelsippel added this to To Do in Release 0.2.0 on Apr 18, 2020
michaelsippel (Member) commented

Currently I see two ways of using CUDA Graphs in combination with redGrapes (rough sketches of both follow after this list):

  • For static, GPU-only tasks, create the graph manually with the CUDA API and submit it inside a redGrapes task that gets handled by the cuda-scheduler, which in turn allocates a CUDA stream at the time it can be submitted. Generally, this would be exactly the same as with any async CUDA call.

  • Capture all CUDA calls with graph capture until a host task appears that creates a dependency; then the graph capture is ended and submitted.
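
For illustration, rough sketches of both options (hypothetical kernels, redGrapes integration omitted; the stream would come from the cuda-scheduler):

```cpp
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}

// (1) Static, GPU-only task graph built explicitly with the CUDA graph API.
//     The resulting executable graph would later be launched from inside a
//     redGrapes task on whichever stream the cuda-scheduler hands out.
cudaGraphExec_t buildStaticGraph()
{
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaKernelNodeParams params{};
    params.func = (void*)kernelA;
    params.gridDim = dim3(1);
    params.blockDim = dim3(1);
    params.sharedMemBytes = 0;
    params.kernelParams = nullptr;   // kernels here take no arguments
    params.extra = nullptr;

    cudaGraphNode_t nodeA, nodeB;
    cudaGraphAddKernelNode(&nodeA, graph, nullptr, 0, &params);
    params.func = (void*)kernelB;
    cudaGraphAddKernelNode(&nodeB, graph, &nodeA, 1, &params);   // B depends on A

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);     // CUDA 10/11 signature
    cudaGraphDestroy(graph);
    return exec;
    // later, inside the redGrapes task: cudaGraphLaunch(exec, streamFromCudaScheduler);
}

// (2) Capture all CUDA calls until a host task introduces a dependency,
//     then end the capture and submit the recorded graph.
void captureUntilHostDependency(cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<1, 1, 0, stream>>>();
    kernelB<<<1, 1, 0, stream>>>();
    // ... a host task becomes ready -> stop recording and submit
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);
    // (cleanup of graph/exec once the work has completed is omitted here)
}
```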

sbastrakov (Member) commented Jul 2, 2020

@michaelsippel do you prefer one way over the other, or what are your thoughts on this matter generally?

michaelsippel (Member) commented

From the CUDA docs it seems that it should be possible to call cudaGraphLaunch() while capturing a graph, which means that both approaches could be used simultaneously. The question is rather whether this is useful at all, because we would always create a new CUDA graph and never reuse it. I don't really believe that it would be faster to wrap every CUDA call in a CUDA graph, because it would delay the submission of CUDA operations and also introduce more runtime overhead. As I understand it, you only really benefit if you use the CUDA graph as a template and resubmit the same graph multiple times, because that saves the CPU time to create the graph. If you want to do that, it can be done manually, but I don't know how to extract a CUDA-graph template out of a dynamic task graph with mixed host and device tasks.
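
For illustration, a minimal sketch of the "graph as template" reuse described above (placeholder kernels and function name): the graph is captured and instantiated once, and only cudaGraphLaunch() is paid per iteration.

```cpp
#include <cuda_runtime.h>

__global__ void step() {}

void runManyIterations(cudaStream_t stream, int iterations)
{
    cudaGraph_t graph;
    cudaGraphExec_t exec;

    // build the template once
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step<<<1, 1, 0, stream>>>();
    step<<<1, 1, 0, stream>>>();
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);   // CUDA 10/11 signature

    // reuse it: no per-iteration graph construction on the CPU
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(exec, stream);

    cudaStreamSynchronize(stream);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```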

psychocoderHPC (Member) commented

> From the CUDA docs it seems that it should be possible to call cudaGraphLaunch() while capturing a graph, which means that both approaches could be used simultaneously. The question is rather whether this is useful at all, because we would always create a new CUDA graph and never reuse it. I don't really believe that it would be faster to wrap every CUDA call in a CUDA graph, because it would delay the submission of CUDA operations and also introduce more runtime overhead. As I understand it, you only really benefit if you use the CUDA graph as a template and resubmit the same graph multiple times, because that saves the CPU time to create the graph. If you want to do that, it can be done manually, but I don't know how to extract a CUDA-graph template out of a dynamic task graph with mixed host and device tasks.

It is definitely useful if, for example, we always enqueue an event behind the task to track the state of the operation.
IMO it will also be useful if you know that you have more than two tasks ready to enqueue.

michaelsippel (Member) commented

> It is definitely useful if, for example, we always enqueue an event behind the task to track the state of the operation.

You mean, because we have to do this anyway, we always have at least two CUDA operations (the actual call and the eventRecord) which can be recorded into the graph.

> IMO it will also be useful if you know that you have more than two tasks ready to enqueue.

Yes, that's what I meant with the second bullet. I will post an image that explains how I imagine this to work. But I don't yet believe that it will gain any performance.
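
For illustration, a minimal sketch of the per-task pair of CUDA calls and the later poll (the GpuTaskState struct and function names are hypothetical, not the redGrapes API):

```cpp
#include <cuda_runtime.h>

__global__ void taskKernel() {}

struct GpuTaskState { cudaEvent_t done; };

// dispatching one CUDA task costs two API calls: the actual launch plus the
// eventRecord that tracks the state of the operation
GpuTaskState dispatch(cudaStream_t stream)
{
    GpuTaskState state;
    cudaEventCreateWithFlags(&state.done, cudaEventDisableTiming);
    taskKernel<<<1, 1, 0, stream>>>();       // the actual call
    cudaEventRecord(state.done, stream);     // second call: marks task completion
    return state;
}

// non-blocking completion check, as a scheduler poll would do
bool poll(GpuTaskState const& state)
{
    return cudaEventQuery(state.done) == cudaSuccess;
}
```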

michaelsippel (Member) commented Jul 3, 2020

[Image: redGrapesCudaGraph]


 1. mgr.finish_task( Host Task 1 )
   1.1. scheduling_graph.task_end( Host Task 1 )
     1.1.1. scheduling_graph.reach_event( Event 2 )


   1.2. mgr.activate_followers( Host Task 1 )
     1.2.1. scheduler.activate_task( Host Task 2 )
         ...

     1.2.2. scheduler.activate_task( CUDA Task 3 )
       1.2.2.1. cuda_scheduler.activate_task( CUDA Task 3 )

         1.2.2.1.1. scheduling_graph.is_task_ready( CUDA Task 3 ) == true
         1.2.2.1.1.1. scheduling_graph.is_event_reached( Event 5 ) == true

         1.2.2.1.2. cuda_scheduler.is_recording == false
         1.2.2.1.2.1. cudaStreamBeginCapture()

         1.2.2.1.2.2. cuda_scheduler.dispatch_task( CUDA Task 3 )
           1.2.2.1.2.2.1. current_stream = select next cuda stream

           1.2.2.1.2.2.2. mgr.run_task( CUDA Task 3 )
            1.2.2.1.2.2.2.1. scheduling_graph.reach_event( Event 5 )

           1.2.2.1.2.2.3. cuda_event_1 = cudaEventRecord()

           1.2.2.1.2.2.4. Add cuda event as cuda-dependency in following cuda-tasks.
           1.2.2.1.2.2.4.1. [CUDA Task 4].properties.cuda_dependencies[ current_stream ] = cuda_event_1


           1.2.2.1.2.2.5. mgr.activate_followers( CUDA Task 3 )
             1.2.2.1.2.2.5.1. scheduler.activate_task( CUDA Task 4 )
             1.2.2.1.2.2.5.1.1. cuda_scheduler.activate_task( CUDA Task 4 )
               1.2.2.1.2.2.5.1.1.1. scheduling_graph.is_task_ready( CUDA Task 4 ) == true
               1.2.2.1.2.2.5.1.1.1.1. scheduling_graph.is_event_reached( Event 7 ) == true

               1.2.2.1.2.2.5.1.1.2. cuda_scheduler.is_recording == true
                1.2.2.1.2.2.5.1.1.2.1. cuda_scheduler.dispatch_task( CUDA Task 4 )
                 1.2.2.1.2.2.5.1.1.2.1.1. current_stream = select next cuda stream

                  1.2.2.1.2.2.5.1.1.2.1.2. cudaStreamWaitEvent( current_stream, cuda_event_1 )

                  1.2.2.1.2.2.5.1.1.2.1.3. mgr.run_task( CUDA Task 4 )
                  1.2.2.1.2.2.5.1.1.2.1.3.1. scheduling_graph.reach_event( Event 7 )

                 1.2.2.1.2.2.5.1.1.2.1.4. cuda_event_2 = cudaEventRecord()

                 1.2.2.1.2.2.5.1.1.2.1.5. Add cuda event as cuda-dependency in following cuda-tasks.
                 1.2.2.1.2.2.5.1.1.2.1.5.1. [CUDA Task 7].properties.cuda_dependencies[ current_stream ] = cuda_event_2

                 1.2.2.1.2.2.5.1.1.2.1.6. mgr.activate_followers( CUDA Task 4 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1. scheduler.activate_task( Host Task 6 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1. default_scheduler.activate_task( Host Task 6 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1.1. scheduling_graph.is_task_ready( Host Task 6 ) == false
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1.1.1. scheduling_graph.is_event_reached( Event 11 ) == false

                   1.2.2.1.2.2.5.1.1.2.1.6.2. scheduler.activate_task( CUDA Task 7 )
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1. cuda_scheduler.activate_task( CUDA Task 7 )
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1.1. scheduling_graph.is_task_ready( CUDA Task 7 ) == false
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1.1.1. scheduling_graph.is_event_reached( Event 13 ) == false

             1.2.2.1.2.2.5.2. scheduler.activate_task( Host Task 5 )
             1.2.2.1.2.2.5.2.1. default_scheduler.activate_task( Host Task 5 )
               1.2.2.1.2.2.5.2.1.1. scheduling_graph.is_task_ready( Host Task 5 ) == false
               1.2.2.1.2.2.5.2.1.1.1. scheduling_graph.is_event_reached( Event 9 ) == false

         1.2.2.1.2.3. cudaStreamEndCapture()
         1.2.2.1.2.4. cudaGraphLaunch()

 2. cuda_scheduler.poll()
   2.1. cudaEventQuery() == cudaSuccess
   2.1.1. mgr.task_finish( CUDA Task 3 )
     2.1.1.1. scheduling_graph.task_end( CUDA Task 3 )
     2.1.1.1.1. scheduling_graph.reach_event( Event 6 )

     2.1.1.2. mgr.activate_followers( CUDA Task 3 )
       2.1.1.2.1. scheduler.activate_task( CUDA Task 4 )
       2.1.1.2.1.1. cuda_scheduler.activate_task( CUDA Task 4 )
         2.1.1.2.1.1.1. CUDA Task 4 is already submitted

       2.1.1.2.2. scheduler.activate_task( Host Task 5 )
       2.1.1.2.2.1. default_scheduler.activate_task( Host Task 5 )
         2.1.1.2.2.1.1. scheduling_graph.is_task_ready( Host Task 5 ) == true
         2.1.1.2.2.1.1.1. scheduling_graph.is_event_reached( Event 9 ) == true

         2.1.1.2.2.1.2. run Host Task 5 ...
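
For illustration, a rough sketch of the CUDA calls issued along the trace above (kernel and function names are hypothetical, not the actual cuda_scheduler code; for polling it records an extra event after the graph launch rather than querying the captured per-task events):

```cpp
#include <cuda_runtime.h>

__global__ void cudaTask3() {}
__global__ void cudaTask4() {}

void captureAndSubmit(cudaStream_t stream)
{
    // cuda_scheduler.is_recording == false -> cudaStreamBeginCapture()
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // dispatch CUDA Task 3: run it, then record cuda_event_1 for its followers
    cudaEvent_t cuda_event_1;
    cudaEventCreateWithFlags(&cuda_event_1, cudaEventDisableTiming);
    cudaTask3<<<1, 1, 0, stream>>>();
    cudaEventRecord(cuda_event_1, stream);

    // dispatch CUDA Task 4 while is_recording == true: wait on the
    // predecessor's event (this creates the edge inside the captured graph)
    cudaStreamWaitEvent(stream, cuda_event_1, 0);
    cudaTask4<<<1, 1, 0, stream>>>();

    // no further CUDA task ready -> cudaStreamEndCapture() + cudaGraphLaunch()
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);   // CUDA 10/11 signature
    cudaGraphLaunch(exec, stream);

    // cuda_scheduler.poll(): non-blocking check whether the submitted work is done
    cudaEvent_t graph_done;
    cudaEventCreateWithFlags(&graph_done, cudaEventDisableTiming);
    cudaEventRecord(graph_done, stream);
    while (cudaEventQuery(graph_done) != cudaSuccess) { /* do other scheduling work */ }

    cudaEventDestroy(graph_done);
    cudaEventDestroy(cuda_event_1);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```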

psychocoderHPC (Member) commented Jul 3, 2020

> You mean, because we have to do this anyway, we always have at least two CUDA operations (the actual call and the eventRecord) which can be recorded into the graph.

Yes, in that case we have two CUDA API calls per task.

> But I don't yet believe that it will gain any performance.

I'm not sure whether by performance you mean latency. We need to reduce the latency: in PIConGPU we have over 2k API calls per second, and each call takes around 10 µs, so we waste a lot of time blocking the CPU in the CUDA API. The slowdown is not the full ~20 ms per second, because most API calls are performed in parallel to the execution of tasks.
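
For scale, with the numbers above: 2,000 calls/s × 10 µs/call ≈ 20 ms of CPU time per second spent inside the CUDA API if the calls were fully serialized. Recording many launches into a single graph would replace most of those calls with one cudaGraphLaunch() on the critical path.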
