
CUDA Graphs for Arbor

Brent Huisman edited this page Nov 25, 2020 · 1 revision

Problem Statement

  • Arbor has a lot of small CUDA kernels
  • Most individual kernels do not fill the GPU
  • Also, there is a lot of potential parallelism across these kernels

Mechanisms

  • Mechanisms are compiled from NMODL files provided by the user
    • This happens ahead of time (AOT)
  • They are added to regions at initialisation time
    • This means that there are multiple sets of mechanisms per simulation
    • These sets remain static across the run

First Target: fvm_lowered_cell_impl::integrate

  • We have the following structure

    #+begin_example cpp
    while (!done) {
        // Compute reversal potentials
        for (auto& m: revpotmechanisms_) {
            m->nrn_current(); // KERNEL
        }

        // Mark events for each mechanism
        state_->markeventsuntilafter(); // KERNEL
        // Compute new currents
        state_->updatecurrents(mechanisms_); // multiple KERNEL

        // Add current contribution from gap junctions
        state_->addgjcurrent(); // KERNEL

        // Get rid of processed elements
        state_->dropconsumedevents(); // KERNEL

        // Update integration step times.
        state_->updatedt(dtmax, tfinal); // KERNEL

        // Take samples at cell time if sample time in this step interval.
        state_->advancesamples(sampletime_, samplevalue_); // KERNEL

        // Integrate voltage by matrix solve.
        state.integrate(); // KERNEL

        // Integrate mechanism state.
        for (auto& m: mechanisms_) {
            m->nrn_state(); // KERNEL
        }

        // Update ion concentrations.
        state_->ionsinitconcentration(); // KERNEL
        for (auto& m: mechanisms_) {
            m->writeions(); // KERNEL
        }

        // Update time and test for spike threshold crossings.
        thresholdwatcher_.test(); // KERNEL
        state_->swaptimes(); // Pointer swap
    }
    #+end_example
    
  • All updates to mechanism state and current are independent

    • But accumulating ion concentrations and currents is an addition into shared arrays, so it needs to be atomic or synchronised
    • These accumulations depend on the relevant states having been zeroed first
  • Structure of update currents

    ---+- rev_pot_0 -+---+- zero_Ca -+---+- mark_events_0 --- deliver_events_0 --- current_0 -+---
       +- rev_pot_1 -+   +- zero_Na -+   +- mark_events_1 --- deliver_events_1 --- current_1 -+
       +- ...       -+   +- ...     -+   +- ...
    
    • This seems to be the most attractive target
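
The diagram above can be read as a small dependency graph. The following is a minimal host-side sketch in plain C++, not Arbor code and not the CUDA graph API: `task_graph` and `build_update_currents` are hypothetical names, and the edges assume that every `zero_*` node waits on all `rev_pot_*` nodes and every `mark_events_*` node waits on all `zero_*` nodes, which is one reading of the join points in the diagram. Nodes that end up on the same level have no dependency on each other and are candidates for concurrent execution.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A tiny host-side task graph: nodes are kernel names, edges mean "must finish before".
struct task_graph {
    std::vector<std::string> names;
    std::vector<std::vector<std::size_t>> preds;   // preds[i]: nodes that node i waits for

    std::size_t add(const std::string& name, std::vector<std::size_t> deps = {}) {
        names.push_back(name);
        preds.push_back(std::move(deps));
        return names.size() - 1;
    }

    // Level of a node = length of the longest dependency chain leading to it.
    // Nodes are always added after their dependencies, so one forward pass suffices.
    std::vector<std::size_t> levels() const {
        std::vector<std::size_t> lvl(names.size(), 0);
        for (std::size_t i = 0; i < names.size(); ++i) {
            for (auto p: preds[i]) {
                if (lvl[p] + 1 > lvl[i]) lvl[i] = lvl[p] + 1;
            }
        }
        return lvl;
    }
};

// Build the update-currents structure from the diagram (assumed edge layout).
inline task_graph build_update_currents(std::size_t n_revpot,
                                        const std::vector<std::string>& ions,
                                        std::size_t n_mech) {
    task_graph g;
    std::vector<std::size_t> rev_pot, zero;
    for (std::size_t i = 0; i < n_revpot; ++i) {
        rev_pot.push_back(g.add("rev_pot_" + std::to_string(i)));
    }
    for (const auto& ion: ions) {
        zero.push_back(g.add("zero_" + ion, rev_pot));       // join on all rev_pot_*
    }
    for (std::size_t i = 0; i < n_mech; ++i) {
        const auto s = std::to_string(i);
        const auto mark    = g.add("mark_events_" + s, zero); // join on all zero_*
        const auto deliver = g.add("deliver_events_" + s, {mark});
        g.add("current_" + s, {deliver});
    }
    return g;
}
```

With two reversal potential mechanisms, two ions and two mechanisms this gives five levels, and only the per-mechanism chains `mark_events_i`, `deliver_events_i`, `current_i` are strictly sequential.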
  • We have to adjust three kernels

    • nrn_current
    #+begin_example cpp
    __global__ void nrn_current(mechanismgpuhhpp_ params_) {
        // ...
        // Write to global currents, needs to be atomic adds, if parallel
        ik = gk*(v-ek);
        il = params_.gl[tid_]*(v-params_.el[tid_]);
        // ...
    }
    #+end_example
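
The atomic-add requirement noted in the kernel comment can be illustrated on the host. This is a rough sketch in standard C++ under stated assumptions, not the Arbor kernel: `contribution`, `atomic_add` and `assemble_currents` are hypothetical names, and the loop over mechanisms runs sequentially here. The point is only the ordering (zero first, then accumulate) and that each individual add would stay correct if the mechanisms ran concurrently; in device code `atomicAdd` would play the role of `atomic_add`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one mechanism's output: a current value for each
// control volume (CV) index that the mechanism touches.
struct contribution {
    std::vector<std::size_t> cv;   // target CV index per entry
    std::vector<double> current;   // current to add at that CV
};

// Add x to target with a compare-and-swap loop, so that concurrent callers
// never lose an update.
inline void atomic_add(std::atomic<double>& target, double x) {
    double seen = target.load();
    while (!target.compare_exchange_weak(seen, seen + x)) {
        // On failure, seen holds the latest value; retry with it.
    }
}

// Zero the accumulator, then fold in every mechanism's contribution.
inline std::vector<double> assemble_currents(std::size_t n_cv,
                                             const std::vector<contribution>& mechs) {
    std::vector<std::atomic<double>> total(n_cv);
    for (auto& t: total) t.store(0.0);            // the "zero" step comes first

    for (const auto& m: mechs) {                  // sequential here, parallel in principle
        for (std::size_t i = 0; i < m.cv.size(); ++i) {
            atomic_add(total[m.cv[i]], m.current[i]);
        }
    }

    std::vector<double> out(n_cv);
    for (std::size_t i = 0; i < n_cv; ++i) out[i] = total[i].load();
    return out;
}
```

Two mechanisms that both write to the same CV then produce the sum of their contributions regardless of the order in which the adds land.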
    
    • nrn_state
    #+begin_example cpp
    __global__ void nrn_state(mechanismgpuhhpp_ params_) {
        // Update params_
    }
    #+end_example
    
    • deliver_events
    #+begin_example cpp
    __global__ void deliver_events(int mechid_,
                                   mechanismgpuexp2synpp_ params_,
                                   deliverableeventstreamstate events) {
        // Consume events
    }
    #+end_example
    
    • The kernel arguments will change between calls, so they cannot be fixed once when the graph is built
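
One way to think about changing arguments is a graph whose topology is built once while each node's arguments are patched before a launch. The sketch below models that idea on the host in plain C++; `deliver_args` and `instantiated_graph` are hypothetical names and nothing here is Arbor or CUDA code. The CUDA runtime provides `cudaGraphExecKernelNodeSetParams` for updating a kernel node of an instantiated graph; whether that call fits Arbor's needs is left open here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical argument pack for a deliver_events-like kernel.
struct deliver_args {
    int mech_id;
    std::size_t n_events;   // number of events to consume in this step
};

// A graph "instantiated" once: a fixed list of nodes, each holding its own arguments.
class instantiated_graph {
public:
    using kernel = std::function<void(const deliver_args&)>;

    std::size_t add_node(kernel k, deliver_args a) {
        nodes_.push_back({std::move(k), a});
        return nodes_.size() - 1;
    }

    // Patch one node's arguments in place; the set of nodes is untouched.
    void set_args(std::size_t node, deliver_args a) { nodes_.at(node).args = a; }

    // Replay all nodes with whatever arguments they currently hold.
    void launch() const {
        for (const auto& n: nodes_) n.fn(n.args);
    }

private:
    struct node {
        kernel fn;
        deliver_args args;
    };
    std::vector<node> nodes_;
};
```

A caller would build the graph at initialisation, then call `set_args` followed by `launch` in every time step, so that only the argument values change and not the graph structure.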