CUDA Graphs for Arbor

Problem Statement

Arbor launches a lot of small CUDA kernels.
Most individual kernels are too small to fill the GPU on their own.
Many of these kernels are independent of one another, so there is a lot of unused potential parallelism between them.

Mechanisms

Mechanisms are compiled from NMODL files provided by the user
- this happens ahead of time (AOT)
These are added to regions at initialisation time
- this means that there are multiple sets per simulation
- these remain static across the run
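
Because the mechanism sets stay fixed after initialisation, the sequence of kernel launches in a time step should be the same every step, which is what makes recording it once and replaying it plausible. The sketch below illustrates that idea with CUDA stream capture. The kernel `bump` and the function `run_captured_steps` are hypothetical placeholders rather than Arbor code, and it assumes CUDA 11.4 or newer for `cudaGraphInstantiateWithFlags`.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Throw with the CUDA error string if a runtime call fails.
inline void check(cudaError_t e) {
    if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
}

// Hypothetical stand-in for one small mechanism kernel: adds `inc` to every entry.
__global__ void bump(float* x, float inc, int n) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < n) x[tid] += inc;
}

// Record `n_kernels` launches once, then replay the recording `n_steps` times.
std::vector<float> run_captured_steps(int n, int n_kernels, int n_steps) {
    cudaStream_t stream;
    check(cudaStreamCreate(&stream));

    float* x = nullptr;
    check(cudaMalloc(&x, n*sizeof(float)));
    check(cudaMemsetAsync(x, 0, n*sizeof(float), stream));

    // Capture: launches are recorded into `graph`, not executed.
    cudaGraph_t graph;
    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < n_kernels; ++k) {
        bump<<<(n + 127)/128, 128, 0, stream>>>(x, 1.0f, n);
    }
    check(cudaStreamEndCapture(stream, &graph));

    cudaGraphExec_t exec;
    check(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    // Replay: one launch call per step instead of n_kernels separate launches.
    for (int s = 0; s < n_steps; ++s) {
        check(cudaGraphLaunch(exec, stream));
    }
    check(cudaStreamSynchronize(stream));

    std::vector<float> host(n);
    check(cudaMemcpy(host.data(), x, n*sizeof(float), cudaMemcpyDeviceToHost));

    check(cudaGraphExecDestroy(exec));
    check(cudaGraphDestroy(graph));
    check(cudaFree(x));
    check(cudaStreamDestroy(stream));
    return host;
}
```

Calling `run_captured_steps(256, 3, 4)` replays three recorded launches four times, so every entry is incremented twelve times. Capture only records what was launched, so it fits best where the launch sequence really is identical from step to step.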

First Target: `fvm_lowered_cell_impl::integrate`

We have the following structure:
```cpp
while (!done) {
    // Compute reversal potentials
    for (auto& m: revpot_mechanisms_) {
        m->nrn_current(); // KERNEL
    }

    // Mark events for each mechanism
    state_->mark_events_until_after(); // KERNEL
    // Compute new currents
    state_->update_currents(mechanisms_); // multiple KERNEL

    // Add current contribution from gap_junctions
    state_->add_gj_current(); // KERNEL

    // Get rid of processed elements
    state_->drop_consumed_events(); // KERNEL

    // Update integration step times.
    state_->updatedt(dt_max, tfinal); // KERNEL

    // Take samples at cell time if sample time in this step interval.
    state_->advance_samples(sample_time_, sample_value_); // KERNEL

    // Integrate voltage by matrix solve.
    state_->integrate(); // KERNEL

    // Integrate mechanism state.
    for (auto& m: mechanisms_) {
        m->nrn_state(); // KERNEL
    }

    // Update ion concentrations.
    state_->ions_init_concentration(); // KERNEL
    for (auto& m: mechanisms_) {
        m->write_ions(); // KERNEL
    }

    // Update time and test for spike threshold crossings.
    threshold_watcher_.test(); // KERNEL
    state_->swap_times(); // Pointer swap
}
```
All updates to mechanism state and current are independent
- But assembling ion concentrations and currents is an addition into shared arrays, so it needs to be atomic or otherwise synchronised
- These additions depend on the relevant states having been zeroed first
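
To make the constraint concrete, here is a minimal sketch of two independent current kernels accumulating into one shared array after it has been zeroed. Everything in it is hypothetical (`zero`, `add_current`, `accumulate_currents`, and the simple `g*(v - e)` current model). It uses `float` so that `atomicAdd` is available on every architecture; `double` would need compute capability 6.0 or newer.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Throw with the CUDA error string if a runtime call fails.
inline void check(cudaError_t e) {
    if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
}

// Reset the accumulator; every contribution must be ordered after this.
__global__ void zero(float* x, int n) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < n) x[tid] = 0.0f;
}

// Hypothetical mechanism current i = g*(v - e), added to the shared total.
// The add is atomic because another kernel may update the same entry concurrently.
__global__ void add_current(float* i_total, const float* v, float g, float e, int n) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < n) atomicAdd(&i_total[tid], g*(v[tid] - e));
}

std::vector<float> accumulate_currents(int n) {
    float *v = nullptr, *i_total = nullptr;
    check(cudaMalloc(&v, n*sizeof(float)));
    check(cudaMalloc(&i_total, n*sizeof(float)));

    std::vector<float> v_host(n, -65.0f);
    check(cudaMemcpy(v, v_host.data(), n*sizeof(float), cudaMemcpyHostToDevice));
    check(cudaDeviceSynchronize());

    cudaStream_t s0, s1;
    check(cudaStreamCreate(&s0));
    check(cudaStreamCreate(&s1));
    cudaEvent_t zeroed;
    check(cudaEventCreateWithFlags(&zeroed, cudaEventDisableTiming));

    int blocks = (n + 127)/128;

    // Zero on s0, and make s1 wait until the zeroing has finished.
    zero<<<blocks, 128, 0, s0>>>(i_total, n);
    check(cudaEventRecord(zeroed, s0));
    check(cudaStreamWaitEvent(s1, zeroed, 0));

    // The two contributions may now run concurrently on different streams.
    add_current<<<blocks, 128, 0, s0>>>(i_total, v, 2.0f, -77.0f, n);
    add_current<<<blocks, 128, 0, s1>>>(i_total, v, 0.5f, 50.0f, n);

    check(cudaStreamSynchronize(s0));
    check(cudaStreamSynchronize(s1));

    std::vector<float> host(n);
    check(cudaMemcpy(host.data(), i_total, n*sizeof(float), cudaMemcpyDeviceToHost));

    check(cudaEventDestroy(zeroed));
    check(cudaStreamDestroy(s0));
    check(cudaStreamDestroy(s1));
    check(cudaFree(v));
    check(cudaFree(i_total));
    return host;
}
```

With `v = -65`, the two contributions are `2*(-65 + 77) = 24` and `0.5*(-65 - 50) = -57.5`, so each entry ends at `-33.5` regardless of which kernel adds first. Without the event, the kernel on `s1` could add its contribution before the array had been zeroed.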

Structure of update currents

---+- rev_pot_0 -+---+- zero_Ca -+---+- mark_events_0 --- deliver_events_0 --- current_0 -+---
   +- rev_pot_1 -+   +- zero_Na -+   +- mark_events_1 --- deliver_events_1 --- current_1 -+
   +- ...       -+   +- ...     -+   +- ...

This seems to be the most attractive target
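
The diagram maps fairly directly onto the explicit CUDA Graphs API: each box becomes a kernel node and each join becomes a dependency list. The sketch below builds that shape for two mechanisms and two ions. It uses a single placeholder kernel `advance` in place of the real kernels and empty nodes for the joins. All names (`advance`, `add_advance`, `add_join`, `run_update_currents_graph`) are illustrative assumptions, and it assumes CUDA 11.4 or newer for `cudaGraphInstantiateWithFlags`.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

using nodes = std::vector<cudaGraphNode_t>;

// Throw with the CUDA error string if a runtime call fails.
inline void check(cudaError_t e) {
    if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
}

// Placeholder kernel: moves *slot from `expect` to `expect + 1`.
// If a node runs out of order, its slot is left unchanged, which exposes the error.
__global__ void advance(int* slot, int expect) {
    if (*slot == expect) *slot = expect + 1;
}

// Add one kernel node that runs after all nodes in `deps`.
cudaGraphNode_t add_advance(cudaGraph_t g, const nodes& deps, int* slot, int expect) {
    void* args[] = {&slot, &expect};   // argument values are copied when the node is added
    cudaKernelNodeParams p = {};
    p.func = (void*)advance;
    p.gridDim = dim3(1);
    p.blockDim = dim3(1);
    p.sharedMemBytes = 0;
    p.kernelParams = args;
    p.extra = nullptr;
    cudaGraphNode_t node;
    check(cudaGraphAddKernelNode(&node, g, deps.data(), deps.size(), &p));
    return node;
}

// Add an empty node that acts as a join point for `deps`.
cudaGraphNode_t add_join(cudaGraph_t g, const nodes& deps) {
    cudaGraphNode_t node;
    check(cudaGraphAddEmptyNode(&node, g, deps.data(), deps.size()));
    return node;
}

// Returns the final slots: [chain_0, chain_1, ion_0, ion_1].
std::vector<int> run_update_currents_graph() {
    const int n_mech = 2, n_ion = 2, n_slot = n_mech + n_ion;
    int* slots = nullptr;
    check(cudaMalloc(&slots, n_slot*sizeof(int)));
    check(cudaMemset(slots, 0, n_slot*sizeof(int)));
    check(cudaDeviceSynchronize());
    int* chain = slots;
    int* ion = slots + n_mech;

    cudaGraph_t graph;
    check(cudaGraphCreate(&graph, 0));

    nodes rev_pot, zeroed;
    for (int k = 0; k < n_mech; ++k) rev_pot.push_back(add_advance(graph, {}, chain + k, 0));
    cudaGraphNode_t join_rev_pot = add_join(graph, rev_pot);
    for (int j = 0; j < n_ion; ++j) zeroed.push_back(add_advance(graph, {join_rev_pot}, ion + j, 0));
    cudaGraphNode_t join_zero = add_join(graph, zeroed);
    for (int k = 0; k < n_mech; ++k) {
        cudaGraphNode_t mark = add_advance(graph, {join_zero}, chain + k, 1);    // mark_events_k
        cudaGraphNode_t deliver = add_advance(graph, {mark}, chain + k, 2);      // deliver_events_k
        add_advance(graph, {deliver}, chain + k, 3);                             // current_k
    }

    cudaGraphExec_t exec;
    check(cudaGraphInstantiateWithFlags(&exec, graph, 0));
    cudaStream_t stream;
    check(cudaStreamCreate(&stream));
    check(cudaGraphLaunch(exec, stream));
    check(cudaStreamSynchronize(stream));

    std::vector<int> host(n_slot);
    check(cudaMemcpy(host.data(), slots, n_slot*sizeof(int), cudaMemcpyDeviceToHost));

    check(cudaGraphExecDestroy(exec));
    check(cudaGraphDestroy(graph));
    check(cudaStreamDestroy(stream));
    check(cudaFree(slots));
    return host;
}
```

Each mechanism slot passes through four kernels (`rev_pot`, `mark_events`, `deliver_events`, `current`) and each ion slot through one (`zero`), so a correctly ordered run leaves the slots at `4, 4, 1, 1`. The per-mechanism chains share no edges, which leaves the runtime free to run them concurrently.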

We have to adjust three kernels
- nrn_current
```cpp
__global__ void nrn_current(mechanism_gpu_hh_pp_ params_) {
    // ...
    // Write to global currents, needs to be atomic adds, if parallel
    ik = gk*(v-ek);
    il = params_.gl[tid_]*(v-params_.el[tid_]);
    // ...
}
```
- nrn_state
```cpp
__global__ void nrn_state(mechanism_gpu_hh_pp_ params_) {
    // Update params_
}
```
- deliver_events
```cpp
__global__ void deliver_events(int mechid_, mechanism_gpu_exp2syn_pp_ params_, deliverable_event_stream_state events) {
    // Consume events
}
```
- arguments will change between calls
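
Since the kernel arguments differ from call to call, a graph built once would need its kernel nodes patched before each replay. One way to do that, sketched here under the assumption that the graph topology stays fixed, is `cudaGraphExecKernelNodeSetParams`, which updates a node inside an already instantiated graph. The kernel `scale` and the function `run_with_updated_args` are placeholders rather than Arbor code, and CUDA 11.4 or newer is assumed for `cudaGraphInstantiateWithFlags`.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Throw with the CUDA error string if a runtime call fails.
inline void check(cudaError_t e) {
    if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
}

// Placeholder kernel whose scalar argument `a` changes between launches.
__global__ void scale(float* x, float a, int n) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < n) x[tid] *= a;
}

std::vector<float> run_with_updated_args(int n) {
    float* x = nullptr;
    check(cudaMalloc(&x, n*sizeof(float)));
    std::vector<float> host(n, 1.0f);
    check(cudaMemcpy(x, host.data(), n*sizeof(float), cudaMemcpyHostToDevice));
    check(cudaDeviceSynchronize());

    // Kernel arguments are passed as an array of pointers to the values.
    float a = 2.0f;
    void* args[] = {&x, &a, &n};
    cudaKernelNodeParams p = {};
    p.func = (void*)scale;
    p.gridDim = dim3((n + 127)/128);
    p.blockDim = dim3(128);
    p.sharedMemBytes = 0;
    p.kernelParams = args;
    p.extra = nullptr;

    cudaGraph_t graph;
    check(cudaGraphCreate(&graph, 0));
    cudaGraphNode_t node;
    check(cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p));

    cudaGraphExec_t exec;
    check(cudaGraphInstantiateWithFlags(&exec, graph, 0));
    cudaStream_t stream;
    check(cudaStreamCreate(&stream));

    // First launch uses a = 2.
    check(cudaGraphLaunch(exec, stream));

    // Change the argument and patch the instantiated graph in place;
    // no rebuild or re-instantiation is needed.
    a = 3.0f;
    check(cudaGraphExecKernelNodeSetParams(exec, node, &p));
    check(cudaGraphLaunch(exec, stream));
    check(cudaStreamSynchronize(stream));

    check(cudaMemcpy(host.data(), x, n*sizeof(float), cudaMemcpyDeviceToHost));

    check(cudaGraphExecDestroy(exec));
    check(cudaGraphDestroy(graph));
    check(cudaStreamDestroy(stream));
    check(cudaFree(x));
    return host;
}
```

Starting from `1`, the two launches multiply by `2` and then by `3`, so every entry ends at `6`. An alternative that avoids patching nodes is to keep the changing values in device memory and pass only a stable pointer to the kernel, at the cost of an extra indirection.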
