
Notes on Reductions


Arbor performs reductions on the current, conductance, and ion concentration contributions made by mechanisms as part of the generated mechanism code. It would make more sense for the reductions to be calculated in the Arbor library itself, which would give a better separation of concerns and allow for more reproducible results. However, performing the reductions inside the mechanism code is an optimization that makes the most of cache locality and cuts memory costs.

We study the effect of moving the reduction code into the Arbor library on both runtime and memory consumption. Ideally, we would like to improve both metrics, but some degradation in performance (in one or both metrics) could be acceptable for the reasons mentioned above.

Design Proposals

We describe three approaches to reductions, focusing only on the total current (vec_i) to illustrate the methods; the same applies to the total conductances (vec_g), the ionic internal and external concentrations (Xi and Xo), and the ionic currents (iX).

1. Present implementation:

A single vec_i is read, modified, and written by every mechanism; it is not accessed in order.

for every mechanism m
  for i in range(m.num_cvs)
      vec_i[m.index[i]] += local_curr

Complexity: num_mechanisms * (cost(gather) + cost(scatter))
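
A minimal C++ sketch of this approach is given below. The `Mechanism` struct and the function name are illustrative only and mirror the pseudocode above; they do not correspond to Arbor's actual data structures.

```cpp
#include <cstddef>
#include <vector>

// Illustrative mechanism data; names are not Arbor's.
struct Mechanism {
    std::vector<std::size_t> index;  // CV index of each instance of the mechanism
    std::vector<double> local_curr;  // current contribution computed per instance
};

// Each mechanism scatters its contribution directly into the shared vec_i:
// a gather, an add, and a scatter per entry, through an out-of-order index.
void accumulate_currents(std::vector<double>& vec_i,
                         const std::vector<Mechanism>& mechanisms) {
    for (const auto& m : mechanisms) {
        for (std::size_t i = 0; i < m.index.size(); ++i) {
            vec_i[m.index[i]] += m.local_curr[i];
        }
    }
}
```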

2. Split reduction from local current calculation #1:

A local_i vector per mechanism that is written in order. The local_i vectors are merged and reduced into the main vec_i after all mechanism current updates are performed.

for every mechanism m
  for i in range(m.num_cvs)
      m.local_i[i] = local_curr

// "Smart" merge and reduce the m.local_i vectors of mechanisms then accumulate into the main vec_i

Complexity: num_mechanisms * cost(sequential_store) + cost(merge_reduce) + cost(scatter) + cost(gather)
Memory overhead: sum(num_cvs_per_mechanism)
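
A sketch of this approach, under the same illustrative (non-Arbor) naming as before. The final merge/reduce loop here is a naive stand-in for the "smart" merge described above; the point of the sketch is that the mechanism update itself performs only sequential stores.

```cpp
#include <cstddef>
#include <vector>

// Illustrative mechanism data with an extra per-mechanism local_i buffer.
struct MechanismLocal {
    std::vector<std::size_t> index;  // CV index of each instance
    std::vector<double> local_curr;  // current computed per instance
    std::vector<double> local_i;     // in-order output buffer of the mechanism
};

// Current update: only sequential stores, no indirect access.
void update_currents(std::vector<MechanismLocal>& mechanisms) {
    for (auto& m : mechanisms) {
        m.local_i.resize(m.local_curr.size());
        for (std::size_t i = 0; i < m.local_curr.size(); ++i) {
            m.local_i[i] = m.local_curr[i];
        }
    }
}

// Later, outside the mechanism kernels: merge the local_i vectors and reduce
// into vec_i. A naive stand-in for the "smart" merge and reduce.
void merge_reduce(std::vector<double>& vec_i,
                  const std::vector<MechanismLocal>& mechanisms) {
    for (const auto& m : mechanisms) {
        for (std::size_t i = 0; i < m.local_i.size(); ++i) {
            vec_i[m.index[i]] += m.local_i[i];
        }
    }
}
```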

3. Split reduction from local current calculation #2:

A single local_i vector for all the mechanisms (of size sum(num_cvs_per_mechanism)) that is written out of order. The mechanisms write their contributions into the vector such that all current contributions to a given CV are consecutive in memory. The local_i vector is then reduced in order into vec_i.

for every mechanism m
  for i in range(m.num_cvs)
      local_i[m.new_index[i]] = local_curr

// Reduce local_i into the main vec_i

Complexity: num_mechanisms * cost(scatter) + cost(reduce) + cost(gather) + cost(scatter)
Memory overhead: sum(num_cvs_per_mechanism) * 2 // additional memory for the new_index vector
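
A sketch of this approach, again with illustrative names. It assumes a precomputed `new_index` map and a `cv_partition` array marking, for each CV, the range of its contributions in `local_i`; neither name is taken from Arbor's code.

```cpp
#include <cstddef>
#include <vector>

// Mechanism update: scatter into the shared local_i via the precomputed map.
void write_local(std::vector<double>& local_i,
                 const std::vector<std::size_t>& new_index,
                 const std::vector<double>& local_curr) {
    for (std::size_t i = 0; i < local_curr.size(); ++i) {
        local_i[new_index[i]] = local_curr[i];
    }
}

// Reduction: cv_partition[c]..cv_partition[c+1] delimits the contributions to
// CV c, so local_i can be reduced in order into vec_i.
void reduce_in_order(std::vector<double>& vec_i,
                     const std::vector<double>& local_i,
                     const std::vector<std::size_t>& cv_partition) {
    for (std::size_t c = 0; c + 1 < cv_partition.size(); ++c) {
        double sum = 0.0;
        for (std::size_t j = cv_partition[c]; j < cv_partition[c + 1]; ++j) {
            sum += local_i[j];
        }
        vec_i[c] += sum;
    }
}
```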


If we have any hope of improving performance, we need to at least make sure that the performance of the mechanism current updates improves when we don't perform the reduction inside the mechanism. Because both approaches (2) and (3) touch more memory (size = sum(num_cvs_per_mechanism) instead of size = num_cvs_in_cell), the cache behaviour may affect the runtime negatively, unless we use non-temporal stores. My experiments so far show that modifying the mechanism current updates to write into local vectors as in approaches (2) and (3) slightly impedes performance relative to approach (1) (we seem to get the add and scatter from approach (1) for free), and that is without taking into account the extra reduction step that needs to be performed later.
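
For reference, the kind of non-temporal store alluded to above could look like the following AVX sketch for the sequential stores of approach (2). This is illustrative only: it assumes AVX, a 32-byte-aligned destination buffer, and is not taken from Arbor's code.

```cpp
#include <immintrin.h>
#include <cstddef>

// Write local_i with streaming (non-temporal) stores so the sequential stores
// bypass the cache. local_i must be 32-byte aligned for _mm256_stream_pd.
void store_local_nontemporal(double* local_i, const double* local_curr,
                             std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(local_curr + i);
        _mm256_stream_pd(local_i + i, v);  // non-temporal store, avoids cache pollution
    }
    for (; i < n; ++i) local_i[i] = local_curr[i];  // scalar tail
    _mm_sfence();  // make the streaming stores visible before the later reduction
}
```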

If we aim not to improve performance but merely to avoid impacting it too much, then we can implement either approach (2) or (3). Approach (3) has proven to be faster: the "smart merge" indices from approach (2) are pre-computed during model initialization and saved in the "new_index" vector of approach (3).

With both approaches (2) and (3), we need to implement a highly optimized reduction step, because in all tests of the present approach (1) the reduction seems to happen for free.

Vectorization approaches for approach (3):

1. Vectorized reduction across CV:

local_i is stored as follows (assuming 2 mechanisms present on all compartments):

i(0,0) i(1,0) i(0,1) i(1,1) i(0,2) i(1,2) ...

where i(x,y) is the contribution of mechanism x on CV y. We would use vector instructions to reduce [i(0,0) i(1,0)], [i(0,1) i(1,1)], and [i(0,2) i(1,2)] separately.
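
A scalar C++ sketch of this reduction over the interleaved layout; with a small, fixed number of mechanisms per CV the inner loop is the part that would map onto vector instructions. The names and the assumption of a constant mechanism count per CV are illustrative.

```cpp
#include <cstddef>
#include <vector>

// local_i holds num_mech contiguous contributions per CV, CV after CV.
void reduce_across_cv(std::vector<double>& vec_i,
                      const std::vector<double>& local_i,
                      std::size_t num_mech) {
    const std::size_t num_cv = local_i.size() / num_mech;
    for (std::size_t cv = 0; cv < num_cv; ++cv) {
        double sum = 0.0;
        // Contiguous contributions to this CV; for a small, fixed num_mech the
        // compiler can unroll and vectorize this short reduction.
        for (std::size_t m = 0; m < num_mech; ++m) {
            sum += local_i[cv * num_mech + m];
        }
        vec_i[cv] += sum;
    }
}
```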

2. Smart vectorization, 1 lane per CV:

local_i is stored as follows (assuming k mechanisms present on all compartments and a vector length (or warp size) of 4):

i(0,0) i(0,1) i(0,2) i(0,3) | i(1,0) i(1,1) i(1,2) i(1,3) | ... | i(k,0) i(k,1) i(k,2) i(k,3) | i(0,4) i(0,5) i(0,6) i(0,7) | i(1,4) i(1,5) i(1,6) i(1,7) | ...

where i(x,y) is the contribution of mechanism x on CV y. We would use vector instructions to reduce [i(0,0) i(0,1) i(0,2) i(0,3)] + [i(1,0) i(1,1) i(1,2) i(1,3)] + ... + [i(k,0) i(k,1) i(k,2) i(k,3)], and so on.
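
A sketch of the lane-per-CV reduction over this blocked layout, with an assumed vector width of 4; the plain array `acc` stands in for a SIMD register, and the inner lane loop corresponds to a single vector add. Tail handling (a CV count that is not a multiple of the width) is omitted, and all names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// local_i is blocked: for each group of W CVs, each mechanism's W contributions
// are contiguous, and the mechanisms' blocks follow one another.
void reduce_lane_per_cv(std::vector<double>& vec_i,
                        const std::vector<double>& local_i,
                        std::size_t num_mech) {
    constexpr std::size_t W = 4;  // vector length or warp size
    const std::size_t num_cv = local_i.size() / num_mech;
    for (std::size_t cv0 = 0; cv0 < num_cv; cv0 += W) {
        double acc[W] = {0, 0, 0, 0};  // one accumulator lane per CV
        const std::size_t block = (cv0 / W) * num_mech * W;
        for (std::size_t m = 0; m < num_mech; ++m) {
            for (std::size_t lane = 0; lane < W; ++lane) {
                acc[lane] += local_i[block + m * W + lane];  // one vector add per mechanism
            }
        }
        for (std::size_t lane = 0; lane < W; ++lane) {
            vec_i[cv0 + lane] += acc[lane];
        }
    }
}
```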

Performance numbers:

Performance numbers were obtained for the reduction implementation from approach (3), using vectorization approach (1). The changes were implemented only for the total current and total conductance (vec_i and vec_g). This was enough to run the following:

- A simulation of a model with 100 cells of around 6600 compartments each, where the dendrites of the cells have 2 mechanisms per CV.
- The runtime increases by 7% on GPU and by 2% on CPU.

The best performance (besides implementation approach (1)) would be given by implementation approach (3) with vectorization approach (2), which has not yet been implemented.