SIMD Performance nrn_current suboptimal performance

Cell under consideration:

Purkinje cell with ~4000 compartments; only 9 are painted with mechanisms: the soma and 8 compartments of the axon.
The soma has 17 mechanisms and the axon shares 6 of those mechanisms.

Simulation configuration:

Cell group size:

1 cell per cell group:
For the shared 6 mechanisms the SIMD index vectors are arranged as follows:
|___0___|___x___|__x+1__|__x+2__|__x+3__|__x+4__|__x+5__|__x+6__|__x+7__|__x+7__|__x+7__|__x+7__|
{__________INDEPENDENT__________}{__________CONTIGUOUS__________}{___________CONSTANT___________}

For the rest of the mechanisms, the SIMD index vectors are arranged as follows:
|___0___|___0___|___0___|___0___|
{___________CONSTANT___________}

In both layouts, the last 3 elements are padding. They have zero weight and therefore contribute nothing to the current and state updates.

4 cells per cell group:
For the shared 6 mechanisms the SIMD index vectors are arranged as follows:
|___0___|___x___|__x+1__|__x+2__|__x+3__|__x+4__|__x+5__|__x+6__|__x+7__|___y___|__y+x__|_y+x+1_|
{__________INDEPENDENT__________}{__________CONTIGUOUS__________}{__________INDEPENDENT__________}

|_y+x+2_|_y+x+3_|_y+x+4_|_y+x+5_|_y+x+6_|_y+x+7_|___z___|__z+x__|_z+x+1_|_z+x+2_|_z+x+3_|_z+x+4_|
{__________CONTIGUOUS__________}{__________INDEPENDENT__________}{__________CONTIGUOUS__________}

|_z+x+5_|_z+x+6_|_z+x+7_|___w___|__w+x__|_w+x+1_|_w+x+2_|_w+x+3_|_w+x+4_|_w+x+5_|_w+x+6_|_w+x+7_|
{__________INDEPENDENT__________}{__________CONTIGUOUS__________}{__________CONTIGUOUS__________}

For the rest of the mechanisms, the SIMD index vectors are arranged as follows:
|___0___|___y___|___z___|___w___|
{__________INDEPENDENT__________}
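The layouts above can be reproduced with a small classifier. This is only a sketch under assumptions, not the simulator's actual code: `x` is read as the offset of the first axon compartment within a cell, the soma indices of the four cells (0, y, z, w) are taken to be multiples of a hypothetical `cell_size`, and padding repeats the last index as the 1-cell diagram suggests. The concrete values 100 and 4000 are illustrative.

```python
WIDTH = 4  # SIMD lanes, matching the 4-wide vectors drawn above


def classify(lanes):
    """Return the index constraint of one SIMD-width group of indices."""
    if all(i == lanes[0] for i in lanes):
        return "CONSTANT"
    if all(b == a + 1 for a, b in zip(lanes, lanes[1:])):
        return "CONTIGUOUS"
    if len(set(lanes)) == len(lanes):
        return "INDEPENDENT"
    return "NONE"


def pad(indices, width=WIDTH):
    """Pad to a multiple of the SIMD width by repeating the last index."""
    extra = -len(indices) % width
    return indices + [indices[-1]] * extra


def constraints(indices, width=WIDTH):
    """Classify each SIMD vector of a padded index list."""
    padded = pad(indices, width)
    return [classify(padded[i:i + width]) for i in range(0, len(padded), width)]


# Hypothetical values: x = first axon compartment within a cell,
# cell_size = compartments per cell, so y, z, w = 4000, 8000, 12000.
x, cell_size = 100, 4000


def shared_indices(n_cells):
    """Soma plus 8 axon compartments for each cell in the group."""
    out = []
    for c in range(n_cells):
        base = c * cell_size
        out += [base] + [base + x + k for k in range(8)]
    return out


def soma_indices(n_cells):
    """Soma compartment of each cell in the group."""
    return [c * cell_size for c in range(n_cells)]


one_shared = constraints(shared_indices(1))
four_shared = constraints(shared_indices(4))
one_soma = constraints(soma_indices(1))
four_soma = constraints(soma_indices(4))

print(one_shared)
print(four_shared)
print(one_soma, four_soma)
```

Under these assumptions the classifier yields the same sequence of labels as the diagrams: one INDEPENDENT, one CONTIGUOUS and one CONSTANT vector for a shared mechanism with 1 cell per cell group, and four INDEPENDENT plus five CONTIGUOUS vectors with 4 cells per cell group.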

Number of cells: 2048
Network configuration: ring with additional randomly connected synapses (9) with zero weight.

SIMD statistics and analysis:

With the current setup, when considering density mechanisms:

1 cell per cell group:
CONTIGUOUS: 6 SIMD vectors per cell group = 20.7%
CONSTANT: 17 SIMD vectors per cell group = 58.6%
INDEPENDENT: 6 SIMD vectors per cell group = 20.7%
NONE: 0 SIMD vectors per cell group

4 cells per cell group:
CONTIGUOUS: 30 SIMD vectors per cell group = 46.15%
CONSTANT: 0 SIMD vectors per cell group
INDEPENDENT: 35 SIMD vectors per cell group = 53.85%
NONE: 0 SIMD vectors per cell group
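These totals follow from the per-mechanism vector counts in the index-vector diagrams: 6 mechanisms cover the soma and axon, and the remaining 11 (17 minus 6) cover the soma only. A minimal sketch of the arithmetic, with dictionary and function names chosen here purely for illustration:

```python
# Number of mechanisms in each group, from the cell description.
SHARED = 6           # present on soma and axon
SOMA_ONLY = 17 - 6   # present on the soma only

# SIMD vectors per mechanism, read off the index-vector diagrams.
per_mechanism = {
    "1 cell": {
        "shared": {"INDEPENDENT": 1, "CONTIGUOUS": 1, "CONSTANT": 1},
        "soma_only": {"CONSTANT": 1},
    },
    "4 cells": {
        "shared": {"INDEPENDENT": 4, "CONTIGUOUS": 5},
        "soma_only": {"INDEPENDENT": 1},
    },
}


def tally(layout):
    """Total SIMD vectors per constraint for one cell group."""
    totals = {"CONTIGUOUS": 0, "CONSTANT": 0, "INDEPENDENT": 0, "NONE": 0}
    for kind, n in layout["shared"].items():
        totals[kind] += SHARED * n
    for kind, n in layout["soma_only"].items():
        totals[kind] += SOMA_ONLY * n
    return totals


def percentages(totals):
    """Share of each constraint, in percent, rounded to 2 decimals."""
    whole = sum(totals.values())
    return {kind: round(100 * n / whole, 2) for kind, n in totals.items()}


stats = {name: tally(layout) for name, layout in per_mechanism.items()}
for name, totals in stats.items():
    print(name, totals, percentages(totals))
```

With 1 cell per cell group this gives 29 vectors in total (6 + 17 + 6), and with 4 cells per cell group 65 vectors (30 + 35), which is where the percentages come from.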

CONSTANT vector stores require a vector reduction followed by a single-element store, and CONSTANT loads require a single-element load followed by a vector broadcast. INDEPENDENT vector loads/stores are vector gathers/scatters, but on Broadwell they are essentially serialized per element. CONTIGUOUS loads/stores are plain vector loads/stores.
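The difference between the three access patterns can be illustrated with a scalar model. This sketch only mirrors the semantics, not the performance, and it assumes that stores accumulate into the target (which is what makes a reduction necessary when all lanes hit the same element). All function names are hypothetical.

```python
WIDTH = 4  # SIMD lanes; matches the 4-wide vectors in the diagrams


def load_constant(source, index, width=WIDTH):
    """CONSTANT load: one scalar load, then broadcast to every lane."""
    value = source[index]
    return [value] * width


def store_constant(target, index, lanes):
    """CONSTANT store: reduce the lanes, then one scalar store."""
    target[index] += sum(lanes)


def load_independent(source, indices):
    """INDEPENDENT load (gather): one element per lane."""
    return [source[i] for i in indices]


def store_independent(target, indices, lanes):
    """INDEPENDENT store (scatter): one element per lane."""
    for i, value in zip(indices, lanes):
        target[i] += value


def load_contiguous(source, first, width=WIDTH):
    """CONTIGUOUS load: on hardware, a single vector load."""
    return source[first:first + width]


def store_contiguous(target, first, lanes):
    """CONTIGUOUS store: on hardware, a single vector store."""
    for offset, value in enumerate(lanes):
        target[first + offset] += value


current = [0.0] * 12
contributions = [0.5, 0.25, 0.125, 0.0625]

store_contiguous(current, 4, contributions)
store_independent(current, [0, 9, 10, 11], contributions)
# Padding lanes carry zero weight, so only the first lane contributes.
store_constant(current, 8, [1.0, 0.0, 0.0, 0.0])

print(current)
print(load_constant(current, 8))
```

In this model the CONSTANT store does extra work (the `sum`) for a single useful lane, which is the overhead the 1 cell/cell group configuration pays for most of its vectors.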

The number of vector operations in nrn_state and nrn_current differs from mechanism to mechanism and depends on the number of accessed elements and the intended operation. In general, however, vector stores in nrn_state are always contiguous, while vector loads follow the categories and statistics above. Arithmetic operations in nrn_state can be quite complex. In nrn_current, arithmetic operations are few and simple, and both vector loads and stores follow the categories and statistics above.

Experiment setup:

exp 1: 1 cell/cell group - non-vectorized
exp 2: 4 cells/cell group - non-vectorized
exp 3: 1 cell/cell group - vectorized
exp 4: 4 cells/cell group - vectorized

NON-VECTORIZED

time\config	1 cell/cell group	4 cells/cell group
nrn_state	15.74 s	15.083 s
nrn_current	3.26 s	3.167 s
matrix	22.217 s	24.620 s
total	45.027 s	47.079 s

VECTORIZED

time\config	1 cell/cell group	4 cells/cell group
nrn_state	12.327 s	7.213 s
nrn_current	3.815 s	3.137 s
matrix	21.901 s	24.980 s
total	41.7 s	39.491 s

With 1 cell/cell group, nrn_current is faster in the non-vectorized version (3.26 s) than in the vectorized version (3.815 s). A likely cause is that 58.6% of the SIMD vectors are CONSTANT, and each of their stores requires an additional SIMD reduction.

However, with 4 cells/cell group, the vectorized nrn_current (3.137 s) becomes faster than the non-vectorized version (3.167 s). It is also faster than the 1 cell/cell group non-vectorized case (3.26 s), but only slightly. There are no CONSTANT SIMD vectors left, so the lack of a significant speedup could be attributed to other causes, such as the high percentage of INDEPENDENT stores, which are no better than non-vectorized serial stores, or to the fact that the arithmetic operations in nrn_current are not computationally intensive.

nrn_state benefits much more from the shift from 1 cell/cell group to 4 cells/cell group. However, the speedup does not yet warrant switching the default to 4 cells/cell group, which would require parallelizing the matrix assembly and solve functions.
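The speedups discussed here can be recomputed directly from the two timing tables. The following sketch just divides the non-vectorized time by the vectorized time for each row and configuration; the variable and function names are illustrative.

```python
# Timings in seconds, copied from the two tables:
# (1 cell/cell group, 4 cells/cell group)
non_vectorized = {
    "nrn_state": (15.74, 15.083),
    "nrn_current": (3.26, 3.167),
    "matrix": (22.217, 24.620),
    "total": (45.027, 47.079),
}
vectorized = {
    "nrn_state": (12.327, 7.213),
    "nrn_current": (3.815, 3.137),
    "matrix": (21.901, 24.980),
    "total": (41.7, 39.491),
}


def speedups(baseline, candidate):
    """Per-row speedup of candidate over baseline (values > 1 are faster)."""
    return {
        name: tuple(b / c for b, c in zip(baseline[name], candidate[name]))
        for name in baseline
    }


simd_speedup = speedups(non_vectorized, vectorized)
for name, (one, four) in simd_speedup.items():
    print(f"{name:12s} 1 cell: {one:.2f}x   4 cells: {four:.2f}x")
```

For nrn_state the ratio is roughly 1.28x with 1 cell/cell group and 2.09x with 4 cells/cell group, while for nrn_current it is roughly 0.85x and 1.01x. The matrix row stays close to 1x in both configurations, as expected for a part of the code that is not vectorized in these experiments.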
