
Compiler Options

Since SIMDe often relies heavily on autovectorization, compiler options are critical. We do what we can to help the compiler when possible, and we try to include optimized versions whenever we can, but there are things the compiler can optimize which we can't.

-O3

First, use -O3 (at least). There really is a lot of stuff in SIMDe which will vectorize at -O3 but not at -O2. Take a look at GCC's documentation of -O3, taking note especially of -ftree-loop-vectorize and -ftree-slp-vectorize; those are incredibly important optimizations for SIMDe, since they make the compiler look at all of our little loops and try to turn each one into a single instruction.
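For example, a little function like this one (a hypothetical stand-in for the kind of loop SIMDe's portable fallbacks are full of) will typically be turned into a couple of vector instructions at -O3, while many compiler versions emit scalar code for it at -O2:

/* Hypothetical example: the kind of small loop SIMDe's portable
   implementations contain.  -ftree-loop-vectorize (enabled by -O3)
   is what lets the compiler turn this into SIMD instructions. */
void scale_f32x4(float r[4], const float a[4], float s) {
  for (int i = 0 ; i < 4 ; i++) {
    r[i] = a[i] * s;
  }
}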

I know a lot of people have been told that they shouldn't use -O3 because it can break their code. To be clear, -O3 does not enable unsafe optimizations; it enables expensive (at compile time) optimizations. -Ofast (and the -ffast-math flag it enables) does contain some unsafe optimizations (depending on your definition of "unsafe", anyways), and we'll get to them below, but -O3 does not.

Historically, I think the idea that -O3 was unsafe came from two places. First, compilers were more buggy in the past, and since -O3 enables more cutting-edge optimizations you are more likely to run into one of those bugs. I'm not saying compilers don't contain bugs anymore (we have found lots of them while developing SIMDe), but SIMDe is very well tested and we work around all the bugs we find.

The other thing which led people to believe -O3 is unsafe is that it can expose bugs in your code which were dormant at other optimization levels. Generally it's because you're depending on undefined behavior, and at -O3 the compiler is performing more optimizations which means it's more likely to perform an optimization which assumes you're not relying on undefined behavior. UBSan can help you find places where your code is relying on undefined behavior so you can eliminate it.

-Ofast and -ffast-math

While -O3 shouldn't break correct code, -Ofast can. According to the description in GCC's documentation, -Ofast will

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens.

Of those, -ffast-math is particularly interesting for SIMDe. I won't go into detail about exactly what it does here; GCC's documentation has some information. What you should know is that if you use -ffast-math (including through -Ofast), SIMDe will enable some internal optimization flags (specifically SIMDE_FAST_MATH, described below).
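As a rough sketch of how that detection can work: GCC and clang predefine __FAST_MATH__ when fast math is enabled (including via -Ofast), so a header can do something like the following. SIMDe's actual logic is a bit more involved.

/* Rough sketch of -ffast-math detection, not SIMDe's exact code. */
#if !defined(SIMDE_FAST_MATH) && defined(__FAST_MATH__)
  #define SIMDE_FAST_MATH
#endif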

OpenMP SIMD

OpenMP 4 includes support for SIMD parallelism by annotating loops with pragmas. For example:

void add_f32x4(float r[4], const float a[4], const float b[4]) {
  #pragma omp simd
  for (int i = 0 ; i < 4 ; i++) {
    r[i] = a[i] + b[i];
  }
}

SIMDe uses these annotations very extensively. We wrap them up in the SIMDE_VECTORIZE macros so you won't necessarily see them in the code directly, but most of the portable implementations in SIMDe use them; I currently count 2667 instances of "SIMDE_VECTORIZE" in SIMDe.
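Roughly speaking, when OpenMP SIMD is available the macro boils down to something like this (a simplified sketch; the real definition in SIMDe also covers compilers without OpenMP SIMD support):

/* Simplified sketch of a SIMDE_VECTORIZE-style macro. */
#if defined(SIMDE_ENABLE_OPENMP)
  #define SIMDE_VECTORIZE _Pragma("omp simd")
#else
  #define SIMDE_VECTORIZE
#endif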

I think a lot of people get scared off when they see "OpenMP" because they don't want to use the OpenMP runtime, which is used for multi-threading. However, the OpenMP SIMD pragma doesn't have anything to do with the OpenMP runtime; all the magic happens at compile time. In fact, several compilers support options to enable OpenMP 4 SIMD support without enabling full OpenMP support (-qopenmp-simd for the Intel C/C++ Compiler, -fopenmp-simd for GCC and clang); in this case, the OpenMP runtime won't even be linked.

The downside of using -fopenmp-simd instead of -fopenmp is that the compiler doesn't communicate that OpenMP SIMD is enabled in a way we can observe in the source code (i.e., there is no _OPENMP_SIMD macro), so SIMDe doesn't know that it is enabled and the SIMDE_VECTORIZE macros won't emit OpenMP SIMD pragmas. To get around this, you'll need to define the SIMDE_ENABLE_OPENMP macro when compiling. For example: -fopenmp-simd -DSIMDE_ENABLE_OPENMP instead of just -fopenmp-simd.

SIMDe Configuration Options

Enabled by SIMDE_FAST_MATH

There are several macros you can define to get SIMDe to output faster code as a trade-off for something else. Most of these can be enabled by defining the SIMDE_FAST_MATH macro, which is also defined automatically when you pass -ffast-math. Here are the individual options which are enabled if you define SIMDE_FAST_MATH:

SIMDE_FAST_NANS

In my experience, most software doesn't really handle NaNs, or "handles" them by avoiding generating them. NaNs usually result from bad data which causes your code to do something like dividing zero by zero, taking the square root of a negative number, etc.

Different platforms tend to have roughly equivalent functions which handle NaN very differently. For example, consider the x86 _mm_min_ps function and vminq_f32. These functions are both intended to return the minimum of two values, but if one of those values is NaN they behave differently:

a     b     _mm_min_ps  vminq_f32
Real  Real  Real        Real
Real  NaN   NaN         NaN
NaN   Real  Real        NaN
NaN   NaN   NaN         NaN

In SIMDe, that means that we can't normally implement _mm_min_ps using vminq_f32; we have to do something like vec_sel(b, a, vec_cmpgt(b, a)) (shown here with AltiVec intrinsics). Going the other direction, we can't implement vminq_f32 using _mm_min_ps alone; we have to do something like _mm_blendv_ps(_mm_set1_ps(NaN), _mm_min_ps(a, b), _mm_cmpord_ps(a, b)).
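To make that second case concrete, here is a sketch of NEON-style NaN propagation built from SSE intrinsics (for illustration only, not SIMDe's actual implementation):

#include <immintrin.h>  /* _mm_blendv_ps requires SSE4.1 */
#include <math.h>       /* NAN */

/* Sketch: a min which returns NaN if either input is NaN, like vminq_f32. */
static __m128 min_ps_propagate_nan(__m128 a, __m128 b) {
  __m128 ordered = _mm_cmpord_ps(a, b);   /* all-ones in lanes where neither input is NaN */
  return _mm_blendv_ps(_mm_set1_ps(NAN),  /* NaN where either input is NaN */
                       _mm_min_ps(a, b),  /* the ordinary minimum elsewhere */
                       ordered);
}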

SIMDE_FAST_NANS tells SIMDe to just go ahead and ignore issues like this and implement _mm_min_ps and vminq_f32 using one another. The vast majority of applications don't really care how NaNs are handled because there should never be any NaNs in the first place, and even if there are, the code doesn't really know what to do with them. In these cases, SIMDE_FAST_NANS can provide a significant speed-up for free.

SIMDE_FAST_EXCEPTIONS

In many cases, we only want to apply an operation to part of a vector, but it's a lot faster to apply the operation to the entire vector and then blend the lanes we're interested in with the original vector. Unfortunately, if we do this and there is garbage in the lanes we're not interested in, we can end up with spurious floating-point exceptions. SIMDE_FAST_EXCEPTIONS tells SIMDe to go ahead and ignore this, which is safe for most applications.

To be clear, these aren't C++ exceptions. If you're not using functions like _mm_getcsr or fegetexcept, you're not doing anything with these exceptions anyways.

The _mm_*_ss and _mm_*_sd functions are a good example of this; they only operate on the lowest element in the input. To get around this, SIMDe will first broadcast the lowest lane to all elements, then perform the operation, then blend the result into the lane we're interested in. Unless, of course, you use SIMDE_FAST_EXCEPTIONS.
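For example, a broadcast-then-blend version of an _mm_add_ss-style operation might look something like this. (A hypothetical sketch; it's written with SSE intrinsics for familiarity even though x86 has a native _mm_add_ss; SIMDe needs this dance when implementing such functions on platforms which only have full-vector operations.)

#include <immintrin.h>

/* Sketch: scalar-style add without letting garbage in the upper
   lanes raise spurious floating-point exceptions. */
static __m128 add_ss_no_spurious_exceptions(__m128 a, __m128 b) {
  __m128 a0 = _mm_shuffle_ps(a, a, 0);  /* broadcast lane 0 of a to every lane */
  __m128 b0 = _mm_shuffle_ps(b, b, 0);  /* broadcast lane 0 of b to every lane */
  __m128 sum = _mm_add_ps(a0, b0);      /* every lane now holds a[0] + b[0] */
  return _mm_move_ss(a, sum);           /* lane 0 from sum, lanes 1-3 from a */
}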

This is likely to be much more important in the future; we're currently mostly ignoring this issue for predicated instruction sets like AVX-512 and SVE, but once that changes not using SIMDE_FAST_EXCEPTIONS will likely result in a major performance hit.

SIMDE_FAST_ROUND_TIES

Some functions have equivalents on different platforms which are the same except for how ties are rounded. For example, on x86 _mm_cvtps_epi32 seems to do pretty much the same thing as vcvtnq_s32_f32, but _mm_cvtps_epi32 uses the current rounding mode, whereas vcvtnq_s32_f32 always rounds ties towards even (which is the default rounding mode).

SIMDE_FAST_ROUND_TIES tells SIMDe to just ignore these differences.
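You can see the current-rounding-mode dependence with plain scalar code; rintf honors the current rounding mode much like _mm_cvtps_epi32 does. (A small demo, not SIMDe code; compile without fast-math so the rounding-mode change isn't optimized away.)

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
  /* The default mode is round-to-nearest, ties-to-even: both print 2,
     which is what vcvtnq_s32_f32 always does. */
  printf("%.0f %.0f\n", rintf(1.5f), rintf(2.5f));  /* 2 2 */

  /* After changing the mode, a current-mode conversion like
     _mm_cvtps_epi32 follows it; vcvtnq_s32_f32 would not. */
  fesetround(FE_UPWARD);
  printf("%.0f %.0f\n", rintf(1.5f), rintf(2.5f));  /* 2 3 */
  return 0;
}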

SIMDE_FAST_ROUND_MODE

This is a bit more heavy-handed than SIMDE_FAST_ROUND_TIES, as it applies to which rounding mode is used (usually truncation or towards even, but potentially also floor or ceiling).

Note that this mode only applies to functions where rounding is not the primary operation. For example, _mm_floor_ps will always round down, even if SIMDE_FAST_ROUND_MODE is defined.

SIMDE_FAST_CONVERSION_RANGE

For functions which convert from floating-point to integer types, there can be differences between platforms regarding how out-of-range values are handled. For example, if you're converting from 32-bit floats to 32-bit ints, how values outside of [INT32_MIN, INT32_MAX] are handled can vary; maybe on one platform out-of-range values return 0, whereas on others they are saturated. SIMDE_FAST_CONVERSION_RANGE allows SIMDe to ignore these differences.
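Here's a small demo of the difference on x86 (out-of-range inputs to _mm_cvtps_epi32 produce the "integer indefinite" value INT32_MIN, whereas NEON's vcvtq_s32_f32 saturates to INT32_MAX):

#include <stdio.h>
#include <immintrin.h>

int main(void) {
  /* 1e10f is far outside [INT32_MIN, INT32_MAX]. */
  __m128i r = _mm_cvtps_epi32(_mm_set1_ps(1e10f));
  printf("%d\n", _mm_cvtsi128_si32(r));  /* prints -2147483648 on x86 */
  return 0;
}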

It's worth noting that out-of-range conversions are undefined behavior in C, per § 6.3.1.4 of the standard:

When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.

The SIMD APIs which SIMDe provides, though, were originally hardware-specific, so out-of-range conversions are defined by the hardware and SIMDe has to honor that. Most applications, however, do not rely on out-of-range conversions, so it should generally be pretty safe to enable this.

Finding Performance Problems

Not all of the implementations in SIMDe are as well-optimized as they could be. While our goal is to make sure that every function in SIMDe is as fast as we can make it on every platform, the reality is that SIMDe is enormous and our resources are limited, which means we have to focus our efforts. To that end, if you have real-world code using SIMDe on any target, any data you have about where in SIMDe your code is spending time would be extremely valuable to us.

Profiling tools and usage will vary by platform, but we'll take whatever we can get. If you'd like to add a section (or link) below on gathering profiling data on a specific platform or using a specific tool, please feel free.

Since SIMDe inlines everything by default, getting good profiling data can be a bit tricky. However, defining SIMDE_NO_INLINE prior to including SIMDe (for example, by adding -DSIMDE_NO_INLINE to your CFLAGS or CXXFLAGS environment variable(s)) will change this to never inlining. It will be a big performance hit, so you should never do it in production, but it can be invaluable for profiling!
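For example (with whichever SIMDe header your project actually uses; simde/x86/sse2.h here is just a placeholder):

/* Define before the first SIMDe include, or pass -DSIMDE_NO_INLINE,
   so each function keeps its own symbol and shows up in profiles. */
#define SIMDE_NO_INLINE
#include "simde/x86/sse2.h"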

WebAssembly

To figure out exactly what is slow, we use V8's built-in sampling profiler.

We also need to enable debug information to get function names (pass -g to the compiler), and turn off SIMDe's inlining declarations (using -DSIMDE_NO_INLINE) so that we can see which intrinsics take the most time.

Run benchmarks using d8 --prof Test.js, which generates a v8.log in the current working directory.

Next, process the generated output using tools/linux-tick-processor v8.log.