Introduction

What is SIMD

History of SIMD in Rust

Discover packed_simd

Writing fast and portable SIMD algorithms using packed_simd is, unfortunately, not trivial. There are many pitfals that one should be aware of, and some idioms that help avoid those pitfalls.

This book attempts to document these best practices and provides practical examples on how to apply the tips to your code.

Floating-point math

This chapter contains information pertaining to working with floating-point numbers.

Short Vector Math Library

Approximate functions

Fused Multiply Add

Enabling target features

Not all processors of a certain architecture will have SIMD processing units, and using a SIMD instruction which is not supported will trigger undefined behavior.

To allow building safe, portable programs, the Rust compiler will not, by default, generate any sort of vector instructions, unless it can statically determine they are supported. For example, on AMD64, SSE2 support is architecturally guaranteed. The x86_64-apple-darwin target enables up to SSSE3. The get a defintive list of which features are enabled by default on various platforms, refer to the target specifications in the compiler's source code.

Using RUSTFLAGS

One of the easiest ways to benefit from SIMD is to allow the compiler to generate code using certain vector instruction extensions.

The environment variable RUSTFLAGS can be used to pass options for code generation to the Rust compiler. These flags will affect all compiled crates.

There are two flags which can be used to enable specific vector extensions:

target-feature

  • Syntax: -C target-feature=<features>

  • Provides the compiler with a comma-separated set of instruction extensions to enable.

    Example: Use -C target-features=+sse3,+avx to enable generating instructions for Streaming SIMD Extensions 3 and Advanced Vector Extensions.

  • To list target triples for all targets supported by Rust, use:

    rustc --print target-list
    
  • To list all support target features for a certain target triple, use:

    rustc --target=${TRIPLE} --print target-features
    
  • Note that all CPU features are independent, and will have to be enabled individually.

    Example: Setting -C target-features=+avx2 will not enable fma, even though all CPUs which support AVX2 also support FMA. To enable both, one has to use -C target-features=+avx2,+fma

  • Some features also depend on other features, which need to be enabled for the target instructions to be generated.

    Example: Unless v7 is specified as the target CPU (see below), to enable NEON on ARM it is necessary to use -C target-feature=+v7,+neon.

target-cpu

  • Syntax: -C target-cpu=<cpu>

  • Sets the identifier of a CPU family / model for which to build and optimize the code.

    Example: RUSTFLAGS='-C target-cpu=cortex-a75'

  • To list all supported target CPUs for a certain target triple, use:

    rustc --target=${TRIPLE} --print target-cpus
    

    Example:

    rustc --target=i686-pc-windows-msvc --print target-cpus
    
  • The compiler will translate this into a list of target features. Therefore, individual feature checks (#[cfg(target_feature = "...")]) will still work properly.

  • It will cause the code generator to optimize the generated code for that specific CPU model.

  • Using native as the CPU model will cause Rust to generate and optimize code for the CPU running the compiler. It is useful when building programs which you plan to only use locally. This should never be used when the generated programs are meant to be run on other computers, such as when packaging for distribution or cross-compiling.

The target_feature attribute

Inlining

Detecting host features at runtime

Bounds checking

Reading and writing packed vectors to/from slices is checked by default. Independently of the configuration options used, the safe functions:

  • Simd<[T; N]>::from_slice_aligned(& s[..])
  • Simd<[T; N]>::write_to_slice_aligned(&mut s[..])

always check that:

  • the slice is big enough to hold the vector
  • the slice is suitably aligned to perform an aligned load/store for a Simd<[T; N]> (this alignment is often much larger than that of T).

There are _unaligned versions that use unaligned load and stores, as well as unsafe _unchecked that do not perform any checks iff debug-assertions = false / debug = false. That is, the _unchecked methods do still assert size and alignment in debug builds and could also do so in release builds depending on the configuration options.

These assertions do often significantly impact performance and you should be aware of them.

Vertical and horizontal operations

In SIMD terminology, each vector has a certain "width" (number of lanes). A vector processor is able to perform two kinds of operations on a vector:

  • Vertical operations: operate on two vectors of the same width, result has same width

Example: vertical addition of two f32x4 vectors

  %0     == | 2 | -3.5 |  0 | 7 |
              +     +     +   +
  %1     == | 4 |  1.5 | -1 | 0 |
              =     =     =   =
%0 + %1  == | 6 |  -2  | -1 | 7 |
  • Horizontal operations: reduce the elements of two vectors in some way, the result's elements combine information from the two original ones

Example: horizontal addition of two u64x2 vectors

  %0     == | 1 |  3 |
              └─+───┘
                └───────┐
                        │
  %1     == | 4 | -1 |  │
              └─+──┘    │
                └───┐   │
                    │   │
              ┌─────│───┘
              ▼     ▼
%0 + %1  == | 4 |   3 |

Performance consideration of horizontal operations

The result of vertical operations, like vector negation: -a, for a given lane, does not depend on the result of the operation for the other lanes. The result of horizontal operations, like the vector sum reduction: a.sum(), depends on the value of all vector lanes.

In virtually all architectures vertical operations are fast, while horizontal operations are, by comparison, very slow.

Consider the following two functions for computing the sum of all f32 values in a slice:


# #![allow(unused_variables)]
#fn main() {
fn fast_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum = f32x4::splat(0.); // [0., 0., 0., 0.]
    for i in (0..x.len()).step_by(4) {
        sum += f32x4::from_slice_unaligned(&x[i..]);
    }
    sum.sum()
}

fn slow_sum(x: &[f32]) -> f32 {
    assert!(x.len() % 4 == 0);
    let mut sum: f32 = 0.;
    for i in (0..x.len()).step_by(4) {
        sum += f32x4::from_slice_unaligned(&x[i..]).sum();
    }
    sum
}
#}

The inner loop over the slice is where the bulk of the work actually happens. There, the fast_sum function perform vertical operations into a vector, doing a single horizontal reduction at the end, while the slow_sum function performs horizontal vector operations inside of the loop.

On all widely-used architectures, fast_sum is a large constant factor faster than slow_sum. You can run the slice_sum example and see for yourself. On the particular machine tested there the algorithm using the horizontal vector addition is 2.7x slower than the one using vertical vector operations!

Performance profiling

While the rest of the book provides practical advice on how to improve the performance of SIMD code, this chapter is dedicated to performance profiling. Profiling consists of recording a program's execution in order to identify program hotspots.

Important: most profilers require debug information in order to accurately link the program hotspots back to the corresponding source code lines. Rust will disable debug info generation by default for optimized builds, but you can change that in your Cargo.toml.

Performance profiling on Linux

Using perf

perf is the most powerful performance profiler for Linux, featuring support for various hardware Performance Monitoring Units, as well as integration with the kernel's performance events framework.

We will only look at how can the perf command can be used to profile SIMD code. Full system profiling is outside of the scope of this book.

Recording

The first step is to record a program's execution during an average workload. It helps if you can isolate the parts of your program which have performance issues, and set up a benchmark which can be easily (re)run.

Build the benchmark binary in release mode, after having enabled debug info:

$ cargo build --release
Finished release [optimized + debuginfo] target(s) in 0.02s

Then use the perf record subcommand:

$ perf record --call-graph=dwarf ./target/release/my-program
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2,356 MB perf.data (292 samples) ]

Instead of using --call-graph=dwarf, which can become pretty slow, you can use --call-graph=lbr if you have a processor with support for Last Branch Record (i.e. Intel Haswell and newer).

perf will, by default, record the count of CPU cycles it takes to execute various parts of your program. You can use the -e command line option to enable other performance events, such as cache-misses. Use perf list to get a list of all hardware counters supported by your CPU.

Viewing the report

The next step is getting a bird's eye view of the program's execution. perf provides a ncurses-based interface which will get you started.

Use perf report to open a visualization of your program's performance:

perf report --hierarchy -M intel

--hierarchy will display a tree-like structure of where your program spent most of its time. -M intel enables disassembly output with Intel syntax, which is subjectively more readable than the default AT&T syntax.

Here is the output from profiling the nbody benchmark:

- 100,00% nbody
  - 94,18% nbody
    + 93,48% [.] nbody_lib::simd::advance
    + 0,70% [.] nbody_lib::run
    + 5,06% libc-2.28.so

If you move with the arrow keys to any node in the tree, you can the press a to have perf annotate that node. This means it will:

  • disassemble the function

  • associate every instruction with the percentage of time which was spent executing it

  • interleaves the disassembly with the source code, assuming it found the debug symbols (you can use s to toggle this behaviour)

perf will, by default, open the instruction which it identified as being the hottest spot in the function:

0,76  │ movapd xmm2,xmm0
0,38  │ movhlps xmm2,xmm0
      │ addpd  xmm2,xmm0
      │ unpcklpd xmm1,xmm2
12,50 │ sqrtpd xmm0,xmm1
1,52  │ mulpd  xmm0,xmm1

In this case, sqrtpd will be highlighted in red, since that's the instruction which the CPU spends most of its time executing.

Using Valgrind

Valgrind is a set of tools which initially helped C/C++ programmers find unsafe memory accesses in their code. Nowadays the project also has

  • a heap profiler called massif

  • a cache utilization profiler called cachegrind

  • a call-graph performance profiler called callgrind

Machine code analysis tools

The microarchitecture of modern CPUs

While you might have heard of Instruction Set Architectures, such as x86 or arm or mips, the term microarchitecture (also written here as µ-arch), refers to the internal details of an actual family of CPUs, such as Intel's Haswell or AMD's Jaguar.

Replacing scalar code with SIMD code will improve performance on all CPUs supporting the required vector extensions. However, due to microarchitectural differences, the actual speed-up at runtime might vary.

Example: a simple example arises when optimizing for AMD K8 CPUs. The assembly generated for an empty function should look like this:

nop
ret

The nop is used to align the ret instruction for better performance. However, the compiler will actually generated the following code:

repz ret

The repz instruction will repeat the following instruction until a certain condition. Of course, in this situation, the function will simply immediately return, and the ret instruction is still aligned. However, AMD K8's branch predictor performs better with the latter code.

For those looking to absolutely maximize performance for a certain target µ-arch, you will have to read some CPU manuals, or ask the compiler to do it for you with -C target-cpu.

Summary of CPU internals

Modern processors are able to execute instructions out-of-order for better performance, by utilizing tricks such as branch prediction, instruction pipelining, or superscalar execution.

SIMD instructions are also subject to these optimizations, meaning it can get pretty difficult to determine where the slowdown happens. For example, if the profiler reports a store operation is slow, one of two things could be happening:

  • the store is limited by the CPU's memory bandwidth, which is actually an ideal scenario, all things considered;

  • memory bandwidth is nowhere near its peak, but the value to be stored is at the end of a long chain of operations, and this store is where the profiler encountered the pipeline stall;

Since most profilers are simple tools which don't understand the subtleties of instruction scheduling, you

Analyzing the machine code

Certain tools have knowledge of internal CPU microarchitecture, i.e. they know

  • how many physical register files a CPU actually has

  • what is the latency / throughtput of an instruction

  • what µ-ops are generated for a set of instructions

and many other architectural details.

These tools are therefore able to provide accurate information as to why some instructions are inefficient, and where the bottleneck is.

The disadvantage is that the output of these tools requires advanced knowledge of the target architecture to understand, i.e. they cannot point out what the cause of the issue is explicitly.

Intel's Architecture Code Analyzer (IACA)

IACA is a free tool offered by Intel for analyzing the performance of various computational kernels.

Being a proprietary, closed source tool, it only supports Intel's µ-arches.

llvm-mca