Performance profiling on Linux

Using perf

perf is one of the most powerful performance profilers for Linux, with support for various hardware Performance Monitoring Units and integration with the kernel's performance events framework.

We will only look at how the perf command can be used to profile SIMD code. Full system profiling is outside the scope of this book.


The first step is to record a program's execution during an average workload. It helps if you can isolate the parts of your program which have performance issues, and set up a benchmark which can be easily (re)run.
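
As a rough illustration of such a benchmark, here is a minimal sketch of a harness which runs one isolated function repeatedly on a fixed input. The names hot_function and run_benchmark, the input data and the iteration count are all hypothetical and stand in for the code you actually want to profile.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the code under investigation: sums the
/// Euclidean lengths of a list of 2D vectors.
fn hot_function(points: &[(f64, f64)]) -> f64 {
    points.iter().map(|&(x, y)| (x * x + y * y).sqrt()).sum()
}

/// Runs `hot_function` `iterations` times on the same input and returns
/// the last result together with the total elapsed time.
fn run_benchmark(points: &[(f64, f64)], iterations: u32) -> (f64, Duration) {
    let start = Instant::now();
    let mut result = 0.0;
    for _ in 0..iterations {
        // black_box keeps the optimizer from hoisting or deleting the call.
        result = black_box(hot_function(black_box(points)));
    }
    (result, start.elapsed())
}

fn main() {
    // A fixed, deterministic workload, so that every run is comparable.
    let points: Vec<(f64, f64)> = (0..1_000)
        .map(|i| (3.0 * i as f64, 4.0 * i as f64))
        .collect();
    let (result, elapsed) = run_benchmark(&points, 10_000);
    println!("result = {result}, elapsed = {elapsed:?}");
}
```

Keeping the workload deterministic means that two recordings of the same binary can be compared directly, and the repeated calls give perf enough samples inside the function of interest.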

Build the benchmark binary in release mode, after having enabled debug info (for example, by setting debug = true under [profile.release] in Cargo.toml):

$ cargo build --release
Finished release [optimized + debuginfo] target(s) in 0.02s

Then use the perf record subcommand:

$ perf record --call-graph=dwarf ./target/release/my-program
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2,356 MB (292 samples) ]

Instead of using --call-graph=dwarf, which can become pretty slow because perf saves a chunk of the stack with every sample and unwinds it afterwards, you can use --call-graph=lbr if your processor supports Last Branch Record call stacks (on Intel, Haswell and newer).

By default, perf samples the CPU cycles event, which shows how much time is spent executing the various parts of your program. You can use the -e command line option to select other performance events, such as cache-misses. Use perf list to get a list of the events supported on your system.

Viewing the report

The next step is getting a bird's eye view of the program's execution. perf provides an interactive terminal interface which will get you started.

Use perf report to open a visualization of your program's performance:

$ perf report --hierarchy -M intel

--hierarchy will display a tree-like structure of where your program spent most of its time. -M intel enables disassembly output with Intel syntax, which is subjectively more readable than the default AT&T syntax.

Here is the output from profiling the nbody benchmark:

- 100,00% nbody
  - 94,18% nbody
    + 93,48% [.] nbody_lib::simd::advance
    + 0,70% [.] nbody_lib::run
    + 5,06%

If you move to any node in the tree with the arrow keys, you can then press a to have perf annotate that node. This means it will:

  • disassemble the function

  • associate every instruction with the percentage of time which was spent executing it

  • interleave the disassembly with the source code, assuming it found the debug symbols (you can use s to toggle this behaviour)

perf will, by default, jump to the instruction which it identified as being the hottest spot in the function:

0,76  │ movapd xmm2,xmm0
0,38  │ movhlps xmm2,xmm0
      │ addpd  xmm2,xmm0
      │ unpcklpd xmm1,xmm2
12,50 │ sqrtpd xmm0,xmm1
1,52  │ mulpd  xmm0,xmm1

In this case, sqrtpd will be highlighted in red, since that's the instruction which the CPU spends most of its time executing.
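
To relate such a hot spot back to the source, it helps to look for square roots in the annotated function. The following scalar sketch is hypothetical and is not the benchmark's actual code: the name interaction_magnitude and its parameters are assumptions made for illustration. It computes the kind of scaling factor found in a typical n-body step, where a square root is followed by a multiplication for every pairwise interaction.

```rust
/// Hypothetical helper, not taken from the nbody benchmark: returns
/// dt / |d|^3 for the displacement vector d = (dx, dy, dz).
fn interaction_magnitude(dx: f64, dy: f64, dz: f64, dt: f64) -> f64 {
    let distance_squared = dx * dx + dy * dy + dz * dz;
    // On x86-64 the square root below typically compiles to a hardware
    // sqrt instruction (sqrtsd in scalar code, sqrtpd when vectorized),
    // and its result feeds straight into the multiplication.
    dt / (distance_squared * distance_squared.sqrt())
}

fn main() {
    // d = (3, 4, 0) has length 5, so the result is 1 / 125.
    let magnitude = interaction_magnitude(3.0, 4.0, 0.0, 1.0);
    println!("magnitude = {magnitude}");
}
```

Square roots and divisions have a much higher latency than additions or multiplications, so it is not surprising to see them dominate the profile of code shaped like this.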

Using Valgrind

Valgrind is a set of tools which initially helped C/C++ programmers find unsafe memory accesses in their code. Nowadays the project also has:

  • a heap profiler called massif

  • a cache utilization profiler called cachegrind

  • a call-graph performance profiler called callgrind