While you might have heard of Instruction Set Architectures, such as
MIPS, the term microarchitecture (also written here as µ-arch)
refers to the internal implementation details of a concrete family of CPUs, such as Intel's
Haswell or AMD's Jaguar.
Replacing scalar code with SIMD code will improve performance on all CPUs supporting the required vector extensions. However, due to microarchitectural differences, the actual speed-up at runtime might vary.
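As a minimal illustration of the scalar-to-SIMD rewrite discussed above (the function names are mine, not from the text, and an x86 target with SSE is assumed), both routines below compute the same element-wise sum; the SSE version handles four floats per instruction, yet the observed speed-up still varies from one microarchitecture to another:

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target */

/* Scalar version: one float per iteration. */
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SSE version: four floats per iteration, plus a scalar tail. */
void add_sse(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* tail when n % 4 != 0 */
        c[i] = a[i] + b[i];
}
```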
Example: a simple case arises when optimizing for AMD K8 CPUs. You might expect the assembly generated for an empty function to consist of a lone
ret, with a nop used to align the
ret instruction for better performance.
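A sketch of what this expected listing might look like (the original snippet is missing here; AT&T-style syntax assumed):

```asm
nop    # padding so that ret lands on the desired alignment
ret
```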
However, the compiler will actually generate a
ret carrying a
repz prefix. The rep prefix normally repeats a string instruction until a
condition is met; attached to ret it has no architectural effect, so in this situation the function will simply immediately
return, and the
ret instruction is still aligned.
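A sketch of the actual listing (the original snippet is missing here; repz ret is the well-known two-byte sequence GCC emits when tuning for K8):

```asm
repz ret    # rep prefix before ret; still just a plain return
```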
AMD K8's branch predictor, however, performs better with the latter code: it predicts the two-byte repz ret more reliably than a lone single-byte ret.
If you are looking to absolutely maximize performance for a certain target µ-arch,
you will have to read the CPU vendor's optimization manuals, or ask the compiler to do it for you (e.g. via GCC's and Clang's -mtune option).
SIMD instructions are also subject to these optimizations, so it can get pretty difficult to determine where a slowdown happens. For example, if the profiler reports that a store operation is slow, one of two things could be happening:
- the store is limited by the CPU's memory bandwidth, which is actually an ideal scenario, all things considered;
- memory bandwidth is nowhere near its peak, but the value to be stored sits at the end of a long chain of dependent operations, and this store is simply where the profiler observed the pipeline stall.
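To make the second case concrete, here is a minimal hedged sketch (the function name and constants are mine, not from the text): the stored value is produced by a chain of dependent multiply-adds, so each iteration's store must wait for the whole chain, and a sampling profiler will often attribute the resulting stall cycles to the store instruction itself.

```c
/* Illustrative kernel: out[i] sits at the end of a dependency chain.
 * The store is not bandwidth-limited; it is merely where the stall
 * from the serial chain becomes visible to a profiler. */
void chained_store(const double *a, double *out, int n) {
    for (int i = 0; i < n; i++) {
        double x = a[i];
        /* four dependent steps: no instruction-level parallelism here */
        x = x * 1.0000001 + 0.5;
        x = x * 1.0000001 + 0.5;
        x = x * 1.0000001 + 0.5;
        x = x * 1.0000001 + 0.5;
        out[i] = x;  /* the profiler tends to "blame" this instruction */
    }
}
```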
Since most profilers are simple tools which don't understand the subtleties of instruction scheduling, you will need more specialized tools to tell these cases apart.
Certain tools have knowledge of internal CPU microarchitecture, i.e. they know:
- how many physical register files a CPU actually has;
- the latency / throughput of each instruction;
- which µ-ops are generated for a set of instructions;
and many other architectural details.
These tools are therefore able to provide accurate information as to why some instructions are inefficient, and where the bottleneck is.
The disadvantage is that the output of these tools requires advanced knowledge of the target architecture to interpret; that is, they cannot explicitly point out the cause of the issue.
IACA (Intel Architecture Code Analyzer) is a free tool offered by Intel for analyzing the performance of various computational kernels.
Being a proprietary, closed-source tool, it only supports Intel's µ-arches.
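A hedged sketch of how IACA is typically used: the region to analyze is delimited with the IACA_START / IACA_END markers from the iacaMarks.h header that ships with the tool. The stub macros and the kernel below are mine, so the snippet also compiles when IACA is not installed:

```c
#ifdef USE_IACA
#include "iacaMarks.h"  /* ships with IACA; defines the marker macros */
#else
#define IACA_START      /* stubs so this sketch compiles without IACA */
#define IACA_END
#endif

/* Illustrative kernel: IACA analyzes only the code between the markers,
 * reporting per-port pressure and the predicted bottleneck. */
float sum_kernel(const float *a, int n) {
    float s = 0.0f;
    IACA_START
    for (int i = 0; i < n; i++)
        s += a[i];
    IACA_END
    return s;
}
```

One would then compile with -DUSE_IACA and run the analyzer on the resulting object file (e.g. specifying the target µ-arch with an option such as -arch HSW; the exact flags depend on the IACA version).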