Friday, March 9, 2018

Compiler Optimizations

Peephole optimization
In compiler theory, peephole optimization is a kind of optimization performed over a very small set of instructions in a segment of generated code. The set is called a "peephole" or a "window". It works by recognising sets of instructions that can be replaced by shorter or faster sets of instructions.
  • Null sequences – Delete useless operations.
  • Combine operations – Replace several operations with one equivalent.
  • Algebraic laws – Use algebraic laws to simplify or reorder instructions.
  • Special case instructions – Use instructions designed for special operand cases.
  • Address mode operations – Use address modes to simplify code.
The following Java bytecode
...
aload 1
aload 1
mul
...
can be replaced by
...
aload 1
dup
mul
...

Induction variable
In computer science, an induction variable is a variable that gets increased or decreased by a fixed amount on every iteration of a loop or is a linear function of another induction variable.


For example, in the following loop, i and j are induction variables:
for (i = 0; i < 10; ++i) { j = 17 * i; }

Induction variable substitution
Induction variable substitution is a compiler transformation to recognize variables which can be expressed as functions of the indices of enclosing loops and replace them with expressions involving loop indices.
This transformation makes the relationship between the variables and loop indices explicit, which helps other compiler analysis, such as dependence analysis.


Input code:
int c, i;
c = 10;
for (i = 0; i < 10; i++) {
    c = c + 5; // c is incremented by 5 for each loop iteration
}
Output code:
int c, i;
c = 10;
for (i = 0; i < 10; i++) {
    c = 10 + 5 * (i + 1); // c is explicitly expressed as a function of the loop index
}
Non-linear induction variables
The same optimizations can be applied to induction variables that are not necessarily linear functions of the loop counter; for example, the loop
j = 1;
for (i = 0; i < 10; ++i) {
    j = j << 1;
}
may be converted to
for (i = 0; i < 10; ++i) {
    j = 1 << (i + 1);
}

Strength reduction
In compiler construction, strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations. The classic example of strength reduction converts "strong" multiplications inside a loop into "weaker" additions – something that frequently occurs in array addressing. (Cooper, Simpson & Vick 1995, p. 1)
Examples of strength reduction include:
  • replacing a multiplication within a loop with an addition
  • replacing an exponentiation within a loop with a multiplication
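As a minimal C sketch of the first case (function names are illustrative, not from the cited paper), the multiplication in the loop below can be rewritten as a running addition on a new induction variable t:

```c
/* Before: one multiplication per iteration. */
int sum_before(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += i * 3;
    return sum;
}

/* After strength reduction: t tracks i * 3 and is updated
   with a cheap addition instead of a multiplication. */
int sum_after(int n) {
    int sum = 0;
    int t = 0;  /* invariant: t == i * 3 */
    for (int i = 0; i < n; i++) {
        sum += t;
        t += 3;
    }
    return sum;
}
```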


Loop fission and fusion
In computer science, loop fission (or loop distribution) is a compiler optimization in which a loop is broken into multiple loops over the same index range with each taking only a part of the original loop's body. The goal is to break down a large loop body into smaller ones to achieve better utilization of locality of reference. This optimization is most efficient in multi-core processors that can split a task into multiple tasks for each processor.
Conversely, loop fusion (or loop jamming) is a compiler optimization and loop transformation which replaces multiple loops with a single one. It is possible when two loops iterate over the same range and do not reference each other's data. Loop fusion does not always improve run-time speed. On some architectures, two loops may actually perform better than one loop because, for example, there is increased data locality within each loop.
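A minimal C sketch of loop fusion (functions and data are illustrative): the two loops below iterate over the same range and touch disjoint arrays, so their bodies can be merged; reading the fused version right to left illustrates fission.

```c
/* Before fusion: two loops over the same index range,
   each paying its own loop-control overhead. */
void fill_separate(int *a, int *b, int n) {
    for (int i = 0; i < n; i++) a[i] = i * 2;
    for (int i = 0; i < n; i++) b[i] = i + 1;
}

/* After fusion: one loop body, one set of loop-control instructions.
   Legal because the two bodies reference no common data. */
void fill_fused(int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = i * 2;
        b[i] = i + 1;
    }
}
```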


Loop inversion (exploits the instruction pipeline)
int i, a[100];
i = 0;
while (i < 100) {
    a[i] = 0;
    i++;
}

is equivalent to:
int i, a[100];
i = 0;
if (i < 100) {
    do {
        a[i] = 0;
        i++;
    } while (i < 100);
}

Despite the seemingly greater complexity of the second example, it may actually run faster on modern CPUs because they use an instruction pipeline. By nature, any jump in the code causes a pipeline stall, which is a detriment to performance.

Loop interchange (exploits the cache)
In compiler theory, loop interchange is the process of exchanging the order of two iteration variables used by a nested loop. The variable used in the inner loop switches to the outer loop, and vice versa. It is often done to ensure that the elements of a multi-dimensional array are accessed in the order in which they are present in memory, improving locality of reference.


For example, in the code fragment:
for i from 0 to 10
    for j from 0 to 20
        a[i,j] = i + j
loop interchange would result in:
for j from 0 to 20
    for i from 0 to 10
        a[i,j] = i + j
On occasion, such a transformation may create opportunities for further optimization, such as automatic vectorization of the array assignment.
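The cache effect can be sketched in C (names are illustrative): C stores arrays row-major, so making j the inner loop walks memory contiguously, while the interchanged order strides across rows.

```c
#define N 4
#define M 8

/* Column-major traversal: the inner loop jumps M ints per step,
   touching a different cache line on almost every access. */
void fill_strided(int a[N][M]) {
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = i + j;
}

/* After interchange: the inner loop has unit stride, matching
   C's row-major layout, so cache lines are used fully. */
void fill_contiguous(int a[N][M]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = i + j;
}
```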

Loop-invariant code motion
In computer programming, loop-invariant code consists of statements or expressions (in an imperative programming language) which can be moved outside the body of a loop without affecting the semantics of the program. Loop-invariant code motion (also called hoisting or scalar promotion) is a compiler optimization which performs this movement automatically.


If we consider the following code sample, two optimizations can be easily applied.
for (int i = 0; i < n; i++) {
    x = y + z;
    a[i] = 6 * i + x * x;
}
The calculations x = y + z and x * x can be moved outside the loop since they are loop-invariant, so the optimized code will be something like this:
x = y + z;
t1 = x * x;
for (int i = 0; i < n; i++) {
    a[i] = 6 * i + t1;
}

Invariant code detection (RDA)
Usually a reaching definitions analysis is used to detect whether a statement or expression is loop invariant.
For example, if all reaching definitions for the operands of some simple expression are outside of the loop, the expression can be moved out of the loop.


Loop nest optimization
In computer science, and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations to a loop nest for the purpose of locality optimization, parallelization, or other reduction of loop overhead. One classical usage is to reduce the memory access latency or cache bandwidth required by some common linear algebra algorithms, through cache reuse.
The technique used to produce this optimization is called loop tiling, also known as loop blocking or strip-mine-and-interchange.
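A minimal sketch of loop tiling in C, using an illustrative 8×8 transpose with 4×4 tiles: the i/j iteration space is walked one tile at a time, so each tile's source and destination data can be reused while it is still in cache.

```c
#define N 8  /* matrix dimension (illustrative) */
#define B 4  /* tile (block) size; here B divides N evenly */

/* Tiled transpose: the outer ii/jj loops pick a B x B tile,
   the inner i/j loops walk the tile's interior. */
void transpose_tiled(int src[N][N], int dst[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```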


Loop unrolling (trades space for time)
Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as the space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler.
The goal of loop unwinding is to increase a program's speed by reducing or eliminating instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration; reducing branch penalties; as well as hiding latencies including the delay in reading data from memory. To eliminate this computational overhead, loops can be re-written as a repeated sequence of similar independent statements.
Loop unrolling is also part of certain formal verification techniques, in particular bounded model checking.
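A minimal C sketch of manual unrolling by a factor of four (names are illustrative): the main loop executes one test-and-branch per four elements instead of one per element, and a cleanup loop handles the leftover n % 4 iterations.

```c
/* Before: one loop-control test and branch per element. */
int sum_rolled(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* After unrolling by 4: a quarter of the loop-control overhead. */
int sum_unrolled(const int *a, int n) {
    int s = 0;
    int i = 0;
    for (; i + 3 < n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)  /* cleanup for the remaining n % 4 elements */
        s += a[i];
    return s;
}
```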


Loop splitting (splits a loop to simplify dependencies)
Loop splitting is a compiler optimization technique. It attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range.
Loop peeling is a special case of loop splitting which splits any problematic first (or last) few iterations from the loop and performs them outside of the loop body.
Suppose a loop was written like this:
int p = 10;
for (int i = 0; i < 10; ++i) {
    y[i] = x[i] + x[p];
    p = i;
}
Notice that p = 10 only for the first iteration, and for all other iterations, p = i - 1. A compiler can take advantage of this by unwinding (or "peeling") the first iteration from the loop.
After peeling the first iteration, the code would look like this:
y[0] = x[0] + x[10];
for (int i = 1; i < 10; ++i)
    y[i] = x[i] + x[i - 1];
This equivalent form eliminates the need for the variable p inside the loop body.


Loop unswitching (improves parallelism)
Loop unswitching is a compiler optimization. It moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional. This can improve the parallelization of the loop. Since modern processors can operate fast on vectors this increases the speed.
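A minimal C sketch (names are illustrative): the flag negate never changes inside the loop, so the test can be hoisted out and the loop body duplicated, leaving each branch with a simple loop that is easier to vectorize.

```c
/* Before: the loop-invariant flag is tested on every iteration. */
void scale_before(int *a, int n, int negate) {
    for (int i = 0; i < n; i++) {
        if (negate) a[i] = -a[i];
        else        a[i] = 2 * a[i];
    }
}

/* After unswitching: the test runs once, and each clause of the
   conditional receives its own copy of the loop. */
void scale_after(int *a, int n, int negate) {
    if (negate)
        for (int i = 0; i < n; i++) a[i] = -a[i];
    else
        for (int i = 0; i < n; i++) a[i] = 2 * a[i];
}
```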


Software pipelining
In computer science, software pipelining is a technique used to optimize loops, in a manner that parallels hardware pipelining. Software pipelining is a type of out-of-order execution, except that the reordering is done by a compiler (or in the case of hand written assembly code, by the programmer) instead of the processor. Some computer architectures have explicit support for software pipelining, notably Intel's IA-64 architecture.
It is important to distinguish software pipelining, which is a target-code technique for overlapping loop iterations, from modulo scheduling, currently the most effective known compiler technique for generating software-pipelined loops. Software pipelining has been known to assembly language programmers of machines with instruction-level parallelism since such architectures existed. Effective compiler generation of such code dates to the invention of modulo scheduling by Rau and Glaeser. Lam showed that special hardware is unnecessary for effective modulo scheduling. Her technique, modulo variable expansion, is widely used in practice. Gao et al. formulated optimal software pipelining in integer linear programming, culminating in validation of advanced heuristics in an evaluation paper. This paper has a good set of references on the topic.


Common subexpression elimination
In compiler theory, common subexpression elimination (CSE) is a compiler optimization that searches for instances of identical expressions (i.e., they all evaluate to the same value), and analyzes whether it is worthwhile replacing them with a single variable holding the computed value.
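A minimal C sketch of the idea (names are illustrative): the subexpression a + b appears twice and always evaluates to the same value, so it can be computed once into a temporary and reused.

```c
/* Before: (a + b) is evaluated twice. */
int cse_before(int a, int b, int c, int d) {
    return (a + b) * c + (a + b) * d;
}

/* After CSE: a single evaluation of the shared subexpression. */
int cse_after(int a, int b, int c, int d) {
    int t = a + b;  /* holds the common subexpression's value */
    return t * c + t * d;
}
```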


Constant folding (RDA)
Constant folding and constant propagation are related compiler optimizations used by many modern compilers. An advanced form of constant propagation known as sparse conditional constant propagation can more accurately propagate constants and simultaneously remove dead code.
Constant propagation is implemented in compilers using reaching definition analysis results. If all a variable's reaching definitions are the same assignment which assigns a same constant to the variable, then the variable has a constant value and can be replaced with the constant.
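A minimal C sketch (names are illustrative) of what propagation plus folding can achieve together; the second function shows the form a compiler could reduce the first to.

```c
/* Source as written: x is assigned a known constant. */
int fold_before(void) {
    int x = 14;
    int y = 7 - x / 2;        /* propagation substitutes 14 for x,
                                 folding reduces y to 0             */
    return y * (28 / x + 2);  /* 0 * anything folds to 0            */
}

/* The form the compiler can reduce fold_before to: */
int fold_after(void) {
    return 0;
}
```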


Dead store
In computer programming, a local variable that is assigned a value but is not read by any subsequent instruction is referred to as a dead store. Dead stores waste processor time and memory, and may be detected through the use of static program analysis, and removed by an optimizing compiler.
If the purpose of a store is intentionally to overwrite data, for example when a password is being removed from memory, dead store optimizations can cause the write not to happen, leading to a security issue. Some system libraries have specific functions designed to avoid such dangerous optimizations, e.g. explicit_bzero on OpenBSD.
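A minimal C illustration (the function name is made up): the first value stored to x is never read before it is overwritten, so the store is dead and an optimizing compiler may delete it.

```c
int dead_store_demo(int a) {
    int x = a * 100;  /* dead store: overwritten below without a read */
    x = a + 1;        /* only this value of x is ever read */
    return x;
}
```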


Live variable analysis
In compiler theory, live variable analysis (or simply liveness analysis) is a classic data-flow analysis performed by compilers to calculate for each program point the variables that may be potentially read before their next write, that is, the variables that are live at the exit from each program point.
Stated simply: a variable is live if it holds a value that may be needed in the future.
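A minimal C illustration with the live set annotated after each statement (names are made up); note how liveness analysis also exposes the dead store on the third line.

```c
int liveness_demo(void) {
    int a = 3;  /* live: {a}                                */
    int b = 5;  /* live: {a, b}                             */
    int d = 4;  /* live: {a, b} -- this 4 is never read, so
                   the store is dead                        */
    d = a + b;  /* live: {d} -- last use of a and b         */
    return d;   /* d's final use                            */
}
```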


Available expressions (CSE)
In the field of compiler optimizations, available expressions is an analysis algorithm that determines, for each point in the program, the set of expressions that need not be recomputed. Those expressions are said to be available at such a point. For an expression to be available at a program point, its operands must not be modified on any path from an occurrence of that expression to the program point.
The analysis is an example of a forward data flow analysis problem. A set of available expressions is maintained. Each statement is analysed to see whether it changes the operands of one or more available expressions. This yields sets of available expressions at the end of each basic block, known as the outset in data flow analysis terms. An expression is available at the start of a basic block if it is available at the end of each of the basic block's predecessors. This gives a set of equations in terms of available sets, which can be solved by an iterative algorithm.
Available expression analysis is used to do global common subexpression elimination (CSE). If an expression is available at a point, there is no need to re-evaluate it.


Value numbering
Value numbering is a technique of determining when two computations in a program are equivalent and eliminating one of them with a semantics-preserving optimization.


Sparse conditional constant propagation
In computer science, sparse conditional constant propagation is an optimization frequently applied in compilers after conversion to static single assignment form (SSA). It simultaneously removes some kinds of dead code and propagates constants throughout a program. Moreover, it can find more constant values, and thus more opportunities for improvement, than separately applying dead code elimination and constant propagation in any order or any number of repetitions.
The algorithm operates by performing abstract interpretation of the code in SSA form. During abstract interpretation, it typically uses a flat lattice of constants for values and a global environment mapping SSA variables to values in this lattice. The crux of the algorithm comes in how it handles the interpretation of branch instructions. When encountered, the condition for a branch is evaluated as best possible given the precision of the abstract values bound to variables in the condition. It may be the case that the values are perfectly precise (neither top nor bottom) and hence, abstract execution can decide in which direction to branch. If the values are not constant, or a variable in the condition is undefined, then both branch directions must be taken to remain conservative.
Upon completion of the abstract interpretation, instructions which were never reached are marked as dead code. SSA variables found to have constant values may then be inlined at (propagated to) their point of use.
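A minimal C sketch of the effect (names are illustrative; the SSA form is left implicit in the comments): x is the constant 4, so abstract interpretation evaluates the branch condition precisely and marks the else-arm unreachable.

```c
int sccp_demo(void) {
    int x = 4;
    int y;
    if (x > 1)
        y = x * 2;  /* reached: y is the constant 8          */
    else
        y = -1;     /* never reached: removable as dead code */
    return y;       /* the function folds to: return 8;      */
}
```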

