Optimizing Critical Paths in Loops
Here are some examples of not-so-obvious optimizations.
1. Example: Evaluating Polynomial
See: Computer Systems - A Programmer’s Perspective [Problem 5.5]
Take the following formula for evaluating a polynomial:
\begin{equation} a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n \end{equation}

and the corresponding code:
double poly(double a[], double x, long degree) {
    long i;
    double result = a[0];
    double x_pwr = x;
    double prod;
    for (i = 1; i <= degree; i++) {
        prod = a[i] * x_pwr;
        result += prod;
        x_pwr *= x;
    }
    return result;
}
Now consider another formula (called Horner's rule) for evaluating the same polynomial:
\begin{equation} a_0 + x (a_1 + x (a_2 + \dots + x (a_{n-1} + x a_n))) \end{equation}

and the corresponding code:
double poly(double a[], double x, long degree) {
    long i;
    double result = a[degree];
    double prod;
    for (i = degree - 1; i >= 0; i--) {
        prod = x * result;
        result = a[i] + prod;
    }
    return result;
}
The first version performs 2 floating-point multiplications and 1 floating-point addition per iteration, while the second performs just 1 multiplication and 1 addition. Yet the first version runs faster than the second on a CPU with two floating-point multiplication units. Why?
Let's rearrange the code inside the loop for the first code block:
prod = a[i] * x_pwr; x_pwr *= x;
result += prod;
Here the two products can be computed simultaneously, and the result of the addition is not required to compute the products of the next iteration. Thus the addition from the previous iteration and the two products of the current iteration can run in parallel. This means only one product, x_pwr *= x, is on the critical path. The execution can be pictured as below, where the operations on each line are run in parallel by the CPU.
prod_1 = a[1] * x_pwr_1;   x_pwr_2 = x_pwr_1 * x;   incr_cmp_bleq;
prod_2 = a[2] * x_pwr_2;   x_pwr_3 = x_pwr_2 * x;   result_2 = result_1 + prod_1;   incr_cmp_bleq;
prod_3 = a[3] * x_pwr_3;   x_pwr_4 = x_pwr_3 * x;   result_3 = result_2 + prod_2;
...
In contrast, in the second version, the product, then the sum, and then the product and sum of the next iteration are each dependent on the previous operation. Because of this true data dependency, there is no opportunity for instruction-level parallelism: both the product prod = x * result
and the sum result = a[i] + prod
fall on the critical path.
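To make this concrete, suppose (an illustrative assumption, roughly matching the latencies CS:APP quotes for a modern Intel core) that a floating-point multiply takes 5 cycles and a floating-point add takes 3. The loop-carried critical paths then bound the throughput of each version:

\begin{equation} T_{\text{first}} \approx 5 \text{ cycles/iteration}, \qquad T_{\text{Horner}} \approx 5 + 3 = 8 \text{ cycles/iteration} \end{equation}

So under these assumed latencies the naive version can be faster despite doing more work per iteration, as long as its extra multiply and add can be issued to otherwise idle functional units.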
2. Reassociation Transformation
Compare the following two snippets:
for (i=0; i<limit; i+=2) {
acc = acc * (data[i] * data[i+1]);
}
for (i=0; i<limit; i+=2) {
acc = (acc * data[i]) * data[i+1];
}
Both snippets look the same, differing only in the associativity of the multiplication. The first is faster than the second because it can compute the product of data[i]
and data[i+1]
in parallel with the multiplication by acc
from the previous iteration. Thus, when pipelined, only one product is on the critical path.
In contrast, for the second code, the value of acc
is required at the beginning of each iteration, but that value is computed only at the end of the previous iteration. The previous iteration must therefore fully complete before the next one can start, so all operations of an iteration (i.e. both products) are on the critical path.