Optimizing Critical Paths in Loops
Here are some examples of not-so-obvious optimizations.
1. Example: Evaluating Polynomial
See: Computer Systems - A Programmer’s Perspective [Problem 5.5]
Take the following formula for evaluating a polynomial:
\begin{equation} a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n \end{equation}

and the corresponding code:
double poly(double a[], double x, long degree) {
    long i;
    double result = a[0];
    double x_pwr = x;
    double prod;
    for (i = 1; i <= degree; i++) {
        prod = a[i] * x_pwr;
        result += prod;
        x_pwr *= x;
    }
    return result;
}
Now consider another formula (called Horner's rule) for evaluating the same polynomial:
\begin{equation} a_0 + x (a_1 + x (a_2 + \dots + x (a_{n-1} + x a_n))) \end{equation}

and the corresponding code:
double poly(double a[], double x, long degree) {
    long i;
    double result = a[degree];
    double prod;
    for (i = degree - 1; i >= 0; i--) {
        prod = x * result;
        result = a[i] + prod;
    }
    return result;
}
The first version performs 2 floating-point multiplications and 1 floating-point addition per iteration, while the second performs just 1 multiplication and 1 addition. Yet the first version runs faster than the second on a CPU with two floating-point multiplication units. Why?
Let's rearrange the code inside the loop for the first code block:
prod = a[i] * x_pwr; x_pwr *= x;
result += prod;
Here the two products can be computed simultaneously, and the result of the addition is not required to compute the products of the next iteration. Thus the addition from the previous iteration and the two products of the current iteration can run in parallel. This means only one product, x_pwr *= x, is on the critical path. The execution can be pictured as below, where the operations on each line are run in parallel by the CPU.
prod_1 = a[1] * x_pwr_1;   x_pwr_2 = x_pwr_1 * x;   incr_cmp_bleq;
prod_2 = a[2] * x_pwr_2;   x_pwr_3 = x_pwr_2 * x;   result_2 = result_1 + prod_1;   incr_cmp_bleq;
prod_3 = a[3] * x_pwr_3;   x_pwr_4 = x_pwr_3 * x;   result_3 = result_2 + prod_2;
...
In contrast, in the second version, the product, then the sum, and then the product and sum of the next iteration are each dependent on the previous operation. Because of this true data dependency, there is no opportunity for instruction-level parallelism: both the product prod = x * result
and the sum result = a[i] + prod
fall on the critical path.
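To make this concrete, suppose (an illustrative assumption, roughly matching the latencies CS:APP quotes for a modern Intel core) that a floating-point multiply takes 5 cycles and a floating-point add takes 3. The loop-carried critical paths then bound the throughput of each version:

\begin{equation} T_{\text{first}} \approx 5 \text{ cycles/iteration}, \qquad T_{\text{Horner}} \approx 5 + 3 = 8 \text{ cycles/iteration} \end{equation}

So under these assumed latencies the naive version can be faster despite doing more work per iteration, as long as its extra multiply and add can be issued to otherwise idle functional units.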
2. Reassociation Transformation
Compare the following two snippets:
for (i=0; i<limit; i+=2) {
acc = acc * (data[i] * data[i+1]);
}
for (i=0; i<limit; i+=2) {
acc = (acc * data[i]) * data[i+1];
}
Both snippets look the same, differing only in the associativity of the multiplication. The first is faster than the second because it can compute the product of data[i]
and data[i+1]
in parallel with the multiplication by acc
from the previous iteration. Thus, when pipelined, only one product is on the critical path.
In contrast, for the second code, the value of acc
is required at the beginning of each iteration, but that value is computed only at the end of the previous iteration. The previous iteration must therefore fully complete before the next one can start, so all operations of an iteration (i.e. both products) are on the critical path.