Loop Optimization

Principles

Loop optimization is to optimize the code of the loop used in the program. It helps to utilize the processor's computing units, improve the scheduling efficiency of the instruction pipeline, and increase the cache hit ratio. There are many methods for loop optimization, such as unrolling, fusion, fission, interchange, and tiling. You can select a method based on your source code.

Modification Method

Loop unrolling
Loop unrolling is to duplicate the loop body for multiple times to decrease the number of operations the loop condition is tested. Since the Kunpeng processor has multiple instruction execution units, you can increase the computing density to improve scheduling efficiency of the instruction pipeline. The following shows the code comparison before and after loop unrolling.

Before:
```
for (int i = 0; i < ARRAYLEN; i++) {
    arrayC[i] = arrayA[i] * arrayB[i];
}
```
After:
```
int i;
for (i = 0; i < ARRAYLEN - (ARRAYLEN % 4); i+=4) {
    arrayC[i] = arrayA[i] * arrayB[i];
    arrayC[i + 1] = arrayA[i + 1] * arrayB[i + 1];
    arrayC[i + 2] = arrayA[i + 2] * arrayB[i + 2];
    arrayC[i + 3] = arrayA[i + 3] * arrayB[i + 3];
}
for (; i < ARRAYLEN; i++) {
    arrayC[i] = arrayA[i] * arrayB[i];
}
```

Loop fusion
Loop fusion combines the bodies of adjacent loops to reduce operations on loop variables. Iteration variables are used in the loop body to improve the cache usage. The fusion of small loops also helps increase the probability of out-of-order execution by the Kunpeng processor and improve the scheduling efficiency of the instruction pipeline.
Before:
```
for (int i = 0; i < ARRAYLEN; i++) {
    arrayA[i] = arrayB[i] + 3;
}
for (int i = 0; i < ARRAYLEN; i++) {
    arrayC[i] = arrayA[i] + 5;
}
```
After:
```
for (int i = 0; i < ARRAYLEN; i++) {
    arrayA[i] = arrayB[i] + 3;
    arrayC[i] = arrayA[i] + 5;
}
```

Loop fission

Loop fission attempts to break complex or computing-intensive loop into multiple loops to improve register utilization.

Before:

for (int i = 1; i < ARRAYLEN; i++) {
    arrayA[i] = arrayA[i-1] + 2;
    arrayC[i] = arrayB[i] + 6;
}

After:

for (int i = 1; i < ARRAYLEN; i++) {
    arrayA[i] = arrayA[i-1] + 2;
}
for (int i = 1; i < ARRAYLEN; i++) {
    arrayC[i] = arrayB[i] + 6;
}

Loop interchange

Loop interchange is to exchange the nested sequence of loops so that memory access can meet the principle of locality to improve the cache hit ratio.

Before:

for (int j = 0; j < ARRAYLEN; j++) {
    for (int i = 0; i < ARRAYLEN; i++) {
        matrixC[i][j] = matrixA[i][j] + 3;
    }
}

After:

for (int i = 0; i < ARRAYLEN; i++) {
    for (int j = 0; j < ARRAYLEN; j++) {
        matrixC[i][j] = matrixA[i][j] + 3;
    }
}

Loop tiling

Loop tiling reorganizes a loop into a group of nested loops, each internal loop responsible for a small data block. It helps to utilize the existing data in the cache and improve the cache hit ratio. Generally, this method is used for large data sets.

Before:

for (int i = 0; i < n * ARRAYLEN; i++) {
    for (int j = 0; j < n * ARRAYLEN; j++) {
        matrixA[i][j] = matrixA[i][j] + matrixB[i][j];
    }
}

After:

// Assume that a proper tile block size is defined, which is named TILE_SIZE in the following. You can adjust the size based on the actual cache size or other factors.
#define TILE_SIZE 16 
for (int ii = 0; ii < n * ARRAYLEN; ii += TILE_SIZE) {
    for (int jj = 0; jj < n * ARRAYLEN; jj += TILE_SIZE) {
        for (int i = ii; i < ii + TILE_SIZE && i < n * ARRAYLEN; i++) {
            for (int j = jj; j < jj + TILE_SIZE && j < n * ARRAYLEN; j++) {
                matrixA[i][j] = matrixA[i][j] + matrixB[i][j];
            }
        }
    }
}

Parent topic: Optimization of Hot Functions