Suggestions for Vectorizing Source Code

Table 1 Suggestions for vectorizing source code

Columns: Vectorization Suggestion | Cause of Failure to Vectorize | How to Modify | Example

Extracting loop control variables

The loop bound (data->len) is a structure member. The compiler cannot prove that it stays constant while the loop runs, so it cannot determine the loop end condition and the loop is not automatically vectorized.

Copy the loop bound into a local variable before the loop.

Example code:

for (i = 0; i < data->len; i++) {  
    vecC[i] = vecA[i] + vecB[i];
}

Modify it as follows:

int len = data->len;   // Copy the loop bound into a local variable before the loop.
for (i = 0; i < len; i++) {
    vecC[i] = vecA[i] + vecB[i];
}
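The fragment above can be turned into a complete function, as a minimal sketch. The struct layout, the function name add_vectors, and the const qualifiers are assumptions made for illustration and are not part of the original example.

```c
/* Hypothetical container type; only the len member matters here. */
struct Data {
    int len;
};

/* Adds vecA and vecB element-wise into vecC for data->len elements. */
void add_vectors(const struct Data *data, const int *vecA,
                 const int *vecB, int *vecC)
{
    /* Read the bound once so the trip count is fixed before the loop starts. */
    const int len = data->len;

    for (int i = 0; i < len; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }
}
```

With the bound held in a local variable, stores to vecC can no longer appear to change the loop end condition.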

Modifying the loop control condition

Clang versions earlier than 15 cannot automatically vectorize a loop whose condition uses <=. Clang 15 and later can.

Change the loop condition from <= to < and the loop bound from len to len + 1.

Example code:

for (i = 0; i <= data->len; i++) {                     
  vecC[i] = vecA[i] + vecB[i];                       
}                                                      

Modify it as follows:

// Change the loop condition from <= to <, and the loop length from len to len+1.
for (i = 0; i < data->len + 1; i++) {
  vecC[i] = vecA[i] + vecB[i];
}
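A minimal, self-contained sketch of the rewritten condition follows. The struct layout and the function name add_vectors_inclusive are hypothetical, and the arrays are assumed to hold at least data->len + 1 elements.

```c
/* Hypothetical container type; only the len member matters here. */
struct Data {
    int len;
};

/* Processes indices 0..data->len inclusive, so each array needs len + 1 elements. */
void add_vectors_inclusive(const struct Data *data, const int *vecA,
                           const int *vecB, int *vecC)
{
    /* "i < len + 1" visits the same indices as "i <= len". */
    for (int i = 0; i < data->len + 1; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }
}
```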

Adding a compilation instruction for automatic vectorization

After evaluating the expected benefit of vectorization, the compiler takes a conservative approach and decides not to vectorize the loop automatically.

Add the #pragma clang loop vectorize(enable) directive to override this decision and enable vectorization of the loop.

Example code:

for (i = 0; i < data->len; i++) {                      
  vecC[i] = vecA[i] + vecB[i];                       
}                                                      

Modify it as follows:

// Add the pragma compilation instruction to force automatic code vectorization.
#pragma clang loop vectorize(enable)                   
for (i = 0; i < data->len; i++) {                      
  vecC[i] = vecA[i] + vecB[i];                       
}
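The same directive can be placed in a complete function, as a sketch. The function name add_vectors_hinted is hypothetical, and the __clang__ guard is an addition (not in the original example) so that other compilers skip the Clang-specific pragma.

```c
/* Adds vecA and vecB element-wise into vecC for len elements. */
void add_vectors_hinted(const int *vecA, const int *vecB, int *vecC, int len)
{
    /* The pragma must immediately precede the loop it applies to. */
#if defined(__clang__)
#pragma clang loop vectorize(enable)
#endif
    for (int i = 0; i < len; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }
}
```

The directive changes only the optimization decision. The result of the loop is the same with or without it.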

Specifying that the memory to which the pointer points is not referenced by other pointers

The compiler cannot determine whether the memory that the pointer points to is also referenced by other pointers (pointer aliasing), so it abandons automatic vectorization.

Add the restrict keyword to the pointer variable.

Example code:

void func(int *A, struct Data *data)
{
    data->a = A[0];
    data->b = A[1];
}

Modify it as follows:

// Add restrict to the argument <A>.
void func(int *restrict A, struct Data *data)
{
    data->a = A[0];
    data->b = A[1];
}
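The effect of restrict is easier to see on a loop. The following is a sketch with a hypothetical function add_offset; it assumes that the caller never passes overlapping dst and src ranges, which is the promise that restrict makes to the compiler.

```c
/* Adds a constant to every element of src and stores the result in dst.
 * restrict promises that dst and src do not overlap. */
void add_offset(int *restrict dst, const int *restrict src, int offset, int len)
{
    for (int i = 0; i < len; i++) {
        dst[i] = src[i] + offset;
    }
}
```

If the promise is broken (for example, dst points into the src array), the behavior is undefined, so restrict should be added only when the call sites guarantee non-overlapping memory.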

Keeping data types and lengths consistent

The variable types do not match: a long value is computed and then stored into an int array. The compiler cannot perform automatic vectorization.

Change the variable type from long to int.

Example code:

void func(int *vec) {                                  
  long b = 1;                                        
  int i;                                             
  for (i = 0; i < 64; i++) {                         
      vec[i] = (b << i);                             
  }                                                  
}                                                      

Modify it as follows:

void func(int *vec) {
  // Change the variable type from long to int.
  // Note: in C, shifting an int by a count >= its bit width is undefined behavior.
  int b = 1;
  int i;
  for (i = 0; i < 64; i++) {
      vec[i] = (b << i);
  }
}
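A variant with well-defined shifts can be sketched as follows. It departs from the example above in two ways, both of which are assumptions made for illustration: it uses uint32_t for the value and the array elements, and it limits the loop to 32 iterations so that every shift count is smaller than the type width. The function name fill_powers_of_two is hypothetical.

```c
#include <stdint.h>

/* Fills vec[0..31] with 1 << i, keeping the shifted value and the
 * array elements at the same 32-bit width. */
void fill_powers_of_two(uint32_t *vec)
{
    uint32_t b = 1;  /* same type and width as the array elements */

    for (int i = 0; i < 32; i++) {
        vec[i] = b << i;
    }
}
```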

Splitting the loop

Every iteration writes to the same l-value (sum), which creates a loop-carried dependency. Therefore, the compiler cannot perform vectorization.

Split the loop. Store the result of each iteration in its own array element, then combine the elements after the loop.

Example code:

for( int i = 0; i < 4; i++ ) {                          
......                                                 
  sum += a0 + a1 + a2 + a3;                          
}                                                      

Modify it as follows:

// Declare an array,
// assign one element in each iteration,
// and finally accumulate the elements of the array.
uint32_t sumTmp[4];
for (int i = 0; i < 4; i++) {
......
  sumTmp[i] = a0 + a1 + a2 + a3;
}
sum += sumTmp[0] + sumTmp[1] + sumTmp[2] + sumTmp[3];
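Because the example above elides how a0 to a3 are computed, the following sketch substitutes a hypothetical computation: each iteration sums one row of a 4x4 matrix. The function name sum_rows and the matrix input are assumptions. Only the per-iteration temporary array follows the example.

```c
#include <stdint.h>

/* Sums all elements of a 4x4 matrix. Each iteration writes its own
 * sumTmp element, so no iteration depends on the previous one. */
uint32_t sum_rows(const uint32_t m[4][4])
{
    uint32_t sumTmp[4];

    for (int i = 0; i < 4; i++) {
        sumTmp[i] = m[i][0] + m[i][1] + m[i][2] + m[i][3];
    }

    /* Combine the independent partial results after the loop. */
    return sumTmp[0] + sumTmp[1] + sumTmp[2] + sumTmp[3];
}
```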

Simplifying the code logic in the conditional branch

Conditional branches contain complex operations, which prevents automatic vectorization.

Move the computations out of the conditional branches so that each branch only selects a precomputed value.

Example code:

for( int i = 0; i < len; i++ ) {                       
  if (flag[i])                                       
      vecC[i] = vecA[i] + vecB[i];                   
  else                                               
      vecC[i] = vecA[i] - vecB[i];                   
  sum += vecC[i];                                    
}                                                      

Modify it as follows:

for (int i = 0; i < len; i++) {
  // Extract all expressions outside the branch.
  int ifTrue = vecA[i] + vecB[i];
  int ifFalse = vecA[i] - vecB[i];
  if (flag[i])
      vecC[i] = ifTrue;
  else
      vecC[i] = ifFalse;
  sum += vecC[i];
}
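A complete version of the rewritten loop can be sketched as follows. The function name add_or_sub and its signature are hypothetical, and the conditional expression is a stylistic substitute for the if/else above: both select between two values that were computed unconditionally.

```c
/* Writes vecA[i] + vecB[i] or vecA[i] - vecB[i] into vecC[i] depending on
 * flag[i], and returns the sum of all results. */
int add_or_sub(const int *flag, const int *vecA, const int *vecB,
               int *vecC, int len)
{
    int sum = 0;

    for (int i = 0; i < len; i++) {
        /* Compute both candidates unconditionally. */
        int ifTrue = vecA[i] + vecB[i];
        int ifFalse = vecA[i] - vecB[i];

        /* The branch now only selects a precomputed value. */
        vecC[i] = flag[i] ? ifTrue : ifFalse;
        sum += vecC[i];
    }
    return sum;
}
```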

Changing the data type to unsigned

The accumulator is signed while the value added to it is unsigned (uint16_t). This inconsistency prevents the compiler from vectorizing the loop.

Change the data type of the accumulator from signed to unsigned.

Example code:

int sum = 0;
for (int i = 0; i < len; i++) {
  b0 = abs2(a0 + a4) + abs2(a0 - a4);
  sum += (uint16_t)b0;
}
return sum;

Modify it as follows:

// Change the type of <sum> from signed to unsigned.
unsigned int sum = 0;
for (int i = 0; i < len; i++) {
  b0 = abs2(a0 + a4) + abs2(a0 - a4);
  sum += (uint16_t)b0;
}
return sum;
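The example above depends on abs2 and on variables that are defined elsewhere, so the following sketch uses a simpler hypothetical input: an array of 32-bit values whose low 16 bits are accumulated. The function name sum_low16 is an assumption. The point it illustrates is the same: the accumulator and the added values are both unsigned.

```c
#include <stdint.h>

/* Accumulates the low 16 bits of each value into an unsigned sum. */
unsigned int sum_low16(const uint32_t *vals, int len)
{
    unsigned int sum = 0;  /* unsigned, matching the uint16_t addend */

    for (int i = 0; i < len; i++) {
        sum += (uint16_t)vals[i];
    }
    return sum;
}
```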

Reducing the calculation precision

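The calculation uses a double-precision constant (0.5D0). To preserve that precision, the compiler does not perform automatic vectorization.

Reduce the calculation precision, for example by replacing the double-precision constant 0.5D0 with the single-precision constant 0.5, if the reduced precision is acceptable for the application.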

Example code:

  DO K = 1,KM
     vecC(K) = (vecA(K) + vecB(K + 1))*0.5D0
  END DO

Modify it as follows:

  DO K = 1,KM
     vecC(K) = (vecA(K) + vecB(K + 1))*0.5
  END DO
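A comparable situation exists in C, sketched below as an analogue of the Fortran example rather than a translation of it. The sketch assumes single-precision arrays, which the original does not state, and the function name average_shifted is hypothetical. With float data, the constant 0.5 has type double and promotes the whole expression to double, whereas 0.5f keeps the arithmetic in single precision.

```c
/* Averages vecA[k] with vecB[k + 1]; vecB must hold km + 1 elements. */
void average_shifted(const float *vecA, const float *vecB, float *vecC, int km)
{
    for (int k = 0; k < km; k++) {
        /* 0.5f is a float constant; writing 0.5 would promote to double. */
        vecC[k] = (vecA[k] + vecB[k + 1]) * 0.5f;
    }
}
```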

Splitting the loop

The loop has many statements. The compiler cannot determine the dependencies between the variables and does not perform vectorization.

Split the loop body into multiple loops, each containing fewer statements.

Example code:

  DO A = 1,AM                                            
  DO K = 1,KM                                            
  DO J = 3,JMT                                           
  DO I = 3,IMT                                           
      V1 (I,J,K,A)= V2 (I,J,K,A) + V3 (I,J,K,A)* D       
      U1 (I,J,K,A)= U2 (I,J,K,A) + U3 (I,J,K,A)* D       
  END DO                                                 
  END DO                                                 
  END DO                                                 
  END DO                                                 

Modify it as follows:

  DO A = 1,AM
  DO K = 1,KM                                            
  DO J = 3,JMT                                           
  DO I = 3,IMT                                           
      V1 (I,J,K,A)= V2 (I,J,K,A) + V3 (I,J,K,A)* D       
  END DO                                                 
  DO I = 3,IMT                                           
      U1 (I,J,K,A)= U2 (I,J,K,A) + U3 (I,J,K,A)* D       
  END DO                                                 
  END DO                                                 
  END DO                                                 
  END DO
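The same loop splitting can be sketched in C on one-dimensional arrays. The function name update_fields, the flat array layout, and the float element type are assumptions made for illustration. The split produces the same results as the combined loop only when the output arrays do not overlap the inputs of the other statement.

```c
/* Updates two independent fields in two separate loops, so that each
 * loop body contains a single simple statement. */
void update_fields(float *v1, const float *v2, const float *v3,
                   float *u1, const float *u2, const float *u3,
                   float d, int n)
{
    for (int i = 0; i < n; i++) {
        v1[i] = v2[i] + v3[i] * d;
    }
    for (int i = 0; i < n; i++) {
        u1[i] = u2[i] + u3[i] * d;
    }
}
```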

Reducing function calls in the loop

The loop contains function calls, which prevent the compiler from vectorizing it.

Move the loop-invariant calculations that involve function calls out of the loop.

Example code:

  for( int i = 0; i < len; i++ ) {                       
      delta = -0.5 + (2*m+1)/(2.0*n);                    
      vecA[k].dx = delta*length*cos(theta);              
      vecA[k].dy = delta*length*sin(theta);              
      k++;                                               
  }                                                      
                                                         

Modify it as follows:

  // Extract the math library calls out of the loop.
  double cosNum = cos(theta);
  double sinNum = sin(theta);
  for (int i = 0; i < len; i++) {
      delta = -0.5 + (2*m+1)/(2.0*n);
      vecA[k].dx = delta*length*cosNum;
      vecA[k].dy = delta*length*sinNum;
      k++;
  }
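The hoisting pattern can be sketched as a self-contained function. Several parts are assumptions: unit_x and unit_y are hypothetical pure helpers that stand in for library calls such as cos and sin (so that the sketch has no external dependencies), the struct Offset layout is invented, and the loop index i and len replace the m, n, and k variables, whose updates the original example does not show.

```c
/* Hypothetical element type with the two fields used in the example. */
struct Offset {
    double dx;
    double dy;
};

/* Hypothetical pure helpers standing in for loop-invariant library calls. */
static double unit_x(double theta) { return 1.0 - theta * theta / 2.0; }
static double unit_y(double theta) { return theta; }

void fill_offsets(struct Offset *vecA, int len, double length, double theta)
{
    /* Both calls depend only on theta, so evaluate them once before the loop. */
    const double ux = unit_x(theta);
    const double uy = unit_y(theta);

    for (int i = 0; i < len; i++) {
        double delta = -0.5 + (2 * i + 1) / (2.0 * len);
        vecA[i].dx = delta * length * ux;
        vecA[i].dy = delta * length * uy;
    }
}
```

This transformation is valid only when the hoisted calls have no side effects and their arguments do not change inside the loop.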

Using Fortran keywords

Fortran language features are not fully used.

Use an array assignment instead of a loop to operate on multiple data elements.

Example code:

 do i = 1, maxI
   type1%array1(i) = array3(type1%array2(i))
 enddo

Modify it as follows:

 type1%array1=array3(type1%array2)

Specifying that the memory to which the pointer points is not referenced by other pointers and adding compilation commands

The compiler cannot determine whether the memory that the pointer points to is also referenced by other pointers, so it abandons automatic vectorization.

Add the restrict keyword to the pointer variable and add the pragma compilation directive for automatic vectorization.

Example code:

 for (int i=0;i<len;++i) {                             
     a[i] = b[index[i]];                               
 }                                                     
                                                       

Modify it as follows:

 void func(int *a, int *__restrict__ b,                
               int *index, int len)                    
 {                                                     
     #pragma clang loop vectorize(enable)              
     for (int i=0;i<len;++i) {                         
           a[i] = b[index[i]];                         
     }                                                 
 }