Matricization Check
The tool checks matricizable code snippets and provides modification suggestions.
Introduction
The matricization check tool checks and optimizes code snippets that use the Stencil, general matrix-vector multiplication (GEMV), or fast Fourier transform (FFT) computation patterns. The tool supports C, C++, and Fortran source code. The check is performed on the abstract syntax tree (AST): Clang generates the AST for C and C++ source code, and Fparser generates it for Fortran source code. The tuning process is closely tied to each computing pattern.
- Stencil computation is an important kind of computation widely used in scientific applications, such as partial differential equations, the Gauss–Seidel method, computational fluid dynamics, and earth system simulation. It iteratively updates the values of spatial grid points over multiple time steps according to a given pattern. The fixed pattern in which each point in the spatial grid is updated based on a subset of its neighbors is called a stencil.
- GEMV is a common linear algebra operation that can be highly optimized to take advantage of the parallelism and vectorization instructions of modern computer architectures. It multiplies a matrix by a vector and is widely used as a building block of larger matrix computations.
- FFT is an efficient method for calculating the discrete Fourier transform (DFT). It achieves high calculation efficiency because it completes the calculation within a time complexity of O(n log n), where n is the length of the sequence. It also offers good flexibility, because it supports different decomposition methods and calculation algorithms.
FFT optimization is implemented based on the Fastest Fourier Transform in the West (FFTW) acceleration library. Before a scan, add the corresponding FFTW acceleration library header files (fftw3.h and fftw3-mpi.h) to the directory that contains the files to be checked. The tool can then produce optimization suggestions.
Table 1 describes the 12 domain optimization technologies for C/C++.
| Domain Optimization | Description |
|---|---|
| Equivalent transformation | Vectorization is enabled by converting power expansion to multiplication. |
| Precision-consistent conversion of division to multiplication | The reciprocal calculation is hoisted out of the loop to convert the division into a multiplication of the same precision. |
| Communication hiding | Code snippets before and after a blocking communication call that are irrelevant to the communication variables are identified and moved to the end of the function, and the blocking communication function is changed to a non-blocking communication function, to improve code parallelism. |
| Cut-off radius branch elimination | Conditional branch statements that depend on loop variables in a loop are replaced with conditional expressions, reducing the penalties caused by branch prediction failures. |
| Iterative calculation and lookup splitting | For conditional branch statements in a loop whose code structure is too complex to apply cut-off radius branch elimination directly, example code is provided to split the loop structure based on the conditional branch. |
| Inter-particle forces iteration optimization | For conditional branch statements in a loop, a temporary array stores the branch results. The original loop is rewritten into two: the condition is evaluated in the first loop as a ternary conditional operation, and the second loop executes based on the results stored in the temporary array. |
| Unrolling of intermolecular forces iteration | Calculations of intermolecular force kernel functions are identified and their loop iterations are unrolled to improve instruction-level parallelism. |
| Full unrolling of fixed-length loops | For innermost loops with fixed upper and lower bounds, compilation directive statements are automatically added to fully unroll the loops, reducing loop branch overhead. |
| Adjacency table aggregation | Particles in the same calculation threshold range are aggregated into the same group, so that the same calculation method can be applied to all particles in the group, eliminating the need to determine particle distances. |
| Vectorized calculation of adjacency particle forces | The code in the loop is rewritten by moving, copying, or defining temporary arrays, and directive statements are added so that the compiler can vectorize the loop. |
| Optimized default number of OpenMP threads | The default number of OpenMP threads in the function declaration is changed to the maximum available number of OpenMP threads, enabling multi-thread acceleration. |
| SpMV vectorization | Automatic vectorization is enabled for sparse matrix-vector multiplication (SpMV) in CSR format. |
Table 2 describes the 19 domain optimization technologies for Fortran.
| Domain Optimization | Description |
|---|---|
| Equivalent transformation | Vectorization is enabled by converting power expansion to multiplication. |
| Elimination of redundant common operators | Common subsequences are extracted and stored in temporary arrays. Extracting common subsequences across blocks eliminates redundant calculations. |
| Unit step calculation optimization | The sign function in judgment and assignment statements in a loop is converted to a step function (max/min/merge) call, enabling vectorization. |
| Precision-consistent conversion of division to multiplication | The reciprocal calculation is hoisted out of the loop to convert the division into a multiplication of the same precision. |
| Search algorithm optimization | Code that implements searches is identified and replaced with binary search code to improve search performance. |
| Large data dimension reduction | When n-dimensional arrays are defined in the code but only m (m < n) dimensions are used, memory access can be optimized by rebuilding the arrays as m-dimensional arrays. |
| Communication hiding | Code snippets before and after a blocking communication call that are irrelevant to the communication variables are identified and moved to the end of the function, and the blocking communication function is changed to a non-blocking communication function, to improve code parallelism. |
| Parallelization of reduction calculation | When a reduction calculation exists in a loop, the loop is expanded to reduce the dependency of variables on themselves and increase the degree of parallelism. |
| Directive statement optimization | Directive statements are used to implement vectorization and prefetch optimization for the compiler. |
| Sin/Cos operator fusion | Sin/Cos calculations are combined to reduce function calls and accelerate performance. |
| Exp calculation simplification | Multiplication of multiple exp functions is replaced with addition inside a single exp function. This replacement reduces exp function calls to lessen the calculation workload and accelerate performance. |
| Loop fusion | Adjacent loops are merged to reduce loop overhead, improve data locality, and accelerate performance. |
| Constant calculation elimination | Constant division is converted into equivalent multiplication to improve computing performance. |
| Array dimension transposition | Dimensions of multi-dimensional arrays are transposed to optimize the memory access performance of arrays. |
| Loop interchange–based memory access optimization | To address discontinuous memory access caused by inconsistency between the loop iteration order and the array storage layout, data flow analysis is performed to split the iteration process and reorder the sequence. |
| Collective communication optimization | Point-to-point communication flows in the source code are automatically analyzed and optimized into broadcast collective communication. |
| Vector mask calculation optimization | Vector statements with masks are automatically optimized into Max, Min, and Merge operations. |
| Random number algorithm optimization | The synchronous random number generation algorithm used in parallel computing is optimized into asynchronous random number generation. |
| SpMV vectorization | Automatic vectorization is enabled for sparse matrix-vector multiplication (SpMV) in CSR format. |
Prerequisites
- /opt is the default installation directory of the tool. The following uses this directory as an example. Replace it with the actual directory.
- In the IDE, the tool plugin can scan local projects. If the source code is included in a compressed package, decompress the package and select the decompressed folder.
Procedure
- On the left pane of the page, choose Affinity Analyzer > Matricization Check and click the creation icon to create a task. See Figure 1.
Table 3 Matricization check parameters

| Parameter | Description |
|---|---|
| Task Name | A task name is generated automatically by default and can be modified as required. |
| Source File Path | Set this parameter using either of the following methods:<br>- Enter the absolute path of the source file.<br>- Click Select Folder on the right and select the folder that stores the source file. |
| Files or Folders to Scan | Select one or more files or folders to be scanned. You can select them in a tree structure. |
| Optimization Method | The options are:<br>- Scalable Matrix Extension (SME)<br>- Stencil<br>- GEMV<br>- FFT<br>- Domain optimization, which covers:<br>&nbsp;&nbsp;- Computing optimization: equivalent transformation, elimination of redundant common operators, unit step calculation optimization, precision-consistent conversion of division to multiplication, search algorithm optimization, parallelization of reduction calculation, directive statement optimization, Sin/Cos operator fusion, exp calculation simplification, loop fusion, cut-off radius branch elimination, iterative calculation and lookup splitting, inter-particle forces iteration optimization, unrolling of intermolecular forces iteration, full unrolling of fixed-length loops, adjacency table aggregation, vectorized calculation of adjacency particle forces, optimized default number of OpenMP threads, constant calculation elimination, vector mask calculation optimization, random number algorithm optimization, SpMV vectorization, and loop interchange–based memory access optimization.<br>&nbsp;&nbsp;- Memory access optimization: large data dimension reduction and array dimension transposition.<br>&nbsp;&nbsp;- Communication optimization: communication hiding optimization and collective communication optimization. |
| Compiler Options | Select a compilation method. The options are:<br>- Fill in the compile command.<br>- Upload the compile_commands.json file. For details about how to upload the JSON file, see Generating a JSON File. |
| Build Tool | Select a build tool. The options are:<br>- Make<br>- CMake |
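The compile_commands.json file follows Clang's JSON compilation database format: an array of entries, each recording the working directory, the exact compile command, and the source file. A minimal hypothetical entry (paths, flags, and file names are placeholders, not from the tool) looks like:

```json
[
  {
    "directory": "/home/user/project/build",
    "command": "gcc -O2 -I../include -c ../src/stencil.c -o stencil.o",
    "file": "../src/stencil.c"
  }
]
```

Build systems such as CMake can emit this file automatically (for CMake, via `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON`).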
- Click Check. After the check is complete, the check report page is displayed. See Figure 2. Click the Task Information tab to view the task details.
- You can click the sort icon to sort the source files to be modified by path or by number of recommended items.
- In the upper right corner of the page, click Download Report, and then click Download Report (.csv) or Download Report (.html) to download the analysis report. Alternatively, click the icon next to the task name and then click Download Report (.csv) or Download Report (.html).
- If the scanned file or folder contains FFT technical points, click Download Report (.html) to download the offline report. The downloaded file is a ZIP package. After the package is decompressed, the HTML offline report file and the FFT-SME library package (fftm.zip) are generated.
- If the scanned file or folder contains Stencil technical points, click Download Report (.html) to download the offline report. The downloaded file is a ZIP package. After the package is decompressed, the HTML offline report file and the optimization file package (optimization_files.zip) are generated.
Table 4 Parameters in the report

| Parameter | Description |
|---|---|
| Source File Statistics | Statistics on the scanned source files, covering the following three items. |
| Files to Modify | Total number of files to be modified in the source file path. |
| Code Lines to Modify | Number of code lines to be modified. |
| Total Number of Suggestions | Total number of items recommended for modification. Modify these items to enhance application performance on the Kunpeng platform. |
| Source Files to Modify | Source files to be modified and the suggestions for each file. You can click View Suggested Source Code in the Operation column to quickly go to the source file suggestion page. |
- If the check result indicates that there are source files that need to be modified, click View Suggested Source Code in the Operation column. See Figure 3.
- The tool supports concurrent running of multiple matricization check tasks.
- To cancel a task, click Close during the task running process.
- To modify the configuration of a successful or failed task, click the icon on the right of the task name to restart the task.
- If the check fails, or the check result indicates that no modification is required, an empty report is generated.


