Checking Matricization

The matricization check function checks matricizable code fragments and provides modification suggestions.

Introduction

The matricization check tool checks and optimizes code snippets that use Stencil, general matrix-vector multiplication (GEMV), or fast Fourier transform (FFT) computation. It supports C, C++, and Fortran source code. The check is performed on the abstract syntax tree (AST), which is generated by Clang for C and C++ source code and by Fparser for Fortran source code. The tuning process depends on the computing mode:

  • Stencil computation is an important kind of computation widely used in scientific applications, such as partial differential equations, the Gauss–Seidel method, computational fluid dynamics, and earth system simulation. It iteratively updates the values of the spatial grid points over multiple time steps according to a given pattern. The fixed pattern in which each point in the spatial grid is updated based on a subset of its neighbors is called a stencil.
  • GEMV multiplies a matrix by a vector. It is a common linear algebra operation, often used as a building block of matrix multiplication, and it can be highly optimized to exploit the parallelism and vector instructions of modern computer architectures.
  • FFT is an efficient method for calculating the discrete Fourier transform (DFT). It completes the calculation in O(n log n) time, where n is the length of the sequence. It is also flexible because it supports different decomposition methods and calculation algorithms.

    FFT optimization is based on the Fastest Fourier Transform in the West (FFTW) acceleration library. For a scan, add the FFTW header files (fftw3.h and fftw3-mpi.h) to the directory that contains the file to be identified. The scan can then produce optimization suggestions.
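
To make the first two computing modes concrete, the following minimal Python sketch shows one stencil time step and one GEMV operation. It only illustrates the computation patterns and is not the detection logic or output of the tool. The function names stencil_step and gemv, the three-point averaging pattern, the fixed boundary rule, and the sample data are assumptions chosen for this example. The alpha and beta scaling factors follow the usual BLAS form of GEMV.

```python
def stencil_step(grid):
    """Apply one time step of a 1D three-point stencil.

    Each interior point becomes the average of itself and its two
    neighbors. Boundary points are kept fixed (an assumed boundary rule).
    """
    new = list(grid)
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def gemv(alpha, a, x, beta, y):
    """Return alpha * A * x + beta * y for a row-major list-of-lists A."""
    return [
        alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
        for row, y_i in zip(a, y)
    ]


# Sample data (hypothetical): a single spike spreads to its neighbors.
grid = [0.0, 0.0, 3.0, 0.0, 0.0]
after_one_step = stencil_step(grid)

# A 2x2 matrix multiplied by a vector of ones, with beta = 0.
result = gemv(1.0, [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 0.0, [0.0, 0.0])
```

In both functions every output element is computed by the same fixed arithmetic pattern over neighboring or row data, which is the kind of regular structure that matricization targets.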

Table 1 describes the 12 domain optimization technologies for C/C++.

Table 1 Domain optimization technologies for C/C++

| Domain Optimization | Description |
| --- | --- |
| Equivalent transformation | Power operations are expanded into multiplications, which enables vectorization. |
| Precision-consistent conversion of division to multiplication | The reciprocal calculation is hoisted out of the loop so that division inside the loop becomes multiplication of the same precision. |
| Communication hiding | Code snippets before and after a blocking communication function call that do not depend on the communication variables are identified and moved to the end of the function, and the blocking communication function is changed to a non-blocking one, to improve code parallelism. |
| Cut-off radius branch elimination | Conditional branch statements in a loop that depend on loop variables are replaced with conditional expressions. This reduces the penalties caused by branch mispredictions. |
| Iterative calculation and lookup splitting | When a loop that contains conditional branch statements has a complex code structure, cut-off radius branch elimination is difficult to apply directly. In this case, example code is provided for splitting the loop structure based on the conditional branch. |
| Inter-particle forces iteration optimization | For conditional branch statements in a loop, a temporary array is introduced to store the result of the conditional branch. The original loop is split into two: the condition is moved into the first loop and rewritten as a ternary conditional operation, and the second loop runs based on the results stored in the temporary array. |
| Unrolling of intermolecular forces iteration | Calculations of intermolecular force kernel functions are identified and loop iterations are unrolled to improve instruction-level parallelism. |
| Full unrolling of fixed-length loops | For innermost loops with fixed upper and lower bounds, compiler directives are automatically added to fully unroll the loops, reducing loop branch overhead. |
| Adjacency table aggregation | Particles in the same calculation threshold range are aggregated into one group so that the same calculation method applies to all particles in the group, eliminating the need to check particle distances. |
| Vectorized calculation of adjacency particle forces | The code in a loop is rewritten by moving, copying, or defining temporary arrays, and directive statements are added to the loop so that the compiler vectorizes it. |
| Optimized default number of OpenMP threads | The default number of OpenMP threads in the function declaration is changed to the maximum number of OpenMP threads available by default. This optimization enables multi-thread acceleration. |
| SpMV vectorization | Automatic vectorization is enabled for sparse matrix-vector multiplication (SpMV) in CSR format. |
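
As an illustration of the loop-splitting idea behind inter-particle forces iteration optimization in Table 1, the following sketch shows the shape of a loop before and after the rewrite. It is written in Python for brevity and is not output of the tool, which targets C/C++ source code. The cut-off test, the 1/r² force expression, and all names are assumptions made for this example.

```python
def forces_branchy(distances, cutoff):
    """Original shape: the cut-off test sits inside the compute loop."""
    out = []
    for r in distances:
        if r < cutoff:
            out.append(1.0 / (r * r))
        else:
            out.append(0.0)
    return out


def forces_split(distances, cutoff):
    """Rewritten shape: two loops linked by a temporary array."""
    # Loop 1: evaluate the condition as a ternary expression and store it.
    mask = [1.0 if r < cutoff else 0.0 for r in distances]
    # Loop 2: branch-free arithmetic driven by the stored results.
    return [m * (1.0 / (r * r)) for m, r in zip(mask, distances)]


# Hypothetical sample distances, all nonzero.
distances = [0.5, 1.0, 2.0, 4.0]
branchy = forces_branchy(distances, 2.5)
split = forces_split(distances, 2.5)
```

Note that the second loop evaluates the force expression for every element, including those outside the cut-off, so the expression must be safe to evaluate for all inputs (here, no zero distances).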

Table 2 describes the 19 domain optimization technologies for Fortran.

Table 2 Domain optimization technologies for Fortran

| Domain Optimization | Description |
| --- | --- |
| Equivalent transformation | Power operations are expanded into multiplications, which enables vectorization. |
| Elimination of redundant common operators | Common subsequences are extracted and stored in temporary arrays. Extracting them across blocks eliminates redundant calculations. |
| Unit step calculation optimization | The sign function in conditional and assignment statements within a loop is converted to a step function call (max/min/merge), which enables vectorization. |
| Precision-consistent conversion of division to multiplication | The reciprocal calculation is hoisted out of the loop so that division inside the loop becomes multiplication of the same precision. |
| Search algorithm optimization | Code that implements searches is identified and replaced with a binary search implementation to improve search performance. |
| Large data dimension reduction | When the code defines n-dimensional arrays but uses only m dimensions (m < n), memory access can be optimized by rebuilding the arrays as m-dimensional arrays. |
| Communication hiding | Code snippets before and after a blocking communication function call that do not depend on the communication variables are identified and moved to the end of the function, and the blocking communication function is changed to a non-blocking one, to improve code parallelism. |
| Parallelization of reduction calculation | When a loop contains a reduction calculation, the loop is unrolled to reduce the dependency of variables on themselves and increase the degree of parallelism. |
| Directive statement optimization | Directive statements are added so that the compiler performs vectorization and prefetch optimization. |
| Sin/Cos operator fusion | Sin and Cos calculations are combined to reduce function calls and improve performance. |
| Exp calculation simplification | The product of multiple exp function calls is replaced with a single exp call on the sum of their arguments. This reduces the number of exp calls, which lowers the calculation workload and improves performance. |
| Loop fusion | Adjacent loops are merged to reduce loop overhead, improve data locality, and accelerate performance. |
| Constant calculation elimination | Division by a constant is converted into an equivalent multiplication to improve computing performance. |
| Array dimension transposition | Dimensions of multi-dimensional arrays are transposed to optimize array memory access performance. |
| Loop interchange-based memory access optimization | To address non-contiguous memory access caused by a mismatch between the loop iteration order and the array storage layout, data flow analysis is used to split the iteration process and reorder the iteration sequence. |
| Collective communication optimization | Point-to-point communication flows in the source code are automatically analyzed and optimized into broadcast collective communication. |
| Vector mask calculation optimization | Masked vector statements are automatically optimized into Max, Min, and Merge operations. |
| Random number algorithm optimization | The synchronous random number generation algorithm used in parallel computing is replaced with asynchronous random number generation. |
| SpMV vectorization | Automatic vectorization is enabled for sparse matrix-vector multiplication (SpMV) in CSR format. |
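
The search algorithm optimization in Table 2 can be pictured with the following sketch, which contrasts a linear scan with a binary search. It is written in Python for brevity and is not code generated by the tool, which targets Fortran source code here. The function names and sample data are hypothetical, and the sketch assumes the searched data is sorted in ascending order, which binary search requires.

```python
from bisect import bisect_left


def linear_search(sorted_values, target):
    """Original shape: scan every element until the target is found."""
    for index, value in enumerate(sorted_values):
        if value == target:
            return index
    return -1


def binary_search(sorted_values, target):
    """Replacement shape: halve the search interval at each step."""
    index = bisect_left(sorted_values, target)
    if index < len(sorted_values) and sorted_values[index] == target:
        return index
    return -1


# Hypothetical sorted lookup table.
levels = [1, 3, 5, 7, 11, 13]
found = binary_search(levels, 7)
missing = binary_search(levels, 4)
```

Both functions return the same index for every target, but the binary search needs on the order of log n comparisons instead of n.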

Command Function

Checks matricizable code snippets.

Syntax

devkit advisor matrix-check [-h | --help] {-i INPUT_PATH | --input INPUT_PATH} [-s SCAN_DIR | --scan-dir SCAN_DIR] [-b {make,cmake} | --build-tool {make,cmake}] [-c COMMAND | --cmd COMMAND] [-j COMPILE_JSON_PATH | --compile-command-json COMPILE_JSON_PATH] [-o OUTPUT_PATH | --output OUTPUT_PATH] [-r {all,html,csv} | --report-type {all,html,csv}] {-p {sme,domain} | --optimization {sme,domain}} [-m {compute,memory_access,communication} | --module {compute,memory_access,communication}] [-f CONFIGURE_FILE | --configure-file CONFIGURE_FILE] [-l {0,1,2,3} | --log-level {0,1,2,3}] [--set-timeout TIMEOUT]

Parameter Description

Table 3 Parameter description

Parameter

Option

Description

-h/--help

-

Obtains help information. This parameter is optional.

-i/--input

-

Absolute path to the source code folder to be scanned. This parameter is mandatory.

-s/--scan-dir

-

Path, relative to the source code folder, of the file or folder to be scanned. Separate multiple paths with spaces. This parameter is optional.

-b/--build-tool

make/cmake

Build tool, which defaults to make. Set either -b or -j but not both. This parameter is optional.

-c/--cmd

-

Source code build command, which defaults to make. This parameter is optional. If there are multiple build commands, separate them with semicolons (;) and enclose them in single quotation marks (') or double quotation marks ("). If a command contains spaces, enclose it in single or double quotation marks. Set either -c or -j but not both.

Example: "mkdir build;cd build;cmake ..;make"

NOTE:

The build command passed to the command line tool does not support variable setting or environment variable export. For example, the following commands are not supported:

"CFLAGS='-O0 -g';make" or "export CFLAGS='-O0 -g';make"

-j/--compile-command-json

-

Path to the compile_commands.json file. This parameter is optional. For details about how to generate a JSON file, see Generating a JSON File.

Set either -b/-c or -j but not both.

-o/--output

-

Path for storing scan reports. By default, scan reports are stored in the current execution path. Report names use the format Module-name_Timestamp. This parameter is optional.

-r/--report-type

all/html/csv

Scan report format, which defaults to all. This parameter is optional.

  • all: generates reports in HTML and CSV formats.
  • html: generates a report only in HTML format.
  • csv: generates a report only in CSV format.

-p/--optimization

sme/domain

Matricization optimization method. This parameter is mandatory.

  • sme: Scalable Matrix Extension (SME)-based matricization, covering Stencil, GEMV, and FFT.
  • domain: domain-specific optimization, which must be used together with the -m option.

-m/--module

compute/memory_access/communication

Domain-specific optimization method. This parameter is optional.

  • compute: computing optimization, covering equivalent transformation, elimination of redundant common operators, unit step calculation optimization, precision-consistent conversion of division to multiplication, search algorithm optimization, parallelization of reduction calculation, directive statement optimization, Sin/Cos operator fusion, exp calculation simplification, loop fusion, cut-off radius branch elimination, iterative calculation and lookup splitting, inter-particle forces iteration optimization, unrolling of intermolecular forces iteration, full unrolling of fixed-length loops, adjacency table aggregation, vectorized calculation of adjacency particle forces, optimized default number of OpenMP threads, constant calculation elimination, vector mask calculation optimization, random number algorithm optimization, SpMV vectorization, and loop interchange-based memory access optimization.
  • memory_access: memory access optimization, covering large data dimension reduction and array dimension transposition.
  • communication: communication optimization, covering communication hiding and collective communication optimization.

If you set -p to domain, specify at least one domain-specific optimization method.

-f/--configure-file

-

Path to a configuration file that specifies the source code line number ranges to be optimized. The optimized file is generated according to these ranges. This parameter is optional. The value must be an absolute path. For details about the format, see Configuration File Use (-f/--configure-file).

-l/--log-level

0/1/2/3

Log level, which defaults to 1. This parameter is optional.
  • 0: DEBUG
  • 1: INFO
  • 2: WARNING
  • 3: ERROR

--set-timeout

-

Timeout interval of a task, in minutes. If the execution duration exceeds the timeout interval, the task exits. This parameter is optional. By default, there is no timeout interval. The task will be executed until it is complete.
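
Some of the constraints in Table 3 can be checked in a script before the tool is invoked. The following Python sketch assembles an argument list and enforces three of the documented rules: the input path is absolute, -p domain is used together with -m, and -j is not combined with -b or -c. The helper build_matrix_check_argv is hypothetical and not part of the tool. Joining multiple -p and -m values with commas mirrors the command in the Example section and is an assumption about how the tool accepts them.

```python
import shlex


def build_matrix_check_argv(input_path, optimization, modules=(),
                            scan_dirs=(), build_tool=None, build_cmd=None,
                            compile_json=None, output=None):
    """Assemble an argument list for `devkit advisor matrix-check`."""
    if not input_path.startswith("/"):
        raise ValueError("-i/--input must be an absolute path")
    if "domain" in optimization and not modules:
        raise ValueError("-p domain needs at least one -m module")
    if compile_json and (build_tool or build_cmd):
        raise ValueError("set either -b/-c or -j, not both")

    argv = ["devkit", "advisor", "matrix-check", "-i", input_path]
    if scan_dirs:
        argv += ["-s", *scan_dirs]  # multiple paths are space-separated
    if build_cmd:
        argv += ["-c", build_cmd]
    if build_tool:
        argv += ["-b", build_tool]
    if compile_json:
        argv += ["-j", compile_json]
    if output:
        argv += ["-o", output]
    # Comma-joined values follow the documented example command (assumption).
    argv += ["-p", ",".join(optimization)]
    if modules:
        argv += ["-m", ",".join(modules)]
    return argv


argv = build_matrix_check_argv(
    "/home/test_code/data", ["domain", "sme"],
    modules=["compute", "memory_access", "communication"],
    scan_dirs=["SurfHop.f90"], build_tool="make", build_cmd="make",
    output="/home/out/",
)
command_line = shlex.join(argv)  # shell-quoted string for display or logging
```

Because shlex.join quotes any argument that contains spaces or semicolons, a multi-step build command such as "mkdir build;cd build;cmake ..;make" is emitted in quotation marks, as the -c description requires.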

Example

This example scans the SurfHop.f90 source file in the /home/test_code/data directory. The build tool is make, the build command is make, the matricization optimization methods are sme and domain, and all three domain-specific modules (compute, memory_access, and communication) are enabled. Replace the example parameter values with the actual ones.

devkit advisor matrix-check -i /home/test_code/data -s SurfHop.f90 -c make -b make -o /home/out/ -p domain,sme -m compute,memory_access,communication

The following information is displayed and a report is generated:

Executing matricization check task, please wait...
Current progress: ################################# [100%]
Scanned time: 2025/11/18 09:37:36

Configuration:
    Scan source code path: /home/test_code/data
    Generate report path: /home/out
    Generate report type: all
    Task Timeout Interval: The timeout period is not set.
    Log level: info

Summary:
    Scanned 1 file, there are 9 suggestions.

For the details information, please check:
    /home/out/matrix-check_20251118093736_116c.html
    /home/out/matrix-check_20251118093736_116c.csv

The random_lib.f90 is in  /home/out/matrix-check_20251118093736_116c
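
If the console output needs to be consumed by a script, the summary counts and report paths can be extracted with regular expressions. The following Python sketch assumes that the output keeps exactly the shape shown above, which may change between tool versions. The helper parse_scan_output is hypothetical and not part of the tool.

```python
import re


def parse_scan_output(text):
    """Extract the file count, suggestion count, and report paths."""
    summary = re.search(
        r"Scanned (\d+) files?, there are (\d+) suggestions?", text
    )
    # Report paths appear on their own lines and end in .html or .csv.
    reports = re.findall(
        r"^[ \t]*(/\S+\.(?:html|csv))[ \t]*$", text, flags=re.MULTILINE
    )
    return {
        "files": int(summary.group(1)) if summary else None,
        "suggestions": int(summary.group(2)) if summary else None,
        "reports": reports,
    }


# Excerpt copied from the example output above.
sample = """Summary:
    Scanned 1 file, there are 9 suggestions.

For the details information, please check:
    /home/out/matrix-check_20251118093736_116c.html
    /home/out/matrix-check_20251118093736_116c.csv
"""
parsed = parse_scan_output(sample)
```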

Configuration File Use (-f/--configure-file)

By default, the tool generates an optimized file after a scan is complete. To control the optimization scope, specify line number ranges of the source files in a JSON configuration file.

The following gives an example:

{
    "/home/demo/file1.c": [
        [
            20,
            30
        ]
    ],
    "/home/demo/file2.c": [
        [
            76,
            81
        ],
        [
            93,
            105
        ]
    ]
}
  • /home/demo/file1.c: source file. It must be an absolute path.
  • [20,30]: line number range. Optimization items for code snippets within this range are retained when the optimized file is generated.
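
A configuration file in this format can also be produced by a script. The following Python sketch builds the JSON text from a mapping of source files to line ranges. The helper build_configure_file is hypothetical. The checks that a range starts at line 1 or later and does not end before it starts are assumptions, because the accepted range values are not specified here.

```python
import json


def build_configure_file(ranges_by_file):
    """Validate line ranges and return JSON text for -f/--configure-file."""
    config = {}
    for path, ranges in ranges_by_file.items():
        if not path.startswith("/"):
            raise ValueError(f"source file must be an absolute path: {path}")
        checked = []
        for start, end in ranges:
            # Assumed sanity rule: 1 <= start <= end.
            if start < 1 or end < start:
                raise ValueError(f"invalid range [{start}, {end}] for {path}")
            checked.append([start, end])
        config[path] = checked
    return json.dumps(config, indent=4)


config_text = build_configure_file({
    "/home/demo/file1.c": [(20, 30)],
    "/home/demo/file2.c": [(76, 81), (93, 105)],
})
```

Write config_text to a file and pass the absolute path of that file to -f/--configure-file.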

Output Report

Table 4 Output report parameters

| Parameter | Description |
| --- | --- |
| Configuration | Displays the software source file path. |
| Source File to Be Modified | Displays information such as the path to the source file that needs to be modified. |