Cruiser

Tool Overview

Cruiser is an intelligent precision analysis tool that automatically analyzes and instruments source code to quickly locate precision differences. The tool consists of two components, cruiser.exe (the automatic locating program) and libcruiser.so (the precision analysis library), both of which are integrated into the DevKit toolchain.

The functions of the Cruiser components are as follows:

Automatic locating program
- Analyzes the application source code using NLP.
- Creates an index call chain.
- Monitors the entire life cycle of variables.
- Automatically inserts analysis statements.
Precision analysis library
- Implements the inserted analysis statements and provides the underlying precision analysis capability.
- Checks for value differences and abnormal values (NaN/Inf).
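
The kind of check described for the precision analysis library can be illustrated with a minimal sketch. This is not the implementation in libcruiser.so; the function names, the tolerance, and the sample values are all hypothetical and only show how a value difference can be separated from an abnormal value.

```python
import math

def classify(x86_value, arm_value, rel_tol=1e-12):
    """Classify one pair of recorded values (hypothetical helper)."""
    for v in (x86_value, arm_value):
        if math.isnan(v):
            return "nan"
        if math.isinf(v):
            return "inf"
    if math.isclose(x86_value, arm_value, rel_tol=rel_tol, abs_tol=0.0):
        return "match"
    return "diff"

def first_anomaly(x86_values, arm_values, rel_tol=1e-12):
    """Return (index, kind) of the first pair that does not match, or None."""
    for i, (a, b) in enumerate(zip(x86_values, arm_values)):
        kind = classify(a, b, rel_tol)
        if kind != "match":
            return i, kind
    return None

# Made-up sample values standing in for one variable recorded on each platform.
x86 = [1.0, 2.5, 3.0, 4.0]
arm = [1.0, 2.5, 3.0000001, float("nan")]
print(first_anomaly(x86, arm))
```

Reporting the first anomaly rather than every one reflects the idea that later differences are often consequences of the earliest one.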

Constraints

The source code must compile and run successfully on both the x86 and Kunpeng platforms.

Usage Guide

Obtain Cruiser. Download URL: https://gitee.com/openeuler/hpcrunner/tree/master/software/utils/cruiser (including linux_x86 and linux_arm versions)
Use Cruiser to perform instrumentation on the source code.
- Upload the source code and Cruiser to the same server.
- Run the following command to perform automatic instrumentation on the source code:
  ./cruiser.exe --hook-mode main --root APP_DIR

Compile and run the instrumented source code on the x86 and Kunpeng platforms. Log files are generated.
Use Cruiser to analyze the calculation precision difference.
- Upload the source code, Cruiser, and log files generated on the x86 and Kunpeng platforms to the same server.
- Run the following command to analyze the calculation precision difference between the x86 and Kunpeng platforms:
  ./cruiser.exe --root APP_DIR --log-arm ARM_LOG_DIR --log-x86 x86_LOG_DIR --log OUT_LOG_DIR

Optimization Cases

Case 1: Marine science and numerical modeling (MASNUM) optimization

According to the automatic instrumentation analysis, the output of the max function on the Kunpeng platform is inconsistent with that on the x86 platform.

Before instrumentation:

After instrumentation:

Modification method: Ensure that all parameters of the max function are double-precision.

After the modification, the precision of the core indicator ee on the Kunpeng platform matches that on the x86 platform.
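
One plausible way a mixed-precision argument can change the result of a max call is sketched below. The source does not show the MASNUM code, so the values and the helper to_single are assumptions made for illustration: a single-precision constant is emulated by rounding a double to the nearest IEEE 754 single.

```python
import struct

def to_single(x):
    """Round a double to the nearest IEEE 754 single, returned as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

value = 0.05                   # hypothetical double-precision input
floor_single = to_single(0.1)  # behaves like a single-precision constant 0.1
floor_double = 0.1             # behaves like a double-precision constant 0.1

mixed = max(value, floor_single)    # one argument carries single-precision error
uniform = max(value, floor_double)  # all arguments are double precision

print(mixed == uniform)
print(abs(mixed - uniform))
```

The two results differ by roughly the single-precision rounding error of 0.1, which is large compared with double-precision accuracy and can grow in later calculations.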

Case 2: Precision optimization for a carbon source/sink application

Source code tuning principle: Cruiser uses NLP technology to parse the source code and build its topology. It then takes either the main function or a difference point identified from the difference log as the root node, creates an index call chain for that root on the topology, and monitors the entire life cycle of the variables involved. Next, it automatically inserts analysis statements along the call chain. After the application runs on the Arm and x86 platforms, Cruiser analyzes the logs to locate the differences so that each difference can be tuned specifically. The following figure shows the working principle:

Creating an index call chain of the root

Analyzing differences automatically

Inserting analysis statements into the source code

The following problems are located:
- There are fused multiply-add calculations in the application, and the results are inconsistent.
- There are some scenarios where math libraries are used, and the results are inconsistent.
- The application contains a large number of grid data structures and arrays, which leaves room for memory optimization.
Tuning solutions:
1. Remove options that are not precision-friendly.
  Delete the -ffast-math option from the arch/configure_new.defaults compilation file. This option optimizes the application aggressively, which affects the precision of variables.
2. Add options that are precision-friendly.
  Add the compilation option -ffp-compatibility=18, which maintains the same precision as that of ICC v18.
3. Disable the fused multiply-add calculation optimization.
  Add the compilation option -ffp-contract=off. By default, the compiler aggressively fuses multiply and add operations, which causes precision differences.
4. Optimize memory usage.
  Add the link option -ljemalloc for memory tuning. It reduces memory fragmentation and improves concurrency performance.
5. Optimize the math library.
  Add the link option -lkm_l9 for math library tuning to reduce the impact of math library interfaces.
Optimization effect: After the tuning performed by Cruiser, the difference between the prediction value of the Kunpeng cluster mode and the actual data from the ground observation station is the smallest.
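
The reason fused multiply-add calculations produce different results (tuning solution 3 above) is that a fused operation rounds once, while a separate multiply and add round twice. The following sketch emulates both behaviors with exact rational arithmetic. The operands are chosen for illustration and the function names are hypothetical.

```python
from fractions import Fraction

def unfused(a, b, c):
    """a*b is rounded to double, then the sum is rounded again."""
    return a * b + c

def fused(a, b, c):
    """Exact a*b + c rounded to double once, emulating a fused multiply-add."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
print(unfused(a, b, c))  # the small term is lost in the intermediate rounding
print(fused(a, b, c))    # the small term survives the single rounding
```

Neither result is wrong in isolation. The difference only matters when two platforms or compilers make different fusion choices, which is why disabling contraction helps make results comparable.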

Case 3: NaN value problem locating for the Ensemble Kalman Filter (EnKF)

EnKF is a data assimilation method. Within the Kalman filter framework, it estimates the background error covariance from an ensemble of predictions and obtains the optimal estimate of the target by minimizing the error covariance between the observed and simulated values.

Check the output result file. A large number of NaNf values exist.

Insert analysis statements that check for NaN values into the source code.

Use Cruiser to analyze whether there are NaN values in variables.

Locating process:

Source code tracing locates the NaN values in the multiply-add calculation of the AddState function in the bva_p/Util_Module.f90 file. They are caused by abnormal input in XB%FV.
Monitoring shows that the NaN values appear during the loop in the BatchDA function of the bva_p/BatchVA_Module.f90 file. The precision deviation first occurs in the 15th iteration, where NaN values are found in the calculation of an intermediate array variable. Subsequent iterations amplify the deviation, which affects the PSFC variable in the result file. Further tracing shows an exception in the calculation of the variable DD in the Obs1Reject function of the bva_p/ObsReject.f90 file. As a result, the Ireject variable is not assigned the value of IdxRejc, and 0 is returned. The upper-layer function ObtainBH in the bva_p/ObtainBH_Module.f90 file is therefore not executed, so the loop in the BatchDA function misses one round of calculation.
Continue tracing the source code. It is found that the previous problem is caused by the value of Pij in the INTP_Bilinear function of the bva_p/INTP_Bilinear.f90 file. The difference is caused by the PP array.
Continue tracing the source code. It is found that the PP variable is initialized in the AllocateState function of the bva_p/Util_Module.f90 file and is not modified subsequently.
Output the second layer of the third dimension of the PP variable, that is, XA%FV(:,:,2). The output shows that after memory is allocated for the variable, all of its values are random.
The final conclusion is that the data block contains dirty data when its memory is allocated, which affects subsequent calculations and operations.
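
The effect of dirty data in allocated memory can be sketched as follows. This is not the EnKF code. The byte patterns are hypothetical stand-ins for the leftover contents of an unassigned buffer, and the accumulation loop is a simplified multiply-add.

```python
import math
import struct

def decode_floats(raw):
    """Interpret raw bytes as little-endian IEEE 754 single-precision values."""
    count = len(raw) // 4
    return list(struct.unpack("<%df" % count, raw[:count * 4]))

def accumulate(values, weight=0.5):
    """Simplified multiply-add loop over the buffer contents."""
    total = 0.0
    for v in values:
        total = total + weight * v
    return total

# Hypothetical leftover bytes in a buffer that was allocated but never assigned.
dirty = bytes([0xFF, 0xFF, 0xFF, 0x7F,   # 0x7FFFFFFF is a NaN bit pattern
               0x00, 0x00, 0x80, 0x3F])  # 0x3F800000 is 1.0
clean = bytes(8)                         # explicitly zeroed buffer

print(math.isnan(accumulate(decode_floats(dirty))))  # NaN spreads to the result
print(accumulate(decode_floats(clean)))
```

A single NaN bit pattern is enough to turn every later result that depends on it into NaN, which is consistent with the spread of NaN values through the iterations described above.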

Tuning solutions

Based on the preceding analysis, the client application has defects in memory usage. Link the memory optimization library with -ljemalloc and recompile.
Precision differences occur when fused multiply-add calculations are performed. Add the -ffp-model=precise -ffp-contract=off compilation options for tuning.
In addition, add the -mcpu=tsv110 -ffp-compatibility=18 -fconvert=big-endian compilation options for general performance and precision tuning.
