Performing Configuration Before Installation

The optimization feature is delivered as a patch file. Once the patch is applied to the source code, the feature is ready for use.

Using SVE to Accelerate Vector Computing

Prerequisites

The CPU supports Scalable Vector Extension (SVE) instructions.

Check Method

Run the following command to check whether the CPU supports SVE:

lscpu

If the Flags line of the command output includes sve as a separate entry, the CPU supports SVE.

Architecture:           aarch64
  CPU op-mode(s):       64-bit
  Byte Order:           Little Endian
CPU(s):                 320
  On-line CPU(s) list:  0-319
Vendor ID:              HiSilicon
  BIOS Vendor ID:       HiSilicon
  BIOS Model name:      Kunpeng 920 7285Z
  Model:                0
  Thread(s) per core:   2
  Core(s) per socket:   80
  Socket(s):            2
  Stepping:             0x0
  Frequency boost:      disabled
  CPU max MHz:          3000.0000
  CPU min MHz:          400.0000
  BogoMIPS:             200.00
  Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint svei8mm svef32mm svef64mm svebf16 i8mm bf16 dgh rng ecv
Caches (sum of all):
  L1d:                  10 MiB (160 instances)
  L1i:                  10 MiB (160 instances)
  L2:                   200 MiB (160 instances)
  L3:                   280 MiB (4 instances)
NUMA:
  NUMA node(s):         4
  NUMA node0 CPU(s):    0-79
  NUMA node1 CPU(s):    80-159
  NUMA node2 CPU(s):    160-239
  NUMA node3 CPU(s):    240-319
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec rstack overflow: Not affected
  Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected
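
The check above can also be done programmatically. The following is a minimal sketch, not part of the patch: the helper name has_cpu_flag is hypothetical, and it assumes the flag list has already been read into a string (for example, from the Flags line shown above). It compares whole entries, because a plain substring search for sve would also match entries such as svei8mm.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `name` appears as a complete,
// whitespace-separated entry in `flags`.
bool has_cpu_flag(const std::string& flags, const std::string& name) {
    std::istringstream stream(flags);
    std::string entry;
    while (stream >> entry) {   // split on whitespace
        if (entry == name) {
            return true;        // exact match only, so "svei8mm" != "sve"
        }
    }
    return false;
}
```

For example, has_cpu_flag("fp asimd sve svei8mm", "sve") returns true, while has_cpu_flag("fp asimd svei8mm", "sve") returns false.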

Compilation Option

When compiling with GCC, the -march option specifies the target Arm architecture version and its instruction set extensions. This patch enables SVE with the following compilation options:

-march=armv8-a+sve -msve-vector-bits=256

The first option (-march=armv8-a+sve) enables SVE instructions, and the second (-msve-vector-bits=256) sets the SVE vector length to 256 bits.
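
To illustrate what these options make available, the following is a minimal sketch of a dot product written with the Arm C Language Extensions (ACLE) SVE intrinsics from arm_sve.h. It is not taken from the patch: the function name dot_product is hypothetical, and the scalar fallback is an assumption added so that the same file also builds when SVE is not enabled. GCC defines __ARM_FEATURE_SVE when the source is compiled with +sve.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>   // ACLE SVE intrinsics, available when built with +sve
#endif

// Hypothetical example: dot product of two float arrays of length n.
float dot_product(const float* a, const float* b, std::size_t n) {
#if defined(__ARM_FEATURE_SVE)
    svfloat32_t acc = svdup_n_f32(0.0f);
    // svcntw() is the number of 32-bit lanes per vector (8 for 256-bit SVE).
    for (std::size_t i = 0; i < n; i += svcntw()) {
        // The predicate enables only the lanes that are still inside the array.
        svbool_t pg = svwhilelt_b32(static_cast<std::uint64_t>(i),
                                    static_cast<std::uint64_t>(n));
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        // acc += va * vb on active lanes; inactive lanes keep their value.
        acc = svmla_f32_m(pg, acc, va, vb);
    }
    return svaddv_f32(svptrue_b32(), acc);   // horizontal sum of all lanes
#else
    // Scalar fallback (assumption for non-SVE builds).
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
#endif
}
```

The loop is written in a vector-length-agnostic style, so it does not depend on the 256-bit setting. The predicate handles the tail of the array without a separate scalar loop.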

Using PF to Accelerate Data Processing

Prefetching (PF) can accelerate data processing and improve system performance in workloads with frequent loop operations, parallel computing, or a high cache miss rate.

Hardware Prefetch

Hardware prefetching tracks how instruction and data addresses change and loads the corresponding instructions and data into the cache in advance. You are advised to enable the prefetch function in the BIOS.

  1. Restart the server and enter the BIOS.
  2. In the BIOS, choose Advanced > MISC Config and press Enter.
  3. Set CPU Prefetching Configuration to Enabled, and press F10.

Software Prefetch

On the Arm architecture, the Prefetch Memory (PRFM) instruction can be used to prefetch data. The instruction loads data into the cache ahead of time, before the processor actually uses it. Prefetched data is usually placed in the L1 data cache. If the L1 cache is full, the data may be placed in the L2 cache or, if present, the L3 cache. The prefetching effect can be further tuned by adjusting the prefetch step.

The following macro shows how the prefetch instruction is implemented in C++:

#define PLDL1KEEP_OFF(ptr, off) __asm__ volatile("prfm PLDL1KEEP, [%0, #(%1)]"::"r"(ptr), "i"(off):)

In this feature, the prefetch instruction is issued before the SVE load instruction, so that the two optimization methods work together.
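
As a rough illustration of how such a macro might be used with a prefetch step, the sketch below sums an array and issues one prefetch per 16 elements. It is not taken from the patch. The function name sum_with_prefetch, the 256-byte step, and the 16-element interval are assumptions chosen for the example and would need tuning on real hardware. On non-AArch64 targets the macro falls back to the GCC/Clang builtin __builtin_prefetch so that the example still compiles.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__)
// PRFM with an immediate byte offset; `off` must be a compile-time constant.
#define PLDL1KEEP_OFF(ptr, off) \
    __asm__ volatile("prfm PLDL1KEEP, [%0, #(%1)]" : : "r"(ptr), "i"(off))
#else
// Fallback (assumption): GCC/Clang builtin, read access, keep in all cache levels.
#define PLDL1KEEP_OFF(ptr, off)                                    \
    __builtin_prefetch(reinterpret_cast<const void*>(              \
        reinterpret_cast<std::uintptr_t>(ptr) + (off)), 0, 3)
#endif

// Hypothetical example: sum an array while prefetching 256 bytes ahead.
float sum_with_prefetch(const float* data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 16 == 0) {
            // One prefetch per 16 floats (64 bytes). A prefetch is only a
            // hint, so an address past the end of the array does not fault.
            PLDL1KEEP_OFF(data + i, 256);
        }
        acc += data[i];
    }
    return acc;
}
```

A step that is too small brings data in too late to hide memory latency, and a step that is too large risks the data being evicted before it is used, which is why the step is worth measuring per workload.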