
Software Prefetch

Principles

The data prefetch function loads, into the cache in advance, data that the code will use later. This reduces the time the CPU waits for data to be loaded from memory, increases the cache hit ratio, and improves software running efficiency. The format of a prefetch instruction is as follows:

PRFM  prfop, [Xn|SP{, #pimm}]
  1. prfop consists of type, target, and policy.
    • The options of type are as follows:
      • PLD: data prefetch (preload)
      • PLI: instruction prefetch
      • PST: data prefetch for store
    • The options of target are as follows:
      • L1
      • L2
      • L3

      L1, L2, and L3 indicate the three cache levels.

    • The options of policy are as follows:
      • KEEP: prefetched data is retained in the cache for a certain period of time. This policy applies to scenarios where the data is used multiple times.
      • STRM: streaming (non-temporal) prefetch. This policy applies to scenarios where the data is used only once and can be evicted after use.
  2. Xn|SP indicates a 64-bit general-purpose register or the stack pointer, which holds the prefetch base address.
  3. pimm is an optional immediate byte offset added to the base address, determining how far ahead of the base address data is prefetched. The value is an integer multiple of 8 in the range 0 to 32760, and defaults to 0. The prefetch distance can be set based on service requirements; you are advised to try several distances to find the optimal value.

In terms of instruction composition, the core part of the prefetch instruction is prfop, which determines the prefetch type, the target cache level, and how the prefetched data will be used. This section describes PLD data prefetch; the other types are similar. Table 1 lists the core instructions for data prefetch.

Table 1 Core instructions

Instruction | Description
PLDL1KEEP   | Data is prefetched to the L1 cache in KEEP mode and retained for a certain period of time after being used.
PLDL2KEEP   | Data is prefetched to the L2 cache in KEEP mode and retained for a certain period of time after being used.
PLDL3KEEP   | Data is prefetched to the L3 cache in KEEP mode and retained for a certain period of time after being used.
PLDL1STRM   | Data is prefetched to the L1 cache in STRM mode and evicted from the cache after being used.
PLDL2STRM   | Data is prefetched to the L2 cache in STRM mode and evicted from the cache after being used.
PLDL3STRM   | Data is prefetched to the L3 cache in STRM mode and evicted from the cache after being used.

The GCC compiler has a builtin function for prefetch. The format is as follows:

__builtin_prefetch (const void *addr, int rw, int locality)
  • addr: address of the data in memory.
  • rw: optional parameter, 0 or 1. 0 indicates a read access (the default), and 1 indicates a write access.
  • locality: optional parameter, an integer from 0 to 3 (3 by default) indicating the degree of temporal locality, that is, how long the data should stay in the cache. 0 means the data need not be kept in the cache after the access; 3 means the data will be accessed again soon and should be kept in all possible cache levels; 1 and 2 indicate low and moderate degrees of temporal locality, respectively.

For more information about prefetch instructions, see the Arm instruction set manual:

https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/PRFM--immediate---Prefetch-Memory--immediate--?lang=en

Modification Method

You can observe the context of data-loading instructions such as LDR in hotspot functions and embed data prefetch operations into the code. Generally, data prefetch is performed in a loop. In C/C++ code, the prefetch instruction is invoked through inline assembly in a wrapper declared inline. The following is an example:
// Prefetch the cache line 128 bytes ahead of ptr.
static inline void Prefetch(const int *ptr)
{
    __asm__ volatile("prfm PLDL1KEEP, [%0, #(%1)]" :: "r"(ptr), "i"(128));
}

PLDL1KEEP means the data is prefetched to the L1 cache in KEEP mode: it is retained in the cache for a certain period of time after being used, which suits scenarios where the data is accessed multiple times.

The following sample code multiplies the elements of two arrays. Several elements are prefetched ahead of each iteration, and the loop is unrolled to improve computing performance.

for (int i = 0; i < ARRAYLEN; i++) {
    arrayC[i] = arrayA[i] * arrayB[i];
}   

Add prefetch instructions:

int i;
Prefetch(&arrayA[0]);
Prefetch(&arrayB[0]);
for (i = 0; i < ARRAYLEN - ARRAYLEN % 4; i+=4) {
    Prefetch(&arrayA[i + 4]);
    Prefetch(&arrayB[i + 4]);
    arrayC[i] = arrayA[i] * arrayB[i];
    arrayC[i + 1] = arrayA[i + 1] * arrayB[i + 1];
    arrayC[i + 2] = arrayA[i + 2] * arrayB[i + 2];
    arrayC[i + 3] = arrayA[i + 3] * arrayB[i + 3];
}
for (; i < ARRAYLEN; i++) {
    arrayC[i] = arrayA[i] * arrayB[i];
}

According to the test results, the loop takes 9,359 μs when the prefetch function is not used and 5,569 μs when it is used, a significant performance improvement.