Prefetch
Before the CPU can process data that resides in main memory, the data must move up the cache hierarchy: from memory to L3, from L3 to L2, from L2 to L1, and finally from L1 into a register. If the data that the CPU needs next is already in L1, this latency is avoided and the program runs faster.
Prefetch is classified as hardware prefetch or software prefetch. With hardware prefetch, the processor uses the history of memory accesses to predict which memory locations will be accessed next and loads them into the cache in advance, so that the access does not miss the cache when the data is needed. Hardware prefetch is general-purpose. With software prefetch, the programmer inserts prefetch instructions into application code to prefetch specific locations. Software prefetch is targeted.
This document covers only programming on the Kunpeng platform, so only software prefetch is described.
Software Prefetch
Software prefetch is implemented with a prefetch instruction, and each architecture provides its own. On the Kunpeng platform, the prefetch instruction has the following format:
    PRFM prfop, [Xn|SP{, #pimm}]
prfop consists of type, target, and policy, written together in that order (for example, PLDL1KEEP).
- The options of type are as follows:
  - PLD: prefetch for load (data will be read)
  - PLI: preload instructions
  - PST: prefetch for store (data will be written)
- The options of target are as follows. The operation is performed on the specified cache level.
  - L1
  - L2
  - L3
- The options of policy are as follows:
  - KEEP: temporal prefetch. The data is retained in the cache for a period of time after being prefetched. This policy applies to scenarios where the data is used multiple times.
  - STRM: streaming or non-temporal prefetch. This policy applies to scenarios where the data is used only once and can be evicted after being used.
The remaining operands are as follows:
- Xn|SP: a 64-bit general-purpose register or the stack pointer, which holds the base address to prefetch.
- pimm: optional byte offset added to the base address. The value must be a multiple of 8 in the range 0 to 32760. The default value is 0.
The core of the prefetch instruction is prfop, which determines the prefetch type, the target cache level, and how the prefetched data is expected to be used. This section describes PLD data prefetch; the other types work similarly. The PLD prefetch operations and their functions are as follows:
| Instruction | Description |
|---|---|
| PLDL1KEEP | Data is prefetched to the L1 cache in KEEP mode and retained in the cache for a period of time after being used. |
| PLDL2KEEP | Data is prefetched to the L2 cache in KEEP mode and retained in the cache for a period of time after being used. |
| PLDL3KEEP | Data is prefetched to the L3 cache in KEEP mode and retained in the cache for a period of time after being used. |
| PLDL1STRM | Data is prefetched to the L1 cache in STRM mode and can be evicted from the cache after being used. |
| PLDL2STRM | Data is prefetched to the L2 cache in STRM mode and can be evicted from the cache after being used. |
| PLDL3STRM | Data is prefetched to the L3 cache in STRM mode and can be evicted from the cache after being used. |
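The prfop names in the table above can be wrapped in a small macro so that calling code selects the cache level and policy by name. The following is a minimal sketch under stated assumptions, not a Kunpeng-provided API: the names PRFM_HINT, INTS_PER_LINE, sum_stream, and warm_table are hypothetical, a 64-byte cache line and a 4-byte int are assumed, and on non-AArch64 targets the macro falls back to GCC's __builtin_prefetch only so that the code still compiles.

```c
#include <stddef.h>

/* Hypothetical helper macro (not part of any SDK): emits PRFM with the
 * prfop written as a bare token, e.g. PRFM_HINT(PLDL2KEEP, ptr). */
#if defined(__aarch64__)
#define PRFM_HINT(prfop, addr) \
    __asm__ __volatile__("prfm " #prfop ", [%0]" : : "r"((const void *)(addr)))
#else
/* Fallback so the sketch builds on other targets; the prfop is ignored. */
#define PRFM_HINT(prfop, addr) __builtin_prefetch((const void *)(addr))
#endif

#define INTS_PER_LINE 16  /* assumes 64-byte cache lines and 4-byte int */

/* Data that is read once: stream it through L1 (PLDL1STRM). */
long sum_stream(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* One hint per cache line, four lines ahead, never past the end. */
        if (i % INTS_PER_LINE == 0 && i + 4 * INTS_PER_LINE < n) {
            PRFM_HINT(PLDL1STRM, data + i + 4 * INTS_PER_LINE);
        }
        total += data[i];
    }
    return total;
}

/* A table that will be read many times: ask for it to be kept in L2. */
void warm_table(const int *table, size_t n)
{
    for (size_t i = 0; i < n; i += INTS_PER_LINE) {
        PRFM_HINT(PLDL2KEEP, table + i);
    }
}
```

A caller might run warm_table(table, n) once before a phase that performs repeated lookups, and use sum_stream for a single pass over a large buffer. Prefetch instructions are hints, so whether either choice helps has to be measured on the target machine.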
GCC provides a built-in function for prefetch. Its format is as follows:
    __builtin_prefetch (const void *addr, int rw, int locality)
In this function:
- addr is the memory address of the data to prefetch.
- rw is optional and can be set to 0 or 1. The value 0 (the default) indicates that the prefetch prepares for a read, and the value 1 indicates that it prepares for a write.
- locality is optional and indicates the degree of temporal locality of the data, that is, how long the data should be kept in the cache. It can be set to 0 to 3. 0 indicates that the data has no temporal locality and need not be kept in the cache after the access. 3 indicates that the data has a high degree of temporal locality and should be kept in all possible cache levels. 1 and 2 indicate a low and a moderate degree of temporal locality, respectively. The default value is 3.
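To illustrate the rw and locality arguments, the sketch below multiplies an array by a constant. It is only an example under stated assumptions: the function name scale and the constants PREFETCH_AHEAD and INTS_PER_LINE are hypothetical, the prefetch distance is illustrative rather than tuned, and a 64-byte cache line with a 4-byte int is assumed.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 64  /* elements; illustrative, not a tuned value */
#define INTS_PER_LINE  16  /* assumes 64-byte cache lines and 4-byte int */

/* dst[i] = src[i] * factor.
 * src is read once, so it is prefetched with rw = 0, locality = 0.
 * dst is written and assumed to be reused soon, so rw = 1, locality = 3. */
void scale(int *dst, const int *src, size_t n, int factor)
{
    for (size_t i = 0; i < n; i++) {
        /* One pair of hints per cache line, never past the end. */
        if (i % INTS_PER_LINE == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);
            __builtin_prefetch(&dst[i + PREFETCH_AHEAD], 1, 3);
        }
        dst[i] = src[i] * factor;
    }
}
```

On AArch64, GCC is expected to lower these calls to PRFM instructions. The exact prfop chosen for each locality value depends on the compiler, so it is worth checking the generated assembly (for example, with gcc -S) when the cache level matters.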
Without software prefetch:
    int add_vector(int *dst, int *src1, int *src2, int size)
    {
        for (int index = 0; index < size; index += 4) {
            … // do something
        }
    }
With software prefetch:
    /* Issue a PLDL1STRM hint for the cache line that contains data. */
    static inline void prefetch(const void *data)
    {
        __asm__ __volatile__(
            "prfm PLDL1STRM, [%[data]] \n\t"
            :: [data] "r" (data));
    }
    int add_vector(int *dst, int *src1, int *src2, int size)
    {
        for (int index = 0; index < size; index += 4) {
            /* Prefetch 256 elements ahead of the current position. */
            prefetch(src1 + index + 256);
            prefetch(src2 + index + 256);
            … // do something
        }
    }
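The loop above issues two prefetches on every iteration, although each iteration advances by only four elements, so several consecutive hints target the same cache line. One possible refinement is to issue one hint per cache line and to stop before the end of the arrays. The sketch below shows this under stated assumptions: the loop body (an element-wise add) is hypothetical because the original body is elided, the name add_vector_by_line and the return value 0 are illustrative, a 64-byte cache line and a 4-byte int are assumed, and the non-AArch64 branch exists only so that the code builds on other targets.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64    /* assumed line size; verify for the target CPU */
#define PREFETCH_BYTES   1024  /* 256 four-byte ints, the distance used above */

static inline void prefetch(const void *data)
{
#if defined(__aarch64__)
    __asm__ __volatile__("prfm PLDL1STRM, [%[data]]" : : [data] "r"(data));
#else
    __builtin_prefetch(data, 0, 0);  /* fallback so the sketch builds elsewhere */
#endif
}

/* Hypothetical body: dst[i] = src1[i] + src2[i]. Returns 0. */
int add_vector_by_line(int *dst, const int *src1, const int *src2, int size)
{
    const int ints_per_line = CACHE_LINE_BYTES / (int)sizeof(int);
    const int ahead = PREFETCH_BYTES / (int)sizeof(int);

    for (int index = 0; index < size; index++) {
        /* One pair of hints per cache line, never past the end of the arrays. */
        if (index % ints_per_line == 0 && index < size - ahead) {
            prefetch(src1 + index + ahead);
            prefetch(src2 + index + ahead);
        }
        dst[index] = src1[index] + src2[index];
    }
    return 0;
}
```

The best prefetch distance depends on memory latency and on how much work each iteration does, so the 1024-byte value is a starting point to be measured rather than a recommendation.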