Memory Operations

In a high-level language (HLL), the basic objects are variables and constants of various data types and data structures. In assembly, most instructions operate on registers. However, the number of registers is limited and cannot hold every variable that a program needs, so data moves frequently between registers and memory.

Memory operations are classified into two types: load and store. The two types mirror each other in syntax and function. Take the load operation as an example. The most common instructions are ldr and ldp. The ldr instruction loads a value from memory into one register, and the ldp instruction loads a pair of values from memory into two registers. The basic instruction format is as follows:

ldr (ldp) register name (register name 2), memory addressing

Memory addressing mode

Both load and store operations must specify a memory address. As shown in the following table, with the ldr instruction as an example, the three most common memory addressing modes are:

Type	ldr Instruction	Description (Assume that the reg_addr register, a custom name, initially holds the address addr.)
Base plus offset	ldr x1, [reg_addr, offset]	Reads eight bytes (x1 is a 64-bit register) from address addr + offset and loads them to the x1 register. reg_addr is not modified. When the offset is 0, the data is read directly from addr.
Pre-indexed	ldr x1, [reg_addr, offset]!	Reads eight bytes from address addr + offset, as above, and also writes the updated address addr + offset back to reg_addr.
Post-indexed	ldr x1, [reg_addr], offset	Reads eight bytes from addr and loads them to the x1 register. After the loading is complete, addr + offset is written back to reg_addr.


As the descriptions show, the base plus offset mode only performs the load, whereas the pre-indexed and post-indexed modes also modify the base address register. In effect, the CPU performs an additional ADD or SUB on the base register when it executes these instructions.
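The difference between the three modes can be modeled in C. The following is only a minimal sketch of the behavior, not how the CPU implements it: the helper names (load64, ldr_base_offset, ldr_pre_indexed, ldr_post_indexed) are hypothetical, and the pointer passed by reference stands in for the reg_addr register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read eight bytes from p, like a 64-bit ldr. */
static uint64_t load64(const unsigned char *p)
{
    uint64_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* alignment-safe 8-byte read */
    return v;
}

/* Models: ldr x1, [reg_addr, offset]   (base register unchanged) */
static uint64_t ldr_base_offset(const unsigned char **reg_addr, ptrdiff_t offset)
{
    return load64(*reg_addr + offset);
}

/* Models: ldr x1, [reg_addr, offset]!  (load from addr + offset, base updated) */
static uint64_t ldr_pre_indexed(const unsigned char **reg_addr, ptrdiff_t offset)
{
    *reg_addr += offset;
    return load64(*reg_addr);
}

/* Models: ldr x1, [reg_addr], offset   (load from addr, then base updated) */
static uint64_t ldr_post_indexed(const unsigned char **reg_addr, ptrdiff_t offset)
{
    uint64_t v = load64(*reg_addr);
    *reg_addr += offset;
    return v;
}
```

In this model, only the pre-indexed and post-indexed helpers change the pointer, which corresponds to the extra ADD or SUB on the base register.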

Continuous memory copy

When a program needs to read a segment of continuous memory, selecting a proper addressing mode for the load instructions may improve performance to some extent. In the following comparison, 32 bytes need to be loaded. Generally, mode 2 performs better than mode 1 because it updates the base register once instead of after every load.

Implementation Mode 1

... ...
ldr x1, [addr], #8
ldr x2, [addr], #8
ldr x3, [addr], #8
ldr x4, [addr], #8    /* after the four loads: addr = addr + 32 */

Implementation Mode 2

... ...
ldr x1, [addr]
ldr x2, [addr, #8]
ldr x3, [addr, #16]
ldr x4, [addr, #24]!  /* writeback: addr = addr + 24 */

For a single sequence like this, the difference may be negligible. However, when the logic sits inside a loop, the benefit accumulates with each iteration. Whichever addressing mode you use, pay special attention to how addresses are read and updated under boundary conditions. Otherwise, your assembly code may behave incorrectly.
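The two approaches can be contrasted inside a loop with a C model. This is only an illustrative sketch: the function names copy32_mode1 and copy32_mode2 are hypothetical, the length is assumed to be a multiple of 32, and a C compiler is free to generate the same machine code for both, so the sketch shows the address bookkeeping rather than a measured speedup. Here the mode 2 model advances the pointers by the full 32 bytes once per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mode 1 model: the source pointer is advanced after every 8-byte load. */
static void copy32_mode1(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 32 == 0);               /* assumption: whole 32-byte blocks only */
    for (size_t done = 0; done < n; done += 32) {
        for (int i = 0; i < 4; i++) {
            uint64_t x;
            memcpy(&x, src, 8);        /* load, like ldr x, [addr], #8 */
            src += 8;                  /* base updated after each load */
            memcpy(dst, &x, 8);
            dst += 8;
        }
    }
}

/* Mode 2 model: four loads at fixed offsets, one pointer update per block. */
static void copy32_mode2(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 32 == 0);               /* assumption: whole 32-byte blocks only */
    for (size_t done = 0; done < n; done += 32) {
        uint64_t x1, x2, x3, x4;
        memcpy(&x1, src, 8);           /* like ldr x1, [addr]      */
        memcpy(&x2, src + 8, 8);       /* like ldr x2, [addr, #8]  */
        memcpy(&x3, src + 16, 8);      /* like ldr x3, [addr, #16] */
        memcpy(&x4, src + 24, 8);      /* like ldr x4, [addr, #24] */
        memcpy(dst, &x1, 8);
        memcpy(dst + 8, &x2, 8);
        memcpy(dst + 16, &x3, 8);
        memcpy(dst + 24, &x4, 8);
        src += 32;                     /* single base update per iteration */
        dst += 32;
    }
}
```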

Tail processing to prevent memory corruption

Optimizing string functions involves a large number of memory operations. As described above, most of them are performed in unrolled loops that process a fixed number of bytes per iteration. However, when processing reaches the tail, reusing the loop's instructions for the last operation may access more bytes than remain, which goes out of bounds and can corrupt memory. The operating system's memory allocation may happen to mask this problem, but you still need to handle it explicitly.

Let's assume a scenario where the number of remaining bytes is less than 16 but not known in advance. The following two implementation modes are provided:

Implementation Mode 1

... ...
L(Last16Bytes):
... ...
ldrb w1, [addr], #1
subs count, count, #1
b.hi L(Last16Bytes)
... ...

Implementation Mode 2

/* Assume that addrend was calculated from addr and count beforehand */
add addrend, addr, count
... ...
ldr x1, [addrend, #-8]
ldr x2, [addrend, #-16]
... ...

Mode 1 follows the conventional logic: it reads single bytes in a loop and checks the end condition on every iteration. In mode 2, the end address is calculated from addr and count and labeled addrend. Then, two 8-byte values are loaded at negative offsets from that address. As shown in the following figure, part of the memory in the middle overlaps with the area already processed by the previous instructions. However, the overlap does not affect the result. Overall, the number of instructions is greatly reduced, and no loop condition check is required.
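The overlapping-tail idea of mode 2 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the glibc implementation: the function name copy_overlap_tail is hypothetical, the buffers are assumed not to overlap each other, and the total length is assumed to be at least 16 bytes so that the 16 bytes before the end address are always valid to access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes (n >= 16) in 16-byte blocks. The final block is anchored at
   the end of the buffer, so it may overlap bytes that were already copied. */
static void copy_overlap_tail(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n >= 16);                         /* assumption of this sketch */
    const unsigned char *srcend = src + n;   /* plays the role of addrend */
    unsigned char *dstend = dst + n;
    size_t i = 0;

    /* Main loop: full 16-byte blocks only. */
    while (n - i >= 16) {
        uint64_t a, b;
        memcpy(&a, src + i, 8);
        memcpy(&b, src + i + 8, 8);
        memcpy(dst + i, &a, 8);
        memcpy(dst + i + 8, &b, 8);
        i += 16;
    }

    /* Tail: fewer than 16 bytes remain. Instead of a byte loop, load the last
       16 bytes relative to the end address, as in
       ldr x1, [addrend, #-8] and ldr x2, [addrend, #-16]. */
    uint64_t x1, x2;
    memcpy(&x1, srcend - 8, 8);
    memcpy(&x2, srcend - 16, 8);
    memcpy(dstend - 8, &x1, 8);
    memcpy(dstend - 16, &x2, 8);
}
```

For a 37-byte buffer, for example, the loop copies bytes 0 to 31 and the tail copies bytes 21 to 36, so bytes 21 to 31 are written twice with the same values and nothing past byte 36 is touched.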

In addition to the ldr and ldp usage described above, Armv8 provides many other classes of load instructions that suit specific scenarios but are seldom used, such as non-temporal memory operations, exclusive memory operations (used in lock-related scenarios), and unprivileged memory operations (which perform the access as if it were executed at EL0). You are advised to read the Armv8 instruction manual and give them a try.
