SME指令和Intrinsic

SME指令集包含以下几类指令：

Loads和Stores
ZA与vector间的数据移动
Vector与ZA tile slice相加
SVE向量外积
读取SVL
清零ZA tile

ACLE中提供了大部分SME指令的intrinsic接口，intrinsic接口的定义中有以下常见的组成：

tile：操作的ZA tile的编号，如数据类型为8bit时，仅有一个tile，编号为0；tile必须为整型常量表达式。
slice：操作的slice的编号；slice可以为变量。
vnum：当intrinsic有vnum参数时，操作的slice编号为slice参数与vnum参数的加和；vnum参数可以为变量。
pg：控制是否操作slice上具体的一个元素。
_ver：垂直方向操作对应的ZA tile。
_hor：水平方向操作对应的ZA tile。
za8/za16/za32/za64/za128：代表ZA中元素的数据类型为8/16/32/64/128bit。

Loads和Stores

以Load指令为例，SME中包含两类load指令：

LD1B/LD1H/LD1W/LD1D/LD1Q：将ZA看作是一系列二维数组，每个数组根据数组元素大小，由多个切片（slice）组成，每个slice的大小为SVL bytes。
LDR：将ZA看作是一个由 SVL.B个SVL bytes宽度向量组成的向量数组。

LD1B/LD1H/LD1W/LD1D/LD1Q

从内存连续读取单slice长度的数据存放在指定ZA tile。

对应的intrinsic格式有以下两种类型（以水平方向为例）：

不含vnum参数：

// Also for _za16, _za32, _za64 and _za128 (with the same prototype).
void svld1_hor_za8(uint64_t tile, uint32_t slice, svbool_t pg,
const void *ptr)
__arm_streaming __arm_inout("za");

含vnum参数：

// Synthetic intrinsic: adds vnum to slice and vnum * svcntsb() to the
// address given by ptr.
//
// Also for _za16, _za32, _za64 and _za128 (with the same prototype).
void svld1_hor_vnum_za8(uint64_t tile, uint32_t slice, svbool_t pg,
const void *ptr, int64_t vnum)
__arm_streaming __arm_inout("za");

如图所示，绿色表示不含vnum参数场景，数据从ptr地址load到za对应slice位置。黄色标识含有vnum参数的场景，数据从ptr+vnum*SVL.B地址load到za对应slice+vnum位置。SVL.B通过svcntsb()获取。

示例：读取一个数据宽度为64bit，大小8x8的方阵存放在ZA0 tile中（SVL.B=64）

void test_svld1_hor_vnum_za64(uint32_t slice_base, svbool_t pg, const void *ptr)
    __arm_streaming __arm_out("za"){
    for (int i = 0; i < 8; i++) {
        svld1_hor_vnum_za64(0, slice_base, pg, ptr, i);
    }
} 

LDR

从内存连续读取SVL字节的数据存放在指定的ZA vector，该指令在非流模式也可以运行，对应的intrinsic格式：

void svldr_za(uint32_t slice, const void *ptr) __arm_streaming __arm_inout("za");

示例：从ptr指定地址中读取数据到ZA对应slice中

void test_svldr_vnum_za(uint32_t slice, const void *ptr) __arm_out("za") {
    svldr_vnum_za(slice, ptr, 0);
}

ZA与vector间的数据移动

MOVA

从指定的ZA tile读取某个切片上的数据，存放到目标向量寄存器中，或从向量寄存器的数据移动到ZA指定tile上，ACLE仅提供了前者对应的intrinsic，列表如下（以水平方向为例）：

// And similarly for u8.
svint8_t svread_hor_za8[_s8]_m(svint8_t zd, svbool_t pg,
uint64_t tile, uint32_t slice)
__arm_streaming __arm_in("za");
// And similarly for u16, bf16 and f16.
svint16_t svread_hor_za16[_s16]_m(svint16_t zd, svbool_t pg,
__arm_streaming __arm_in("za");
// And similarly for u32 and f32.
svint32_t svread_hor_za32[_s32]_m(svint32_t zd, svbool_t pg,
uint64_t tile, uint32_t slice)
__arm_streaming __arm_in("za");
// And similarly for u64 and f64.
svint64_t svread_hor_za64[_s64]_m(svint64_t zd, svbool_t pg,
uint64_t tile, uint32_t slice)
__arm_streaming __arm_in("za");
// And similarly for s16, s32, s64, u8, u16, u32, u64, bf16, f16, f32, f64
svint8_t svread_hor_za128[_s8]_m(svint8_t zd, svbool_t pg,
uint64_t tile, uint32_t slice)
__arm_streaming __arm_in("za");

示例：读取za0对应slice的数据到目标寄存器中，数据类型为int64

void test_svread_hor_za64(svint64_t zd, svbool_t pg, uint32_t slice)  __arm_streaming __arm_in("za") {
    svread_hor_za64_s64_m(zd, pg, 0, slice); 
}

Vector与ZA tile slice相加

ADDHA/ADDVA

将zn寄存器中的数据与ZA tile slice的数据相加，结果覆盖传入的ZA tile slice，intrinsic列表如下（以水平方向为例）：

void svaddha_za32[_s32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svint32_t zn)
__arm_streaming __arm_inout("za");
void svaddha_za32[_u32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svuint32_t zn)
__arm_streaming __arm_inout("za");
// 920 72F8不支持
void svaddha_za64[_s64]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svint64_t zn)
__arm_streaming __arm_inout("za");
// 920 72F8不支持
void svaddha_za64[_u64]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svuint64_t zn)
__arm_streaming __arm_inout("za");

示例：将zn中的元素与ZA tile的水平slice元素相加，只操作active元素

void test_svaddha_za64_u64(svbool_t pn, svbool_t pm, svuint64_t zn) __arm_streaming __arm_inout("za") {
    svaddha_za64_u64_m(0, pn, pm, zn);
}

SVE向量外积

向量外积操作包含向量外积累加和向量外积累减两种操作，将两个向量数据做外积运算，计算结果与对应ZA tile数据做累加或累减，两种操作形式一致，以下仅介绍向量外积累加操作。

BFMOPA/FMOPA/SMOPA/UMOPA

前缀代表输入操作数的数据类型，BF--BFloat16、F--Float、S--Signed、U--Unsigned。最后一个字母A代表是累加和，如果是S则代表减法操作。对应的intrinsic接口参数功能一致，参数类型上有区别。

相比其它指令，FMOPA区分widening和non-widening类型，区别在于前者输入数据的类型半精度浮点，输出类型为单精度浮点类型，需要做精度扩展；而后者输入输出浮点类型保持一致，均为单精度浮点。

void svmopa_za32[_bf16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svbfloat16_t zn, svbfloat16_t zm)
__arm_streaming __arm_inout("za");
void svmopa_za32[_f16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svfloat16_t zn, svfloat16_t zm)
__arm_streaming __arm_inout("za");
void svmopa_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svint8_t zn, svint8_t zm)
__arm_streaming __arm_inout("za");
void svmopa_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svuint8_t zn, svuint8_t zm)
__arm_streaming __arm_inout("za");
// 920 72F8不支持
void svmopa_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svint16_t zn, svint16_t zm)
__arm_streaming __arm_inout("za");
// 920 72F8不支持
void svmopa_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svuint16_t zn, svuint16_t zm)
__arm_streaming __arm_inout("za");
 
// FMOPA (non-widening)
void svmopa_za32[_f32]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svfloat32_t zn, svfloat32_t zm)
__arm_streaming __arm_inout("za");
void svmopa_za64[_f64]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svfloat64_t zn, svfloat64_t zm)
__arm_streaming __arm_inout("za");

SUMOPA/USMOPA

以SUMOPA指令为例，第一个字母S代表操作数zn类型为signed，第二个字母U代表操作数zm为unsigned类型。本操作适用于由8位整数组成的32位元素ZA tile 和16位整数组成的64位元素ZA tile。

void svsumopa_za32[_s8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svint8_t zn, svuint8_t zm)
__arm_streaming __arm_inout("za");
// 920 72F8不支持
void svsumopa_za64[_s16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svint16_t zn, svuint16_t zm)
__arm_streaming __arm_inout("za");
void svusmopa_za32[_u8]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svuint8_t zn, svint8_t zm)
__arm_streaming __arm_inout("za");
// 920 72F8不支持
void svusmopa_za64[_u16]_m(uint64_t tile, svbool_t pn, svbool_t pm,
svuint16_t zn, svint16_t zm)
__arm_streaming __arm_inout("za");

示例：对 zn，zm两个输入向量做外积累加运算

void test_svsumopa_za32_s8( svbool_t pn, svbool_t pm, svint8_t zn, svuint8_t zm)
    __arm_streaming __arm_inout("za"){
    svsumopa_za32_s8_m(0, pn, pm, zn, zm);
}

读取SVL

读取当前的SVL长度，返回对应元素类型的个数，intrinsic列表如下：

// 返回streaming vector宽度对应的byte数量
uint64_t svcntsb() __arm_streaming_compatible;
// 返回streaming vector宽度对应的half word数量
uint64_t svcntsh() __arm_streaming_compatible;
// 返回streaming vector宽度对应的word数量
uint64_t svcntsw() __arm_streaming_compatible;
// 返回streaming vector宽度对应的double word数量
uint64_t svcntsd() __arm_streaming_compatible;

示例：读取streaming vector中bytes的个数

uint64_t test_svcntsb() {
  return svcntsb();
}

清零ZA tile

ZERO

清零整块ZA

void svzero_za() __arm_streaming_compatible __arm_out("za");

清零部分ZA
tile_mask必须为常量，范围在0~255；清零的粒度以64-bit tile为单位，即ZAn.d，tile_mask的二进制从低到高位分别表示ZA0.d~ZA7.d，如1表示清零ZA0.d：
1 2
void svzero_mask_za(uint64_t tile_mask) __arm_streaming_compatible __arm_inout("za");

示例：清零ZA4.d，ZA5.d，ZA7.d

void test_svzero_mask_za_1() {
  svzero_mask_za(176); // 176 = 16+32+128；16，32，128分别代表ZA4.d，ZA5.d，ZA7.d
}

矩阵乘示例

#include <stdio.h>
#include <arm_sve.h>
#include <arm_sme.h>
#define MASK_ZACC ((1 << 0) | (1 << 4))

#define ZACC 0

// Multiply matrixes A [M x K] and B [K x N] and store the result in C [M X N].
// A, B, and C are double arrays.
__attribute__((noinline)) __arm_new("za") void
GEMMKernel(float *matA, float *matB, float *matC,
           unsigned long M, unsigned long N, unsigned long K,
           float alpha) __arm_streaming {
  uint64_t vscale = svcntw();
  svbool_t pm, pn, pk;
  svfloat32_t src1, src2;

  // Divide C into multiple [vscale x vscale] tiles, with the (i, j) pair in
  // each iteration indicating the top-left coordinate of a tile in C.
  for (size_t i = 0; i < M; i += vscale) {
    pm = svwhilelt_b32_u32(i, M); // predicate for rows of matrixes A and C
    for (size_t j = 0; j < N; j += vscale) {
      pn = svwhilelt_b32_u32(j, N); // predicate for columns of matrixes B and C

      svzero_mask_za(MASK_ZACC);

      // The matrix multiplication of two [vscale x vscale] tiles is equal to
      // the sum of outer products of each column of the first tile and each row
      // of the second.
      for (size_t k = 0; k < K; k += vscale) {
        pk = svwhilelt_b32_u32(k, K); // predicate for columns of A and rows of B

        // Multiply columns of the [vscale x vscale] tile starting at A[k][i]
        // with rows of the tile starting at B[k][j].
        for (size_t t = 0; t < vscale; t++) {
          // Tiles along the right hand side of matrix A will only have (K %
          // vscale) columns. Tiles along the bottom of matrix B will only have
          // (K % vscale) rows. Exit early if we have reached the limit.
          if (k + t == K)
            break;

          // pm will prevent loading more column-wise elements than available
          // when loading the consecutive elements starting at A[k + t][i].
          src1 = svld1_f32(pm, matA + (k + t) * M + i);

          // pn will prevent extracting more row-wise elements than available
          // when loading the consecutive elements starting at B[k + t][j].
          src2 = svld1_f32(pn, matB + (k + t) * N + j);

          // Multiply with alpha.
          src2 = svmul_n_f32_m(pn, src2, alpha);

          // Accumulate the outer product of one column from a tile of A and
          // one row from a tile of B.
          svmopa_za32_f32_m(ZACC, pm, pn, src1, src2);
        }
      }

      // Copy the content of the accumulator tile, row-wise, into the
      // corresponding tile of C.
      for (size_t t = 0; t < vscale; t++) {
        // Tiles along the bottom of matrix C will only have (M % vscale) rows.
        // Exit early if we have stored all rows available.
        if (i + t == M)
          break;

        // pn will prevent storing more row-wise elements than necessary when
        // storing to consecutive elements starting at C[i][j].
        svst1_hor_za32(ZACC, t, pn, matC + (i + t) * N + j);
      }
    }
  }
}

FORTRAN混合调用ACLE接口示例

以下示例会使用FORTRAN混合调用ACLE的SME接口实现矩阵转置的操作。

首先需要在C文件中使用ACLE的SME接口实现矩阵转置的操作，C文件的实例如下：

//test.c
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <arm_sve.h>
#include <arm_sme.h>
 
__attribute__((noinline)) __arm_new("za") void
my_transpose2(uint32_t *matA, uint32_t M, uint32_t N, uint32_t *ans) __arm_streaming {
    uint64_t vscale = svcntw();
    svbool_t pm, pn;
    for (size_t i = 0; i < M; i += vscale) {
        pm = svwhilelt_b32_u32(i, M);
        for (size_t j = 0; j < N; j += vscale) {
            pn = svwhilelt_b32_u32(j, N);
            svzero_mask_za(0b00010001);
            for (size_t t = 0; t < vscale; t++) {
                if (j + t == N)
                    break;
                svld1_ver_za32(0, t, pm, matA + (j + t) * M + i);
            }
            for (size_t t = 0; t < vscale; t++) {
                if (i + t == M)
                    break;
                svst1_hor_za32(0, t, pn, ans + (i + t) * N + j);
            }
        }
    }
    return;
}
 
void my_transpose(uint32_t *matA, uint32_t *M, uint32_t *N, uint32_t *ans) {
    my_transpose2(matA, *M, *N, ans);
    return;
}

之后，可以在FORTRAN文件中通过bind(c)声明并调用在C文件中定义的接口函数"my_transpose"：

!test.f90
program main

    interface 
        subroutine my_transpose(n1,n2,n3,n4) bind (c)
                use iso_c_binding
                integer*4 :: n1(n2, n3)
                integer*4 :: n2
                integer*4 :: n3
                integer*4 :: n4(n3, n2)
        end subroutine
    end interface

    integer*4, dimension(5,7) :: a
    integer*4, dimension(7,5) :: b
    integer*4, dimension(7,5) :: c
    integer :: i, j

    do i = 1, 5
        do j = 1, 7
            a(i, j) = i+j*2
        end do
    end do

    print *, 'Matrix Multiplication: A Matrix'

    do i = lbound(a,1), ubound(a,1)
        write(*,*) (a(i,j), j = lbound(a,2), ubound(a,2))
    end do

    b = transpose(a)
    Print*, 'Matrix Multiplication: B Matrix'

    do i = lbound(b,1), ubound(b,1)
        write(*,*) (b(i,j), j = lbound(b,2), ubound(b,2))
    end do

    call my_transpose(a, 5, 7, c)
    Print*, 'Matrix Multiplication: C Matrix'

    do i = lbound(c,1), ubound(c,1)
        write(*,*) (c(i,j), j = lbound(c,2), ubound(c,2))
    end do

end program main

之后需要将C文件和F90文件分别编译成.o文件，再链接成可执行文件，编译命令如下：

clang test.c -c -o c.o -mcpu=hip11
flang test.f90 -c -o f.o -mcpu=hip11
flang c.o f.o -o a.out -mcpu=hip11 --rtlib=compiler-rt

父主题： SME优化指导