Replacing the x86 pcmpestrm Assembly Instruction

Symptom

Error: unknown mnemonic 'pcmpestrm' -- 'pcmpestrm'

Cause

The pcmpestrm instruction belongs to the x86 SSE4.2 instruction set. It compares the bytes of str2 against the bytes of str1 according to the specified comparison mode and returns a result for each byte (a maximum of 16 bytes). As a typical x86 complex instruction, pcmpestrm performs complex string matching in a single instruction. The Kunpeng architecture has no equivalent instruction, so the same function must be implemented in C code on the Kunpeng platform.

For details about this instruction, see:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_2&expand=835

https://docs.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2010/bb514080(v=vs.100)

Procedure

The following code shows how Impala calls the pcmpestrm instruction. The instruction is wrapped in SSE4_cmpestrm, which is modeled on the Intel _mm_cmpestrm interface.

template<int MODE> 
static inline __m128i SSE4_cmpestrm(__m128i str1, int len1, __m128i str2, int len2) { 
#ifdef __clang__ 
  /// Use asm reg rather than Yz output constraint to workaround LLVM bug 13199 - 
  /// clang doesn't support Y-prefixed asm constraints. 
  register volatile __m128i result asm ("xmm0"); 
  __asm__ __volatile__ ("pcmpestrm %5, %2, %1": "=x"(result) : "x"(str1), "xm"(str2), "a"(len1), "d"(len2), "i"(MODE) : "cc"); 
#else 
  __m128i result; 
  __asm__ __volatile__ ("pcmpestrm %5, %2, %1": "=Yz"(result) : "x"(str1), "xm"(str2), "a"(len1), "d"(len2), "i"(MODE) : "cc"); 
#endif 
  return result; 
}

According to the instruction description, the behavior varies with the comparison mode, and implementing every mode would require a large amount of code. The calling code, however, uses only the mode PCMPSTR_EQUAL_ANY | PCMPSTR_UBYTE_OPS. In this mode, each byte of str2 (up to len2 bytes) is checked against the bytes of str1 (up to len1 bytes). If the byte appears in str1, the corresponding bit of the result is set to 1. Only this mode needs to be implemented.
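To make the mask semantics concrete, the following is a minimal portable sketch of this one mode, with no NEON or SSE dependency so that it can be checked on any host. The names equal_any_mask and banana_example are hypothetical and do not come from Impala. The sketch assumes non-negative lengths and caps them at 16, which is how the instruction treats larger byte lengths.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Impala): portable model of
// PCMPSTR_EQUAL_ANY | PCMPSTR_UBYTE_OPS with a bit-mask result.
// Bit i is set when str2[i] equals any of the first len1 bytes of str1.
// Assumption: lengths are non-negative (negative lengths are not modelled).
inline uint16_t equal_any_mask(const uint8_t* str1, int len1,
                               const uint8_t* str2, int len2) {
    // The instruction treats byte lengths above 16 as 16; mirror that here.
    if (len1 > 16) len1 = 16;
    if (len2 > 16) len2 = 16;
    uint16_t mask = 0;
    for (int i = 0; i < len2; ++i) {
        for (int j = 0; j < len1; ++j) {
            if (str1[j] == str2[i]) {
                mask |= static_cast<uint16_t>(1u << i);
                break;  // one match is enough for this position
            }
        }
    }
    return mask;
}

// Worked example: which bytes of "banana" are in the set {'a', 'n'}?
inline uint16_t banana_example() {
    const uint8_t set[] = {'a', 'n'};
    const uint8_t text[] = {'b', 'a', 'n', 'a', 'n', 'a'};
    uint16_t mask = equal_any_mask(set, 2, text, 6);
    assert(mask == 0x3E);  // bits 1..5 set, bit 0 ('b') clear
    return mask;
}
```

In the worked example, str1 plays the role of the character set and str2 the text being scanned, which matches the operand order of SSE4_cmpestrm above.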

Code implementation on the Kunpeng platform:

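#include <arm_neon.h>
#include <stdint.h>

// 16-byte union that exposes a NEON vector as individual bytes.
typedef union __attribute__((aligned(16))) __oword {
    int32x4_t m128i;
    uint8_t m128i_u8[16];
} __oword;

template <int MODE>
static inline uint16_t SSE4_cmpestrm(int32x4_t str1, int len1, int32x4_t str2, int len2) {
    __oword a, b;
    a.m128i = str1;
    b.m128i = str2;
    // pcmpestrm treats byte lengths above 16 as 16; clamp to stay inside the arrays.
    if (len1 > 16) len1 = 16;
    if (len2 > 16) len2 = 16;
    uint16_t result = 0;
    uint16_t i = 0;
    uint16_t j = 0;
    // Mode used in Impala: STRCHR_MODE = PCMPSTR_EQUAL_ANY | PCMPSTR_UBYTE_OPS
    // Set bit i of the result if byte i of str2 matches any byte of str1.
    for (i = 0; i < len2; i++)
    {
        for (j = 0; j < len1; j++)
        {
            if (a.m128i_u8[j] == b.m128i_u8[i])
            {
                result |= (1 << i);
            }
        }
    }
    return result;
}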
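Note that the x86 wrapper returns an __m128i, whereas the Kunpeng version returns the mask directly as a uint16_t, so calling code has to work with the 16-bit value. As an illustration only, the sketch below shows one way a caller might turn that mask into byte offsets. The helpers match_offsets and first_match are hypothetical names and are not taken from Impala.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: expand a 16-bit match mask into the byte
// offsets (0..15) whose bits are set, in ascending order.
inline std::vector<int> match_offsets(uint16_t mask) {
    std::vector<int> offsets;
    for (int i = 0; i < 16; ++i) {
        if (mask & (1u << i)) {
            offsets.push_back(i);
        }
    }
    return offsets;
}

// Hypothetical helper: offset of the first matching byte, or -1 if
// no bit is set.
inline int first_match(uint16_t mask) {
    for (int i = 0; i < 16; ++i) {
        if (mask & (1u << i)) {
            return i;
        }
    }
    return -1;
}

// Example: a mask of 0x3E means bytes 1 through 5 matched.
inline int first_match_example() {
    int pos = first_match(0x3E);
    assert(pos == 1);
    return pos;
}
```

A plain loop is used here for portability. A compiler builtin for counting trailing zeros could replace it where one is available.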

If no instruction is available to replace an x86 instruction, analyze what the instruction does and which of its functions the calling code actually needs, and then write the code accordingly. Do not copy the code directly.