Introduction
This document describes how to deploy a benchmarking system on the openEuler operating system (OS) running on Kunpeng 920 processors to measure the inference performance of search and recommendation models. It covers server and client environment setup and performance evaluation during the inference phase.
Models
ModelZoo is a collection of search and recommendation models. Currently, it includes five models: Wide_and_Deep, Deep Learning Recommendation Model (DLRM), Factorization Machine with Deep Neural Network (DeepFM), Domain Facilitated Feature Modeling (DFFM), and Deep Structured Semantic Model (DSSM).
Wide_and_Deep
The Wide_and_Deep model is a machine learning architecture proposed by Google for recommendation systems. It combines the strengths of a wide component (a linear model) and a deep component (a deep neural network). The linear model captures explicit relationships in sparse data by memorizing known feature combinations, while the deep neural network learns new potential feature interactions through generalization. This architecture can process both high-dimensional sparse features and low-dimensional dense features to facilitate personalized recommendation, and it is applicable to scenarios such as ad click-through rate (CTR) estimation.

- Wide component: Process cross-product transformations of sparse features through a linear layer.
- Deep component: Transform categorical and ID-based sparse features, represented by one-hot encoding, into low-dimensional vectors through an embedding layer. Feed these vectors into a multilayer perceptron (MLP) together with normalized dense features such as age and income.
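As an illustration of how the two components combine, the following plain-Python sketch joins a linear wide part and a tiny MLP deep part into one prediction. All weights and inputs are hypothetical toy values, not taken from the real model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def wide_part(cross_features, weights):
    # Linear layer over cross-product transformed sparse features.
    return sum(f * w for f, w in zip(cross_features, weights))

def deep_part(dense_inputs, layers):
    # Tiny MLP: each layer is (weight_rows, biases) with a ReLU activation.
    x = dense_inputs
    for rows, biases in layers:
        x = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
             for row, b in zip(rows, biases)]
    return x[0]

cross = [1.0, 0.0, 1.0]              # one-hot cross-product features (toy)
dense = [0.5, -0.2]                  # normalized dense features, e.g. age, income
layers = [([[0.3, -0.1], [0.2, 0.4]], [0.0, 0.1]),   # 2 inputs -> 2 units
          ([[0.5, -0.3]], [0.0])]                    # 2 inputs -> 1 unit
logit = wide_part(cross, [0.2, 0.1, -0.4]) + deep_part(dense, layers)
prob = sigmoid(logit)                # joint wide + deep prediction in (0, 1)
```

The two outputs are summed into a single logit and passed through a sigmoid, mirroring the joint objective of the original architecture.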
DLRM
DLRM is a deep learning recommendation model proposed by Facebook. This model is designed to process sparse features. It uses the embedding layer to convert high-dimensional sparse features into low-dimensional dense vectors, and captures complex relationships between features through the interaction layer. DLRM combines low-order and high-order feature interaction, uses the dot product to calculate feature combinations, and outputs prediction results through the multi-layer perceptron (MLP). DLRM is widely used in personalized services such as advertising and recommendation.
DLRM handles two categories of features. The first is discrete features of the category and ID types, which are usually one-hot encoded to generate sparse features. The second is numeric continuous features. Discrete features become particularly sparse after one-hot encoding, which makes them unsuitable for a deep learning model to learn from. Therefore, the discrete features are generally mapped to dense continuous values through embeddings.
After the embeddings are applied, all features, including discrete features and continuous features, can be further converted through the MLP, as shown in the triangle part in Figure 2. The features processed by the MLP then enter the interaction layer for feature crossing. The interaction layer takes the dot product on every two of the embedding results to implement feature crossing. Then, the crossed features are combined with the previous embedding results and sent to the MLP for the final output.
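The pairwise dot-product crossing performed by the interaction layer can be sketched as follows. The embedding values are hypothetical placeholders for the outputs of the embedding layer and bottom MLP:

```python
# DLRM-style interaction layer: dot product of every pair of embeddings (i < j).
def interact(embeddings):
    out = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            out.append(sum(a * b for a, b in zip(embeddings[i], embeddings[j])))
    return out

# Three 4-dimensional embeddings (one per feature) -> 3 pairwise crossings.
embs = [[1.0, 0.0, 2.0, 1.0],
        [0.5, 1.0, 0.0, 2.0],
        [1.0, 1.0, 1.0, 1.0]]
crossed = interact(embs)   # [e0 . e1, e0 . e2, e1 . e2]
# "crossed" is then concatenated with the embeddings and fed to the top MLP.
```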
DeepFM
DeepFM is a CTR model proposed in 2017. It is a recommendation system model that integrates a factorization machine (FM) and deep neural networks (DNNs). The model automates feature combination learning, removing the burden of manual feature engineering. Its FM component effectively captures second-order feature combinations, while the DNN component explores high-order feature crosses. DeepFM performs well on sparse data: it can memorize known feature combinations and generalize to new ones, which makes it applicable to scenarios such as CTR estimation and personalized recommendation.

Similar to other methods, one-hot encoding is performed on the sparse features, and then the sparse features are input into the embedding layer, while the dense features are normalized.
- FM:
  - Linear part: Weighted summation of raw features.
  - Second-order crossing: Second-order crosses between all feature pairs are captured through the inner product.
- DNN: MLP is used to extract high-order feature representations.
- Output prediction: Combine the outputs of FM and DNN, and generate the final recommendation probability or regression value.
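The FM second-order crossing described above can be sketched in plain Python. The snippet computes the term two ways: the naive sum over pairwise inner products, and the equivalent "square of sum minus sum of squares" identity commonly used to reduce the cost to O(kn). The feature values and latent vectors are hypothetical:

```python
# Naive O(k * n^2) form: sum of pairwise inner products of latent vectors.
def fm_pairwise(x, v):
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total

# Equivalent O(k * n) form: 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2).
def fm_fast(x, v):
    k = len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (s * s - s_sq)
    return total

x = [1.0, 0.0, 2.0]                        # sparse feature values (toy)
v = [[0.1, 0.2], [0.3, -0.1], [0.2, 0.4]]  # latent vectors, k = 2 (toy)
assert abs(fm_pairwise(x, v) - fm_fast(x, v)) < 1e-9
```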
DFFM
DFFM is an enhanced recommendation algorithm that integrates domain awareness and feature modeling. By introducing domain information, DFFM emphasizes the importance of features from different domains in addition to modeling the crossing between features. The model uses a deep learning architecture to accurately capture user preferences and behavior patterns when processing cross-domain data, improving the accuracy and personalization of the recommendation system. It is especially applicable to multi-domain or cross-platform recommendation scenarios.

Features are classified into domain-agnostic feature Ea, domain-specific feature Ed, target item feature Et, and historical behavior feature Eh.
Domain-enhanced inner product processing is performed on Ea and Ed, and the results are fed into fully connected (FC) layers to generate domain-enhanced features. Ed and Et are then concatenated, and an attention weighting operation between Eh and the concatenated result generates the domain facilitated user behavior (DFUB) features. Finally, the domain-enhanced features and DFUB features are concatenated and fed into FC layers to produce the final result.
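A heavily simplified sketch of the attention weighting step is shown below: a query built from Ed and Et scores each historical behavior embedding in Eh, and a softmax-weighted sum pools them into a DFUB-style feature. The vectors and the query construction are hypothetical stand-ins for the model's learned projections:

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(query, history):
    # Score each behavior by its dot product with the query, then
    # pool the behaviors with the resulting softmax weights.
    weights = softmax([sum(q * h for q, h in zip(query, beh)) for beh in history])
    dim = len(history[0])
    return [sum(w * beh[d] for w, beh in zip(weights, history)) for d in range(dim)]

ed, et = [0.2, 0.1], [0.5, -0.3]             # toy domain and target-item features
query = [a + b for a, b in zip(ed, et)]      # simplistic stand-in for concat + projection
history = [[0.1, 0.4], [0.3, 0.2], [0.0, 0.1]]  # toy behavior embeddings (Eh)
dfub = attention_pool(query, history)        # weighted summary of user behaviors
```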
DSSM
DSSM is a semantic model based on deep networks. It predicts the CTR by mapping user features and item features into a semantic space of a common dimension and calculating their similarity.

After both the user features and item features pass through their embedding layers, the DNNs generate vector representations in the common-dimension semantic space, and the similarity of the two vectors is then calculated.
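The similarity calculation can be sketched as a cosine similarity between the two tower outputs. The vectors below are hypothetical stand-ins for the DNN outputs:

```python
import math

# Cosine similarity: dot product of the vectors divided by their norms.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

user_vec = [0.6, 0.8, 0.0]        # output of the user-side DNN (toy)
item_vec = [0.8, 0.6, 0.0]        # output of the item-side DNN (toy)
score = cosine(user_vec, item_vec)  # similarity in [-1, 1], higher = better match
```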
Test Procedure

