Feature Engineering
Scenarios
Feature engineering is the process of transforming raw data into model training data. It aims to extract the most relevant features so that a machine learning model can achieve higher accuracy. In big data scenarios, data can have tens of millions of feature dimensions, which means overfitting may occur if the number of samples is insufficient. In addition, an excessively large data volume degrades algorithm performance. Take the dimensionality-reduction method principal component analysis (PCA) as an example: it is found that 99% of the compute time is spent invoking and running the underlying singular value decomposition (SVD) algorithm, which makes data analysis difficult in scenarios such as personalized recommendation, key object identification, and redundant information reduction. To rise to this challenge, Kunpeng BoostKit optimizes the restart technique and reduces the number of iterations, accelerating the convergence of the SVD algorithm and improving its adaptability in scenarios involving non-square matrices, wide singular value ranges, and high-dimensional data.
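To illustrate why SVD dominates PCA's cost, here is a minimal NumPy sketch of PCA via truncated SVD on synthetic data (the data, sizes, and component count are illustrative assumptions, not BoostKit's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 samples, 50 features (synthetic)
Xc = X - X.mean(axis=0)           # center the data, as PCA requires

# The dominant cost of PCA is this single SVD call on the centered matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 5                             # keep the top-5 principal components
X_reduced = Xc @ Vt[:k].T         # project samples onto leading components
print(X_reduced.shape)            # reduced to (200, 5)
```

Speeding up or restarting the SVD step therefore accelerates the whole PCA pipeline, since every other step above is a cheap matrix operation.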

Principles
- DTB
Decision Tree Bucket (DTB) is a popular supervised binning (discretization) method based on a decision tree model. Binning, or discretization, partitions continuous data into discrete intervals. Discretizing numerical features maps values from a continuous, effectively infinite space onto a finite set of intervals, effectively reducing the time and space overhead of subsequent algorithm processing.
- Word2Vec
The Word2Vec algorithm represents each word as a vector, so that the relationship between words can be represented by the distance between vectors. This is called distributed representation. Beyond word embedding, embeddings can also encode categorical variables. Unlike one-hot encoding, Word2Vec extracts dense features of a fixed length that carry more context information, thereby improving the precision and performance of downstream algorithms. Spark's open-source Word2Vec implementation spends a long time iterating to convergence on data with a large vocabulary; the distributed Word2Vec algorithm based on it is much more efficient.
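The idea behind supervised tree-based binning can be sketched with a tiny recursive splitter that picks thresholds minimizing the weighted target variance, exactly as a depth-limited regression tree would (this is a simplified stand-in for DTB on synthetic data, not the BoostKit implementation):

```python
import numpy as np

def tree_split_points(x, y, depth):
    """Recursively pick thresholds that minimize weighted target variance."""
    if depth == 0 or len(x) < 4 or np.var(y) == 0:
        return []                        # nothing left to gain from splitting
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cost, best_i = None, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                     # cannot split between equal values
        cost = i * y[:i].var() + (len(y) - i) * y[i:].var()
        if best_cost is None or cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return []
    t = (x[best_i - 1] + x[best_i]) / 2  # midpoint threshold, as in CART
    return (tree_split_points(x[:best_i], y[:best_i], depth - 1)
            + [t]
            + tree_split_points(x[best_i:], y[best_i:], depth - 1))

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = (x > 3).astype(float) + (x > 7)      # target steps at 3 and 7
cuts = sorted(tree_split_points(x, y, depth=2))
bins = np.digitize(x, cuts)              # discrete bucket index per value
```

Because the target changes only at 3 and 7, the tree recovers thresholds near those points, so the continuous feature collapses into three informative buckets instead of arbitrary equal-width bins.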
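The skip-gram-with-negative-sampling training loop at the heart of Word2Vec can be sketched in NumPy on a toy corpus (corpus, dimensions, and hyperparameters are illustrative assumptions; this is not Spark's or BoostKit's implementation):

```python
import numpy as np

corpus = [
    "apple orange fruit".split(), "orange apple fruit".split(),
    "car truck road".split(), "truck car road".split(),
] * 50

vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
dim = 16
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.05
for epoch in range(30):
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in (i - 1, i + 1):                   # context window of 1
                if 0 <= j < len(sent):
                    center, ctx = idx[w], idx[sent[j]]
                    neg = rng.integers(len(vocab))     # one negative sample
                    for tgt, lbl in ((ctx, 1.0), (neg, 0.0)):
                        score = sigmoid(W_in[center] @ W_out[tgt])
                        grad = score - lbl             # logistic-loss gradient
                        g_in = grad * W_out[tgt]
                        W_out[tgt] -= lr * grad * W_in[center]
                        W_in[center] -= lr * g_in

def cos(a, b):
    va, vb = W_in[idx[a]], W_in[idx[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
```

After training, words that share contexts (such as "apple" and "orange" here) end up closer in the vector space than unrelated words, which is the distributed-representation property the text describes. The iteration cost of this loop grows with vocabulary and corpus size, which is why a distributed implementation matters at scale.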