Feature Engineering

Scenarios

Feature engineering is the process of transforming raw data into model training data. It aims to extract the most relevant features so that a machine learning model can improve its accuracy. In big data scenarios, data can have up to tens of millions of feature dimensions, which means overfitting may occur if the number of samples is insufficient. In addition, an excessively large data volume degrades algorithm performance. Take the dimension-reduction method principal component analysis (PCA) as an example: about 99% of its compute time is spent invoking and running the underlying singular value decomposition (SVD) algorithm, which makes data analysis difficult in scenarios such as personalized recommendation, key object identification, and redundant information reduction. To rise to this challenge, Kunpeng BoostKit optimizes the restart technique and reduces the number of iterations, accelerating the convergence of the SVD algorithm and improving its adaptability to non-singular value decomposition, large singular value ranges, and high-dimensional data scenarios.

Principles

  • PCA

    Principal component analysis (PCA) is a popular data analysis method used for dimension reduction, feature extraction, anomaly detection, and more. Performing PCA on an m×n matrix A is to find the first k principal components [v_1, v_2, ..., v_k] of the matrix and their weights [s_1, s_2, ..., s_k].
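
As an illustration only (not part of the BoostKit API reference), the open-source Spark ML `PCA` estimator shows the idea: it reduces a small dense dataset from four dimensions to two. The column names `features` and `pcaFeatures` are choices made for this sketch.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("PcaExample").getOrCreate()

// Three 4-dimensional sample vectors; PCA reduces them to k = 2 components.
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0)),
  Tuple1(Vectors.dense(0.0, 1.0, 0.0, 6.0)),
  Tuple1(Vectors.dense(4.0, 3.0, 7.0, 5.0))
)).toDF("features")

val model = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(2).fit(df)
val reduced = model.transform(df)
```

The `model.pc` matrix holds the principal components [v_1, ..., v_k] as columns, and `model.explainedVariance` holds their relative weights.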

  • SPCA

    The principal component analysis for sparse matrix (SPCA) algorithm is used to perform PCA on a sparse matrix. It reduces the dimension of sparse data from n to k (k < n) and retains as much original information as possible.

  • SVD

    Singular value decomposition (SVD) is an important matrix decomposition technique in linear algebra. It is commonly used for extracting information in fields including bioinformatics, signal processing, finance, and statistics. In machine learning, SVD can be used for data compression, dimension reduction, recommendation systems, and natural language processing. Performing SVD on an m×n matrix A is to decompose it as A = U·S·V^T, where U (m×k) is the left singular matrix, V (n×k) is the right singular matrix, and S (k×k) is a diagonal matrix whose diagonal elements, called singular values, are arranged in descending order. The columns of both U and V are orthonormal.
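
For reference, open-source Spark MLlib exposes SVD on a distributed RowMatrix. This sketch (a toy 3×3 diagonal input and a local Spark session, both chosen for the demo) keeps the top k = 2 singular values:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("SvdExample").getOrCreate()

// Rows of a 3x3 diagonal matrix; its singular values are 3, 2, 1.
val rows = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 0.0),
  Vectors.dense(0.0, 2.0, 0.0),
  Vectors.dense(0.0, 0.0, 3.0)
))

val mat = new RowMatrix(rows)
// Keep the top k = 2 singular values; computeU = true also materializes U.
val svd = mat.computeSVD(2, computeU = true)
// svd.s: singular values in descending order; svd.U / svd.V: left / right factors.
```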

  • Covariance

    In probability theory and statistics, the Covariance algorithm measures the joint variability of two random variables. Variance is a special case of covariance: the covariance of a variable with itself.
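
A quick way to see this in practice (illustrative, using the open-source Spark SQL `stat.cov` helper rather than a BoostKit-specific API):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("CovExample").getOrCreate()
import spark.implicits._

// y = 2x, so cov(x, y) = 2 * var(x); for the sample {1, 2, 3}, var(x) = 1, hence cov(x, y) = 2.
val df = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)).toDF("x", "y")
val covXY = df.stat.cov("x", "y")
```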

  • Pearson

    The Pearson correlation coefficient measures the linear correlation between two variables X and Y in the field of statistics and natural sciences. The correlation value ranges from –1 to 1. +1 indicates perfect positive correlation, 0 indicates no correlation, and -1 indicates perfect negative correlation.
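
The definition can be written out directly as cov(X, Y) / (sd(X) · sd(Y)). This plain-Scala sketch (the helper name `pearson` is illustrative, not a BoostKit API) makes the ±1 endpoints concrete:

```scala
// Pearson correlation straight from its definition: cov(X, Y) / (sd(X) * sd(Y)).
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.nonEmpty, "inputs must be non-empty and equally sized")
  val n  = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

val rPos = pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)) // ≈ +1: perfect positive correlation
val rNeg = pearson(Seq(1.0, 2.0, 3.0), Seq(6.0, 4.0, 2.0)) // ≈ -1: perfect negative correlation
```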

  • Spearman

    The Spearman's rank correlation coefficient, denoted by the Greek letter ρ in statistics, is a non-parametric indicator that measures the dependency between two variables. It uses a monotonic equation to assess the correlation between two variables. If there are no duplicate values in the data and the two variables are completely monotonically correlated, the Spearman correlation coefficient is +1 or -1.
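
For tie-free data, ρ has the closed form 1 − 6·Σd_i² / (n(n² − 1)), where d_i is the difference between the ranks of x_i and y_i. A plain-Scala sketch (helper names are illustrative):

```scala
// Spearman's rho for duplicate-free data: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
def spearman(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.nonEmpty)
  def ranks(v: Seq[Double]): Seq[Double] =
    v.map(a => (v.count(_ < a) + 1).toDouble) // 1-based ranks; assumes no duplicate values
  val n = x.length
  val sumD2 = ranks(x).zip(ranks(y)).map { case (rx, ry) => (rx - ry) * (rx - ry) }.sum
  1.0 - 6.0 * sumD2 / (n * (n * n - 1.0))
}

// x and y = x^2 are monotonically but not linearly related: rho is still exactly 1.
val rho = spearman(Seq(1.0, 2.0, 3.0, 4.0), Seq(1.0, 4.0, 9.0, 16.0))
```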

  • IDF

    The inverse document frequency (IDF) algorithm measures how important a term is to a document set or a given document in a corpus. It is often used to mine keywords in documents and is widely used in the industry to clean text data.
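
As an illustration with the open-source Spark ML pipeline (Tokenizer → HashingTF → IDF; the column names and the feature dimension 32 are this sketch's choices), a term such as "spark" that appears in every document is weighted down to 0:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("IdfExample").getOrCreate()

val docs = spark.createDataFrame(Seq(
  (0L, "spark kunpeng boostkit"),
  (1L, "feature engineering with spark")
)).toDF("id", "sentence")

// Tokenize, hash terms into a 32-dimensional term-frequency vector, then rescale by IDF.
val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)
val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(32)
  .transform(words)
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
val rescaled = idfModel.transform(tf)
// Terms occurring in every document (here "spark") receive an IDF weight of 0.
```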

  • DTB

    Decision Tree Bucket (DTB) is a popular supervised data binning or discretization method based on a decision tree model. Data binning or discretization is a way to partition continuous data into discrete intervals. Discretization of numerical features maps finite elements in infinite space to finite space, effectively reducing the time and space overhead of subsequent algorithm processing.
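
The DTB API itself is not listed in this section. To illustrate the underlying idea with open-source Spark ML only (all names and parameters below are this sketch's assumptions, not the BoostKit interface): fit a shallow decision tree on (feature, label), harvest its split thresholds, and use them as supervised bin edges for a Bucketizer.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{Bucketizer, VectorAssembler}
import org.apache.spark.ml.tree.{ContinuousSplit, InternalNode, Node}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("DtbSketch").getOrCreate()

// Toy data: the label flips when the feature value crosses 5.
val df = spark.createDataFrame((1 to 10).map(i => (i.toDouble, if (i <= 5) 0.0 else 1.0)))
  .toDF("x", "label")
val assembled = new VectorAssembler().setInputCols(Array("x")).setOutputCol("features")
  .transform(df)

// 1. Fit a shallow tree; its internal-node thresholds are label-aware cut points.
val tree = new DecisionTreeClassifier().setMaxDepth(2).fit(assembled)

// 2. Collect every continuous-split threshold in the tree.
def thresholds(node: Node): Seq[Double] = node match {
  case n: InternalNode =>
    (n.split match { case s: ContinuousSplit => Seq(s.threshold); case _ => Nil }) ++
      thresholds(n.leftChild) ++ thresholds(n.rightChild)
  case _ => Nil
}

// 3. Use the thresholds as bin edges for discretization.
val splits = (Double.NegativeInfinity +: thresholds(tree.rootNode).sorted.distinct) :+
  Double.PositiveInfinity
val binned = new Bucketizer().setInputCol("x").setOutputCol("bin").setSplits(splits.toArray)
  .transform(df)
```

BoostKit's DTB provides this supervised binning natively and in a distributed fashion; the sketch above only mirrors the principle.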

  • Word2Vec

    The Word2Vec algorithm represents each word with a vector, so that relationships between words can be captured by distances between vectors. This is called a distributed representation. Beyond word embedding, the same idea extends to embedding categorical variables. Unlike one-hot encoding, Word2Vec extracts dense features of a fixed length that carry more context information, thereby improving the precision and performance of downstream algorithms. Spark's open-source Word2Vec algorithm takes a long time to iterate and converge on data with a large vocabulary; the distributed Word2Vec algorithm built on it is much more efficient.
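
A minimal usage sketch with the open-source Spark ML Word2Vec (the vector size of 3, the sample sentences, and the column names are arbitrary choices for the demo):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("W2vExample").getOrCreate()

val docs = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a 3-dimensional dense vector per word, then average the vectors of each document.
val model = new Word2Vec().setInputCol("text").setOutputCol("result")
  .setVectorSize(3).setMinCount(0).fit(docs)
val result = model.transform(docs)
```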

Programming Example

This example describes programming with the Pearson algorithm.

The Pearson algorithm uses ML APIs.

| Model API Type | Function API |
| --- | --- |
| ML API | def corr(dataset: Dataset[_], column: String): DataFrame |
| ML API | def corr(dataset: Dataset[_], column: String, method: String): DataFrame |

Figure 1 shows the time sequence of the Pearson algorithm.

Figure 1 Time sequence of the Pearson algorithm
  • Function description

    Output the Pearson correlation matrix after you input sample data in the Dataset format and call the corr API.

  • Input and output
    1. Package name: org.apache.spark.ml.stat
    2. Class name: Correlation
    3. Method name: corr
    4. Input: training sample data (Dataset[_]). The following are mandatory fields.

       | Parameter | Value Type | Description |
       | --- | --- | --- |
       | data | Dataset[Vector] | Matrix, stored by row |
       | column | String | Name of the vector column used for correlation matrix calculation |
       | method | String | Correlation method. The value can be spearman or pearson (default). |

    5. Parameters optimized based on native algorithms

       | Parameter | Value Type | Default Value | Description |
       | --- | --- | --- | --- |
       | method | String | pearson | Method for solving the correlation matrix. |

       Code API example:

           val mat = stat.Correlation.corr(data, "matrix")
    6. Output: Pearson correlation matrix

       | Parameter | Value Type | Description |
       | --- | --- | --- |
       | df | DataFrame | Pearson correlation matrix. The output column is named method(column), for example pearson(matrix). |

  • Example

    import org.apache.spark.ml.stat

    val mat1 = stat.Correlation.corr(data, "matrix")
    val mat2 = stat.Correlation.corr(data, "matrix", "pearson")
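
Putting the pieces together, the call above can be run end to end as follows (the sample values and the local Spark session are illustrative choices, not part of the API reference):

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").appName("PearsonExample").getOrCreate()

// Each row is one sample; the vector column is the matrix "stored by row".
val df = spark.createDataFrame(Seq(
  Vectors.dense(1.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 8.0),
  Vectors.dense(9.0, 0.0, 1.0)
).map(Tuple1.apply)).toDF("matrix")

// Pearson is the default method; pass "spearman" explicitly for rank correlation.
val Row(pearsonMat: Matrix) = stat.Correlation.corr(df, "matrix").head
val Row(spearmanMat: Matrix) = stat.Correlation.corr(df, "matrix", "spearman").head
```

The result is a 1-row DataFrame whose single cell holds the n×n correlation matrix of the input columns.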