Feature Engineering

Scenarios

Feature engineering is the process of transforming raw data into model training data. It aims to extract the most relevant features so that a machine learning model can improve its accuracy. In big data scenarios, data can have up to tens of millions of feature dimensions, which means overfitting may occur if the number of samples is insufficient. In addition, an excessively large data volume degrades algorithm performance. Take the dimension-reduction method principal component analysis (PCA) as an example: about 99% of its compute time is spent invoking and running the underlying singular value decomposition (SVD) algorithm, which makes data analysis difficult in scenarios such as personalized recommendation, key object identification, and redundant information reduction. To rise to this challenge, Kunpeng BoostKit optimizes the restart technique and reduces the number of iterations, accelerating the convergence of the SVD algorithm and improving its adaptability to non-singular value decomposition, large singular value ranges, and high-dimensional data scenarios.

Principles

  • PCA

    Principal component analysis (PCA) is a popular data analysis method used for dimension reduction, feature extraction, anomaly detection, and more. Performing PCA on an m×n matrix A is to find the first k principal components [v_1, v_2, ..., v_k] of the matrix and their weights [s_1, s_2, ..., s_k].
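
As an illustration only (not part of the BoostKit API reference), the open-source Spark ML `PCA` estimator shows the idea: it reduces a small dense dataset from four dimensions to two. The column names `features` and `pcaFeatures` are choices made for this sketch.

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("PcaExample").getOrCreate()

// Three 4-dimensional sample vectors; PCA reduces them to k = 2 components.
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0)),
  Tuple1(Vectors.dense(0.0, 1.0, 0.0, 6.0)),
  Tuple1(Vectors.dense(4.0, 3.0, 7.0, 5.0))
)).toDF("features")

val model = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(2).fit(df)
val reduced = model.transform(df)
```

The `model.pc` matrix holds the principal components [v_1, ..., v_k] as columns, and `model.explainedVariance` holds their relative weights.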

  • SPCA

    The principal component analysis for sparse matrix (SPCA) algorithm is used to perform PCA on a sparse matrix. It reduces the dimension of sparse data from n to k (k < n) and retains as much original information as possible.

  • SVD

    Singular value decomposition (SVD) is an important matrix decomposition technique in linear algebra. It is commonly used for extracting information in fields including bioinformatics, signal processing, finance, and statistics. In machine learning, SVD can be used for data compression, dimension reduction, recommendation systems, and natural language processing. Performing SVD on an m×n matrix A is to decompose it as A = U·S·V^T, where U (m×k) is the left singular matrix, V (n×k) is the right singular matrix, and S (k×k) is a diagonal matrix whose diagonal elements, called singular values, are arranged in descending order. The columns of both U and V are orthonormal.
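
For reference, open-source Spark MLlib exposes SVD on a distributed RowMatrix. This sketch (a toy 3×3 diagonal input and a local Spark session, both chosen for the demo) keeps the top k = 2 singular values:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("SvdExample").getOrCreate()

// Rows of a 3x3 diagonal matrix; its singular values are 3, 2, 1.
val rows = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 0.0),
  Vectors.dense(0.0, 2.0, 0.0),
  Vectors.dense(0.0, 0.0, 3.0)
))

val mat = new RowMatrix(rows)
// Keep the top k = 2 singular values; computeU = true also materializes U.
val svd = mat.computeSVD(2, computeU = true)
// svd.s: singular values in descending order; svd.U / svd.V: left / right factors.
```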

  • Covariance

    In probability theory and statistics, the Covariance algorithm measures the joint variability of two random variables. Variance is a special case of covariance: the covariance of a variable with itself.
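
A quick way to see this in practice (illustrative, using the open-source Spark SQL `stat.cov` helper rather than a BoostKit-specific API):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("CovExample").getOrCreate()
import spark.implicits._

// y = 2x, so cov(x, y) = 2 * var(x); for the sample {1, 2, 3}, var(x) = 1, hence cov(x, y) = 2.
val df = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)).toDF("x", "y")
val covXY = df.stat.cov("x", "y")
```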

  • Pearson

    The Pearson correlation coefficient measures the linear correlation between two variables X and Y in the field of statistics and natural sciences. The correlation value ranges from –1 to 1. +1 indicates perfect positive correlation, 0 indicates no correlation, and -1 indicates perfect negative correlation.
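
The definition can be written out directly as cov(X, Y) / (sd(X) · sd(Y)). This plain-Scala sketch (the helper name `pearson` is illustrative, not a BoostKit API) makes the ±1 endpoints concrete:

```scala
// Pearson correlation straight from its definition: cov(X, Y) / (sd(X) * sd(Y)).
def pearson(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.nonEmpty, "inputs must be non-empty and equally sized")
  val n  = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

val rPos = pearson(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0)) // ≈ +1: perfect positive correlation
val rNeg = pearson(Seq(1.0, 2.0, 3.0), Seq(6.0, 4.0, 2.0)) // ≈ -1: perfect negative correlation
```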

  • Spearman

    The Spearman's rank correlation coefficient, denoted by the Greek letter ρ in statistics, is a non-parametric indicator that measures the dependency between two variables. It uses a monotonic equation to assess the correlation between two variables. If there are no duplicate values in the data and the two variables are completely monotonically correlated, the Spearman correlation coefficient is +1 or -1.
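
For tie-free data, ρ has the closed form 1 − 6·Σd_i² / (n(n² − 1)), where d_i is the difference between the ranks of x_i and y_i. A plain-Scala sketch (helper names are illustrative):

```scala
// Spearman's rho for duplicate-free data: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
def spearman(x: Seq[Double], y: Seq[Double]): Double = {
  require(x.length == y.length && x.nonEmpty)
  def ranks(v: Seq[Double]): Seq[Double] =
    v.map(a => (v.count(_ < a) + 1).toDouble) // 1-based ranks; assumes no duplicate values
  val n = x.length
  val sumD2 = ranks(x).zip(ranks(y)).map { case (rx, ry) => (rx - ry) * (rx - ry) }.sum
  1.0 - 6.0 * sumD2 / (n * (n * n - 1.0))
}

// x and y = x^2 are monotonically but not linearly related: rho is still exactly 1.
val rho = spearman(Seq(1.0, 2.0, 3.0, 4.0), Seq(1.0, 4.0, 9.0, 16.0))
```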

  • IDF

    The inverse document frequency (IDF) algorithm measures how important a term is to a document set or a given document in a corpus. It is often used to mine keywords in documents and is widely used in the industry to clean text data.
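
As an illustration with the open-source Spark ML pipeline (Tokenizer → HashingTF → IDF; the column names and the feature dimension 32 are this sketch's choices), a term such as "spark" that appears in every document is weighted down to 0:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("IdfExample").getOrCreate()

val docs = spark.createDataFrame(Seq(
  (0L, "spark kunpeng boostkit"),
  (1L, "feature engineering with spark")
)).toDF("id", "sentence")

// Tokenize, hash terms into a 32-dimensional term-frequency vector, then rescale by IDF.
val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)
val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(32)
  .transform(words)
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
val rescaled = idfModel.transform(tf)
// Terms occurring in every document (here "spark") receive an IDF weight of 0.
```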

  • DTB

    Decision Tree Bucket (DTB) is a popular supervised data binning or discretization method based on a decision tree model. Data binning or discretization is a way to partition continuous data into discrete intervals. Discretization of numerical features maps finite elements in infinite space to finite space, effectively reducing the time and space overhead of subsequent algorithm processing.
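
The DTB API itself is not listed in this section. To illustrate the underlying idea with open-source Spark ML only (all names and parameters below are this sketch's assumptions, not the BoostKit interface): fit a shallow decision tree on (feature, label), harvest its split thresholds, and use them as supervised bin edges for a Bucketizer.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{Bucketizer, VectorAssembler}
import org.apache.spark.ml.tree.{ContinuousSplit, InternalNode, Node}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("DtbSketch").getOrCreate()

// Toy data: the label flips when the feature value crosses 5.
val df = spark.createDataFrame((1 to 10).map(i => (i.toDouble, if (i <= 5) 0.0 else 1.0)))
  .toDF("x", "label")
val assembled = new VectorAssembler().setInputCols(Array("x")).setOutputCol("features")
  .transform(df)

// 1. Fit a shallow tree; its internal-node thresholds are label-aware cut points.
val tree = new DecisionTreeClassifier().setMaxDepth(2).fit(assembled)

// 2. Collect every continuous-split threshold in the tree.
def thresholds(node: Node): Seq[Double] = node match {
  case n: InternalNode =>
    (n.split match { case s: ContinuousSplit => Seq(s.threshold); case _ => Nil }) ++
      thresholds(n.leftChild) ++ thresholds(n.rightChild)
  case _ => Nil
}

// 3. Use the thresholds as bin edges for discretization.
val splits = (Double.NegativeInfinity +: thresholds(tree.rootNode).sorted.distinct) :+
  Double.PositiveInfinity
val binned = new Bucketizer().setInputCol("x").setOutputCol("bin").setSplits(splits.toArray)
  .transform(df)
```

BoostKit's DTB provides this supervised binning natively and in a distributed fashion; the sketch above only mirrors the principle.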

  • Word2Vec

    The Word2Vec algorithm represents each word with a vector, so that relationships between words can be captured by distances between vectors. This is called a distributed representation. Beyond word embedding, the same idea extends to embedding categorical variables. Unlike one-hot encoding, Word2Vec extracts dense features of a fixed length that carry more context information, thereby improving the precision and performance of downstream algorithms. Spark's open-source Word2Vec algorithm takes a long time to iterate and converge on data with a large vocabulary; the distributed Word2Vec algorithm built on it is much more efficient.
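
A minimal usage sketch with the open-source Spark ML Word2Vec (the vector size of 3, the sample sentences, and the column names are arbitrary choices for the demo):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("W2vExample").getOrCreate()

val docs = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a 3-dimensional dense vector per word, then average the vectors of each document.
val model = new Word2Vec().setInputCol("text").setOutputCol("result")
  .setVectorSize(3).setMinCount(0).fit(docs)
val result = model.transform(docs)
```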

Programming Example

This example describes programming with the Pearson algorithm.

The Pearson algorithm uses ML APIs.

| Model API Type | Function API |
| --- | --- |
| ML API | def corr(dataset: Dataset[_], column: String): DataFrame |
| ML API | def corr(dataset: Dataset[_], column: String, method: String): DataFrame |

Figure 1 shows the time sequence of the Pearson algorithm.

Figure 1 Time sequence of the Pearson algorithm
  • Function description

    Output the Pearson correlation matrix after you input sample data in the Dataset format and call the corr API.

  • Input and output
    1. Package name: org.apache.spark.ml.stat
    2. Class name: Correlation
    3. Method name: corr
    4. Input: training sample data (Dataset[_]). The following are mandatory fields.

       | Parameter | Value Type | Description |
       | --- | --- | --- |
       | data | Dataset[Vector] | Matrix, stored by row |
       | column | String | Name of the vector column used for correlation matrix calculation |
       | method | String | Correlation method. The value can be spearman or pearson (default). |

    5. Parameters optimized based on native algorithms

       | Parameter | Value Type | Default Value | Description |
       | --- | --- | --- | --- |
       | method | String | pearson | Method for solving the correlation matrix. |

       Code API example:

           val mat = stat.Correlation.corr(data, "matrix")
    6. Output: Pearson correlation matrix

       | Parameter | Value Type | Description |
       | --- | --- | --- |
       | df | DataFrame | Pearson correlation matrix. The output column is named method(column), for example pearson(matrix). |

  • Example

    import org.apache.spark.ml.stat

    val mat1 = stat.Correlation.corr(data, "matrix")
    val mat2 = stat.Correlation.corr(data, "matrix", "pearson")
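
Putting the pieces together, the call above can be run end to end as follows (the sample values and the local Spark session are illustrative choices, not part of the API reference):

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[1]").appName("PearsonExample").getOrCreate()

// Each row is one sample; the vector column is the matrix "stored by row".
val df = spark.createDataFrame(Seq(
  Vectors.dense(1.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 8.0),
  Vectors.dense(9.0, 0.0, 1.0)
).map(Tuple1.apply)).toDF("matrix")

// Pearson is the default method; pass "spearman" explicitly for rank correlation.
val Row(pearsonMat: Matrix) = stat.Correlation.corr(df, "matrix").head
val Row(spearmanMat: Matrix) = stat.Correlation.corr(df, "matrix", "spearman").head
```

The result is a 1-row DataFrame whose single cell holds the n×n correlation matrix of the input columns.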