
Principles

  • GBDT

    Gradient boosting decision tree (GBDT) is a popular decision tree–based ensemble algorithm used for classification and regression tasks. It iteratively trains decision trees to minimize a loss function. Spark GBDT enables binary classification and regression, supports continuous features and categorical features, and uses distributed computing for training and inference in big data scenarios.
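    As an illustration of the boosting loop described above, here is a minimal pure-Python sketch: depth-1 regression trees (stumps) are fitted iteratively to the residuals, which are the negative gradient of the squared loss. This is a 1-D toy, not the distributed Spark implementation; the learning rate and tree count are arbitrary.

```python
# Toy gradient boosting for regression with decision stumps and squared
# loss. 1-D features only; learning rate and tree count are arbitrary.

def fit_stump(xs, residuals):
    """Pick the split threshold minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbdt_predict(base, trees, lr, x):
    return base + lr * sum(tree(x) for tree in trees)

def gbdt_fit(xs, ys, n_trees=20, lr=0.5):
    base = sum(ys) / len(ys)          # constant initial prediction
    trees = []
    for _ in range(n_trees):
        preds = [gbdt_predict(base, trees, lr, x) for x in xs]
        # residuals = negative gradient of the squared loss
        residuals = [y - p for y, p in zip(ys, preds)]
        trees.append(fit_stump(xs, residuals))
    return base, trees
```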

  • Random Forest

    The Random Forest algorithm trains multiple decision trees in parallel to obtain a classification or regression model from given sample data that includes feature vectors and label values. Given input feature vectors, the output model predicts the most probable label value.

  • SVM

    Support vector machine (SVM) is a generalized linear classifier that performs binary classification in a supervised learning manner. Its decision boundary is the maximum-margin hyperplane over the training samples. SVM is a sparse and robust classifier: it uses the hinge loss function to measure empirical risk and adds a regularization term to the optimization objective to control structural risk. The LinearSVC algorithm of Spark introduces two optimization policies: reducing the number of invocations of the f function (the distributed computation of the loss and gradient of the objective function) through algorithm-level optimization, and accelerating convergence with momentum-based parameter updates.

  • K-means

    The K-means clustering (K-means) algorithm is derived from a vector quantization method in signal processing and is now popular in data mining as a cluster analysis method. K-means clustering aims to divide n points (each an observation or an instance of a sample) into k clusters so that each point belongs to the cluster whose mean value (the cluster center) is closest to it. The resulting partition of the data space corresponds to a set of Voronoi cells.
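    The assignment/update iteration described above (Lloyd's algorithm) can be sketched in a few lines of Python. The initial centers are supplied by the caller for determinism; real implementations use smarter seeding such as k-means++.

```python
# Lloyd's iteration for K-means. Initial centers come from the caller
# for determinism; production code would use k-means++ seeding.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else ctr
                   for cl, ctr in zip(clusters, centers)]
    return centers
```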

  • Decision Tree

    The Decision Tree algorithm is widely used in fields such as machine learning and computer vision for classification and regression. It trains a binary tree to obtain a classification or regression model from given sample data that contains feature vectors and label values. Given input feature vectors, the output model predicts the most probable label value.

  • Linear Regression

    Regression algorithms are supervised learning algorithms used to find possible relationships between the independent variable X and the observable variable Y. If the observable variable is continuous, the task is called regression. In machine learning, Linear Regression uses a linear model to describe the relationship between X and Y; the unknown model parameters are estimated from the training data.
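    For a single feature, the least-squares parameter estimates have a closed form (slope = cov(x, y) / var(x)); a small sketch, not a production solver:

```python
# Closed-form ordinary least squares for one feature: slope is
# cov(x, y) / var(x); the intercept follows from the means.

def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept
```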

  • Logistic Regression

    Although the Logistic Regression algorithm has "regression" in its name, it is actually a classification method. It uses a linear model, passed through the logistic function, to model the relationship between the independent variable X and the class label Y. The unknown model parameters are estimated from the training data.
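    A toy sketch of estimating the parameters by gradient descent on the log loss; the step size and iteration count are arbitrary choices for this 1-D example:

```python
# Logistic regression on one feature, trained by batch gradient descent
# on the log loss. Step size and iteration count are arbitrary.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logreg_fit(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # gradients of the average log loss w.r.t. w and b
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w, b = w - lr * gw, b - lr * gb
    return w, b
```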

  • PCA

    Principal component analysis (PCA) is a popular data analysis method used for dimension reduction, feature extraction, anomaly detection, and more. Performing PCA on an m×n matrix A is to find its first k principal components [v_1, v_2, ..., v_k] and their weights [s_1, s_2, ..., s_k].
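    The first principal component can be illustrated with power iteration on the sample covariance matrix; this 2-D pure-Python sketch only shows the idea, not how large-scale PCA is computed:

```python
# Power iteration on the 2x2 sample covariance matrix to find the first
# principal component; a didactic sketch, not a scalable PCA.
import math

def first_component(points, iters=50):
    n = len(points)
    means = [sum(col) / n for col in zip(*points)]
    centered = [[x - m for x, m in zip(p, means)] for p in points]
    # 2x2 sample covariance matrix of the centered data
    cov = [[sum(a[i] * a[j] for a in centered) / (n - 1) for j in range(2)]
           for i in range(2)]
    v = [1.0, 0.0]
    for _ in range(iters):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]   # renormalize each step
    return v
```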

  • SPCA

    The principal component analysis for sparse matrix (SPCA) algorithm is used to perform PCA on a sparse matrix. It reduces the dimension of sparse data from n to k (k < n) and retains as much original information as possible.

  • SVD

    Singular value decomposition (SVD) is an important matrix decomposition technique in linear algebra. It is commonly used for extracting information in fields including bioinformatics, signal processing, finance, and statistics. In machine learning, SVD can be used for data compression, dimension reduction, recommendation systems, and natural language processing. Performing SVD on an m×n matrix A decomposes it as A = U·S·Vᵀ, where U is an m×k left singular matrix, V is an n×k right singular matrix, and S is a k×k diagonal matrix whose diagonal elements, arranged in descending order, are the singular values. Both U and V have orthonormal columns.

  • LDA

    Latent Dirichlet Allocation (LDA) is a topic model that generates topics from a set of documents. It is a three-level hierarchical Bayesian model over documents, topics, and words. LDA is an unsupervised machine learning technique that uses distributed computing for training and inference in big data scenarios.

  • PrefixSpan

    PrefixSpan is a typical algorithm for frequent pattern mining. It mines frequent sequences that meet a minimum support threshold. PrefixSpan is efficient because it generates no candidate sequences, its projected databases shrink quickly, and its memory usage stays stable during frequent sequential pattern mining.

  • ALS

    Alternating least squares (ALS) is a collaborative filtering algorithm for recommendation. It factorizes the user-item rating matrix into low-rank user and item factor matrices by alternately fixing one factor matrix and solving a least-squares problem for the other.

  • KNN

    K-nearest neighbors (KNN) is a non-parametric algorithm in machine learning that is used to find k samples closest to a given sample. It can be used for classification, regression, and information retrieval.
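    A brute-force sketch of KNN classification: compute the distance from the query to every training sample and take a majority vote among the k nearest labels. Real KNN libraries use index structures (for example, KD-trees) instead of a full scan.

```python
# Brute-force k-nearest neighbors: rank training samples by distance to
# the query and return the majority label among the k closest.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs, point being a coordinate tuple."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```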

  • Covariance

    The Covariance algorithm measures the joint change degree of two random variables in probability theory and statistics. Variance is a special case of covariance, that is, the covariance between a variable and itself.
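    A direct implementation of the sample covariance definition; as stated above, covariance(xs, xs) reproduces the sample variance.

```python
# Sample covariance from its definition; covariance(xs, xs) equals the
# sample variance, since Var(X) = Cov(X, X).

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```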

  • DBSCAN

    Density-based spatial clustering of applications with noise (DBSCAN) is a density-based spatial clustering algorithm: a neighborhood of a given radius must contain at least a given threshold number of objects for its center to belong to a cluster. DBSCAN handles noise effectively and can discover clusters of arbitrary shape.
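    A compact sketch of the DBSCAN procedure: points whose eps-neighborhood holds at least min_pts points (themselves included) are core points, clusters grow outward from them, and unreachable points are labeled noise (-1). Parameter names are illustrative.

```python
# A compact DBSCAN sketch: core points have at least min_pts neighbors
# within eps; clusters expand from core points; noise is labeled -1.
import math

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    cluster = -1

    def region(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # tentatively noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbors = region(j)
            if len(neighbors) >= min_pts:   # j is a core point: expand
                queue.extend(n for n in neighbors if labels[n] is None)
    return labels
```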

  • Pearson

    The Pearson correlation coefficient measures the linear correlation between two variables X and Y in statistics and the natural sciences. Its value ranges from -1 to +1: +1 indicates perfect positive correlation, 0 indicates no linear correlation, and -1 indicates perfect negative correlation.
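    The coefficient follows directly from its definition, r = cov(X, Y) / (σ_X σ_Y); a small sketch:

```python
# Pearson's r from its definition: centered cross-products divided by
# the product of the standard deviations.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```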

  • Spearman

    The Spearman's rank correlation coefficient, denoted by the Greek letter ρ in statistics, is a non-parametric indicator that measures the dependency between two variables. It uses a monotonic equation to assess the correlation between two variables. If there are no duplicate values in the data and the two variables are completely monotonically correlated, the Spearman correlation coefficient is +1 or -1.
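    Spearman's ρ is the Pearson correlation computed on the ranks of the data; the sketch below assumes no tied values, matching the statement above:

```python
# Spearman's rho as the Pearson correlation of the ranks.
# Assumes no tied values (each value maps to a unique rank).
import math

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```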

  • XGBoost

    XGBoost is a highly optimized distributed gradient boosting library that is efficient, flexible, and portable. It implements machine learning algorithms in the gradient boosting framework and provides a parallel tree boosting algorithm that solves many data science problems quickly and accurately.

  • IDF

    The inverse document frequency (IDF) algorithm measures how important a term is to a document within a corpus or document set. It is often used to mine keywords in documents and to clean text data in industry.
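    A toy sketch using the classic form idf(t) = log(N / df(t)), where df(t) is the number of documents containing term t (smoothed variants are also common); terms appearing in few documents score higher:

```python
# IDF over a toy corpus: idf(t) = log(N / df(t)), with df(t) the number
# of documents containing term t. Rarer terms get higher weights.
import math

def idf(docs):
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc.split()):   # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}
```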

  • SimRank

    SimRank is a similarity measure that applies to any domain that has object-to-object relationships. It measures the similarity between objects based on their relationships with other objects. The similarity is obtained by iteratively solving SimRank equations.

  • DTB

    Decision Tree Bucket (DTB) is a popular supervised data binning (discretization) method based on a decision tree model. Binning partitions continuous data into discrete intervals; discretizing numerical features maps values from an infinite continuous space to a finite set of intervals, effectively reducing the time and space overhead of subsequent algorithm processing.

  • Word2Vec

    The Word2Vec algorithm represents each word as a vector so that relationships between words can be captured by distances between vectors; this is called distributed representation. Beyond word embedding, the same idea applies to encoding categorical variables. Unlike one-hot encoding, Word2Vec produces dense, fixed-length features that carry context information, which improves the precision and performance of downstream algorithms. Spark's open-source Word2Vec algorithm takes a long time to iterate and converge on data with a large vocabulary; the distributed Word2Vec algorithm based on it is much more efficient.