
Principles

This section describes the algorithm principles.

  • SVM

    Support vector machine (SVM) is a generalized linear classifier that performs supervised binary classification. Its decision boundary is the maximum-margin hyperplane learned from the training samples. SVM is a sparse and robust classifier: it uses the hinge loss function to compute the empirical risk and adds a regularization term to the objective to control the structural risk. The LinearSVC algorithm of Spark introduces two optimization policies: reducing the number of invocations of the loss function (the distributed computation of the objective's loss and gradient) through algorithm-level optimization, and accelerating convergence through momentum-based parameter updates.

  • DBSCAN

    Density-based spatial clustering of applications with noise (DBSCAN) is a density-based spatial clustering algorithm. It requires that the number of objects contained within a given region of the clustering space be no less than a given threshold. DBSCAN handles noise effectively and can discover clusters of arbitrary shape.

  • DTB

    Decision Tree Bucket (DTB) is a popular supervised data binning (discretization) method based on a decision tree model. Data binning, or discretization, partitions continuous data into discrete intervals. Discretizing a numerical feature maps values from a continuous, infinite space into a finite set of intervals, effectively reducing the time and space overhead of subsequent algorithm processing.

  • Word2Vec

    The Word2Vec algorithm represents each word as a vector, so that the relationship between words can be captured by the distance between their vectors. This is called distributed representation. Besides word embedding, embeddings are also used to encode categorical variables. Unlike one-hot encoding, Word2Vec produces dense, fixed-length features that carry more context information, thereby improving the precision and performance of downstream algorithms. Spark's open-source Word2Vec algorithm spends a long time iterating to convergence when processing data with a large vocabulary; the distributed Word2Vec algorithm built on it is much more efficient.
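
The regularized hinge objective described in the SVM item above can be sketched in a few lines. The weights, data, and regularization strength below are invented for illustration; a real LinearSVC minimizes this objective over the training set.

```python
def hinge_objective(w, b, X, y, lam):
    """Regularized hinge loss: mean(max(0, 1 - y*(w.x + b))) + lam*||w||^2."""
    n = len(X)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        loss += max(0.0, 1.0 - margin)  # hinge: zero once the margin exceeds 1
    reg = lam * sum(wj * wj for wj in w)  # L2 term controls structural risk
    return loss / n + reg

# Toy 2-D data with labels in {-1, +1}; all four points are outside the margin,
# so only the regularization term contributes.
X = [(2.0, 1.0), (1.5, 2.0), (-1.0, -1.5), (-2.0, -1.0)]
y = [1, 1, -1, -1]
print(hinge_objective([0.5, 0.5], 0.0, X, y, lam=0.01))  # → 0.005
```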
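
The DBSCAN rule above (a region must contain at least a threshold number of objects) can be sketched as a single-machine toy, assuming Euclidean distance; `eps` is the region radius and `min_pts` the threshold. The point sets are made up for illustration.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        labels[i] = cluster          # i is a core point: start a new cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels

pts = [(0, 0), (0.5, 0), (0, 0.5), (10, 10), (10.5, 10), (10, 10.5), (50, 50)]
print(dbscan(pts, eps=1.0, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```

The isolated point `(50, 50)` has no dense neighborhood, so it stays labeled `-1` (noise), while the two dense groups become clusters of arbitrary shape.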
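
A minimal sketch of the DTB idea: fit a depth-limited regression-style tree on one continuous feature against a label, and use the split thresholds as bin edges. The splitting criterion (squared-error reduction), depth limit, and data are assumptions for illustration.

```python
def tree_bin_edges(xs, ys, max_depth):
    """Recursively pick split thresholds that most reduce the squared error of y."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    def best_split(pairs):
        pairs = sorted(pairs)
        best, best_gain = None, 0.0
        total = sse([y for _, y in pairs])
        for k in range(1, len(pairs)):
            if pairs[k - 1][0] == pairs[k][0]:
                continue  # cannot split between equal feature values
            gain = total - sse([y for _, y in pairs[:k]]) - sse([y for _, y in pairs[k:]])
            if gain > best_gain:
                best_gain = gain
                best = (pairs[k - 1][0] + pairs[k][0]) / 2  # midpoint threshold
        return best

    def recurse(pairs, depth, edges):
        if depth == 0 or len(pairs) < 2:
            return
        t = best_split(pairs)
        if t is None:
            return
        edges.append(t)
        recurse([p for p in pairs if p[0] <= t], depth - 1, edges)
        recurse([p for p in pairs if p[0] > t], depth - 1, edges)

    edges = []
    recurse(list(zip(xs, ys)), max_depth, edges)
    return sorted(edges)

# Three well-separated value groups yield two bin edges at depth 2.
xs = [1, 2, 3, 10, 11, 12, 20, 21, 22]
ys = [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(tree_bin_edges(xs, ys, max_depth=2))  # → [6.5, 16.0]
```

Each continuous value can then be replaced by the index of the interval it falls into, which is the discretization step the section describes.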
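
The distributed-representation idea behind Word2Vec can be sketched with a tiny single-machine skip-gram trainer: slide a window over a toy corpus and, with one negative sample per positive pair, nudge co-occurring words' vectors together. The corpus, vector size, window, learning rate, and epoch count are all invented; real Word2Vec (including Spark's) uses far larger corpora plus negative sampling or hierarchical softmax at scale.

```python
import math
import random

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
dim, window, lr = 8, 1, 0.05

random.seed(0)
vec = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # center vectors
ctx = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

for _ in range(200):  # a few passes over the toy corpus
    for i, center in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            # positive pair: pull the center vector toward the true context
            g = lr * (1.0 - sigmoid(dot(vec[center], ctx[corpus[j]])))
            for d in range(dim):
                vec[center][d] += g * ctx[corpus[j]][d]
                ctx[corpus[j]][d] += g * vec[center][d]
            # one negative sample: push the center vector away from a random word
            neg = random.choice(vocab)
            g = lr * sigmoid(dot(vec[center], ctx[neg]))
            for d in range(dim):
                vec[center][d] -= g * ctx[neg][d]
                ctx[neg][d] -= g * vec[center][d]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

# "cat" and "dog" occur in near-identical contexts, so their dense vectors
# should drift closer together than, say, "cat" and "on".
print(cosine(vec["cat"], vec["dog"]))
```

This is what "the relationship between words is represented by the distance between vectors" means: unlike one-hot codes, where every pair of words is equally distant, these dense vectors place contextually similar words near each other.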