Algorithm Overview
Algorithm Principles
- GBDT
Gradient boosting decision tree (GBDT) is a popular decision tree-based ensemble algorithm for classification and regression tasks. It trains decision trees iteratively, each new tree correcting the errors of the trees before it, to minimize a loss function. Spark GBDT supports binary classification and regression, handles both continuous and categorical features, and uses distributed computing for training and inference in big data scenarios.
- RF
The random forest (RF) algorithm trains multiple decision trees simultaneously to build a classification or regression model from sample data consisting of feature vectors and label values. Given an input feature vector, the trained model predicts the most probable label value.
- SVM
Support vector machine (SVM) is a generalized linear classifier that performs binary classification through supervised learning. Its decision boundary is the maximum-margin hyperplane fitted to the training samples. SVM is a sparse and robust classifier that uses the hinge loss function to compute the empirical risk and adds a regularization term to the optimization problem to reduce the structural risk. The LinearSVC algorithm of Spark introduces two optimization policies: reducing the number of calls to the function f (the distributed computation of the loss and gradient of the objective function) through algorithmic optimization, and accelerating convergence by adding momentum to the parameter updates.
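The hinge loss and regularization term described above can be written as a single objective. The following is a sketch of the standard soft-margin form with L2 regularization; the symbols w, b, and λ and the 1/n scaling are notation assumed here for illustration, and the exact scaling used inside Spark's LinearSVC may differ.

```latex
\min_{w,\,b}\quad
\underbrace{\frac{\lambda}{2}\,\lVert w \rVert_2^2}_{\text{regularization (structural risk)}}
\;+\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^{\top} x_i + b)\bigr)}_{\text{hinge loss (empirical risk)}},
\qquad y_i \in \{-1, +1\}
```

A sample contributes zero loss once it lies on the correct side of the margin, that is, when y_i (w^T x_i + b) ≥ 1, which is why only the samples near the boundary influence the solution.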
- K-means
The K-means algorithm originated as a vector quantization method in signal processing and is now widely used in data mining for cluster analysis. K-means clustering divides n points (each an observation or sample instance) into k clusters so that each point belongs to the cluster whose mean (the cluster center) is closest to it. The result partitions the data space into Voronoi cells.
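The assignment of points to the nearest cluster center can be sketched in a few lines of plain Python. This is a minimal single-machine illustration of Lloyd's iteration, not the distributed Spark implementation; the function name `kmeans`, the seeding with the first k points, and the sample data are assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's iteration on a list of equal-length tuples (illustrative only)."""
    # Assumption: seed the centers with the first k points; production
    # implementations use smarter seeding such as k-means++.
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for c in range(k):
            if clusters[c]:
                size = len(clusters[c])
                new_centers.append(tuple(sum(d) / size for d in zip(*clusters[c])))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    return centers, labels

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Each point ends up labeled with the index of its closest center, which is the Voronoi cell it falls into.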
- DecisionTree
The DecisionTree algorithm is widely used for classification and regression in fields such as machine learning and computer vision. It trains a binary tree to build a classification or regression model from sample data consisting of feature vectors and label values. Given an input feature vector, the trained model predicts the most probable label value.
- LinearRegression
Regression algorithms are supervised learning algorithms that find possible relationships between the independent variable X and the observable variable Y. When the observable variable is continuous, the task is called regression. In machine learning, LinearRegression uses a linear model to describe the relationship between X and Y, and estimates the unknown model parameters from the training data.
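As a small illustration of estimating the parameters of a linear model from training data, the sketch below fits a line to a single feature with ordinary least squares. The function name `fit_line` and the sample data are hypothetical, and the closed-form single-feature solution stands in for whatever solver a distributed implementation would use.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```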
- LogisticRegression
Although the LogisticRegression algorithm has "regression" in its name, it is actually a classification method. It uses a linear model to describe the relationship between the independent variable X and the observable variable Y, mapping the linear output through the logistic (sigmoid) function to a class probability. The unknown model parameters are estimated from the training data.
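To show why a linear model can act as a classifier, here is a minimal sketch that trains a one-feature logistic model with batch gradient descent on the log loss. The names `train_logistic` and `predict`, the learning rate, the epoch count, and the toy data are all assumptions chosen for the example, not values taken from any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Batch gradient descent on the log loss for one feature plus a bias."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    # Classify as 1 when the estimated probability exceeds 0.5.
    return 1 if sigmoid(w * x + b) > 0.5 else 0

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
print(predict(w, b, 1.5), predict(w, b, -1.5))
```

The model output is a probability between 0 and 1; the class label comes from thresholding that probability.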
- PCA
Principal component analysis (PCA) is a popular data analysis method used for dimension reduction, feature extraction, anomaly detection, and more. Performing PCA on an m×n matrix A means finding the first k principal components [v_1, v_2, ..., v_k] of A and their weights [s_1, s_2, ..., s_k].
- SVD
Singular value decomposition (SVD) is an important matrix decomposition technique in linear algebra. It is commonly used to extract information in fields including bioinformatics, signal processing, finance, and statistics. In machine learning, SVD can be used for data compression, dimension reduction, recommendation systems, and natural language processing. Performing SVD on an m×n matrix A decomposes it into A = U S V^T, where U (m×k) is the left singular matrix, V (n×k) is the right singular matrix, and S (k×k) is a diagonal matrix. The diagonal elements of S are called singular values and are arranged in descending order. The columns of U and of V are orthonormal (U and V are unitary when they are square).
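The decomposition and its dimensions can be written out explicitly. The block below restates the factorization in LaTeX and adds a tiny hand-checkable example; the 3×2 matrix is an illustrative choice, not data from the source.

```latex
A_{m \times n} = U_{m \times k}\, S_{k \times k}\, V_{n \times k}^{\top},
\qquad
S = \operatorname{diag}(s_1, \dots, s_k),\quad s_1 \ge s_2 \ge \dots \ge s_k \ge 0,
\qquad
U^{\top} U = V^{\top} V = I_k

% Worked example with m = 3, n = 2, k = 2:
\begin{pmatrix} 3 & 0 \\ 0 & 2 \\ 0 & 0 \end{pmatrix}
=
\underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}}_{U}
\underbrace{\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}}_{S}
\underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{V^{\top}}
```

In the example the singular values are 3 and 2, already in descending order. U is 3×2, so it is not square, but its columns are orthonormal.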
- LDA
Latent Dirichlet allocation (LDA) is a topic model that infers topics from a set of documents. It is a three-level Bayesian model that covers documents, topics, and words. LDA is an unsupervised machine learning technique that uses distributed computing for training and inference in big data scenarios.
- PrefixSpan
PrefixSpan is a typical algorithm for frequent sequential pattern mining. It mines the frequent sequences that meet a minimum support threshold. PrefixSpan is efficient because it generates no candidate sequences, its projected databases shrink quickly, and its memory usage stays stable during mining.
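The idea of growing patterns from prefixes over shrinking projected databases can be sketched as follows. This is a simplified single-machine version that assumes each sequence element is a single item rather than an itemset; the function name `prefixspan` and the sample sequences are hypothetical.

```python
def prefixspan(sequences, min_support):
    """Return {pattern tuple: support} for sequences of single items."""
    results = {}

    def mine(prefix, projected):
        # Count, for each item, how many projected suffixes contain it.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue  # infrequent items are never extended
            pattern = prefix + (item,)
            results[pattern] = support
            # Projected database: what follows the first occurrence of item.
            next_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, next_projected)

    mine((), [list(s) for s in sequences])
    return results

sequences = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
patterns = prefixspan(sequences, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(pattern, support)
```

No candidate sequences are generated up front: a pattern is only extended with items that are frequent in its own projected database, and each projection is no larger than the one it came from.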
- ALS
Alternating least squares (ALS) is a collaborative filtering recommendation algorithm that uses the alternating least squares method to predict missing values.
- KNN
K-nearest neighbors (KNN) is a non-parametric algorithm that finds the k samples closest to a given sample. It can be used for classification, regression, and information retrieval.
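Finding the k closest samples and using them for classification can be sketched with a brute-force search. The helper names `k_nearest` and `knn_classify`, the Euclidean distance, and the toy training set are assumptions for illustration; a distributed implementation would avoid scanning every sample.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (point, label) pairs closest to query (brute force)."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k):
    """Majority vote over the labels of the k nearest samples."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"), ((9.0, 8.0), "B"), ((8.0, 9.0), "B"),
]
print(knn_classify(train, (2.0, 2.0), k=3))
print(knn_classify(train, (8.5, 8.5), k=3))
```

The same `k_nearest` result could feed a regression (average the neighbors' values) or a retrieval task (return the neighbors themselves).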
Application Scenarios
| Algorithm Classification | Algorithm Name | Application Industries: Carrier | Application Industries: Finance | Application Industries: Transportation |
|---|---|---|---|---|
| Machine learning algorithms | GBDT | Identification of high-value customers from other networks; Analysis on full-frequency dual-card terminals; Non-compliant terminal sales | Customer credit assessment; Credit risk assessment; Debt risk rating and warning; Post-loan risk rating; Customer financial profiling; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises | Traffic accident detection; Vehicle identification |
| Machine learning algorithms | RF | High-value customer segmentation; Terminal life cycle analysis; Analysis of subscriber device change behaviors | Insurance fraud identification; Online transaction fraud detection; Credit risk assessment; Debt risk rating and warning | Street racing analysis; Ticket scalper analysis; Traffic signal timing optimization |
| Machine learning algorithms | SVM | Identification and attraction of high-value customers; Identification and escalation of customers for upsell | Price forecast for the international carbon financial market; Enterprise bankruptcy prediction; Vehicle insurance pricing | Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic flow prediction; Street racing analysis |
| Machine learning algorithms | Kmeans | Reactivation of inactive subscribers; Targeted tariff design; Subscriber package adaptation | Plan for financial IC card promotion in cities; Classification of de facto exchange rate systems; Insurance customer credit analysis; Analysis of consumers' willingness to buy insurance on Internet | Vehicle origin-destination (OD) analysis; Checkpoint data governance; High-risk area identification |
| Machine learning algorithms | Decision Tree | Warning of broadband subscriber churn; Warning of expired broadband subscribers | Customer classification for Internet finance precision marketing; Customer classification for commercial bank telemarketing; Quantitative investment strategy development; Credit card approval; Post-loan risk rating | Street racing analysis; Ticket scalper analysis; Traffic accident detection |
| Machine learning algorithms | Logistic Regression | Fraud warning; Risk evaluation; Intelligent energy consumption prediction | Credit risk analysis of Internet finance P2P services; Post-loan risk analysis; Identification of large-amount foreign exchange fund transactions; Customer credit assessment; Credit rating of listed companies; Warning of extreme risks in the financial market | Traffic flow prediction for road networks; Driving safety index modeling; Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction; Street racing analysis |
| Machine learning algorithms | Linear Regression | International toll call and roaming service analysis; Credit rating | Identification of financial report fraud of listed companies; Warning of commercial bank financial risk; Customer credit risk factor assessment; Small and medium-sized enterprise credit risk assessment; Supply chain financial risk assessment | Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic situation analysis |
| Machine learning algorithms | PCA | Extraction of key subscriber features; Subscriber identification; Subscriber credit characteristics; Data engineering of recommendation model; Data engineering of risk assessment model | Data engineering of motor vehicle insurance fraud identification; Data engineering of supply chain financial credit risk assessment model; Warning of overdue repayment | Traffic sign image recognition; Road safety prediction; Cause analysis of traffic accidents and association analysis; Urban traffic intersection correlation analysis |
| Machine learning algorithms | SVD | Abnormal order traffic detection; Network poisoning attack detection and location; Network cloud transmission data compression; Supplier selection; Supplier evaluation methods | Efficiency analysis of financial support for strategic emerging industries (data engineering) and commercial bank customer value segmentation (data engineering); Factor dimension reduction of quantitative investment stock selection; Equity portfolio recommendation | Traffic data preprocessing; Extraction of vehicle travel behavior characteristics; Traffic data compression; Periodic traffic characteristics extraction |
| Machine learning algorithms | LDA | Inappropriate information governance; Content recommendation | Stock clustering for financial knowledge services; Analysis of the relationship between financial and technology media sentiment and online loan market; Acquisition of financial decision-making support knowledge; Knowledge findings in corporate annual reports; Extraction of financial time information | Identification of traffic choke points; Digitalization of traffic law enforcement cases |
| Machine learning algorithms | PrefixSpan | Segmentation of mobile number portability (MNP) port-in subscribers; Port-out subscriber prediction; Intelligent O&M: fault detection and prediction; Intelligent energy consumption management: base station/server energy consumption prediction | Debt risk rating and warning; User consumption behavior prediction and risk analysis; Fund return forecast; Forecast of top holdings within a portfolio; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises | Traffic congestion analysis; Traffic signal timing optimization; Travel mode recommendation; Personal profiling/holographic archiving (analysis of residence, age, gender, consumption level, occupation, etc.) |
| Machine learning algorithms | ALS | Port-in customer product adaptation; Campus/Return-to-hometown marketing; Level-1 electronic channel precision marketing; Tourist services; Identification and escalation of customers for upsell; Business recommendation; Content recommendation | Intelligent app recommendation; Participating life insurance pricing; Structural difference analysis of life insurance demands; Investor sentiment measurement; American option pricing simulation | Dangerous driving behavior detection; Similar route recommendation |
| Machine learning algorithms | KNN | Terminal app insight; Campus marketing; Resident compound identification | Financial data exception monitoring; Medical insurance review | Abnormal traffic scenario analysis; Accompanying person analysis |