Algorithm Overview

Algorithm Principles

GBDT
GBDT (gradient boosted decision trees) is a popular decision tree–based ensemble algorithm used for classification and regression tasks. It iteratively trains decision trees to minimize a loss function, with each new tree fitted to reduce the loss left by the trees before it. Spark GBDT supports binary classification and regression, handles both continuous and categorical features, and uses distributed computing for training and inference in big data scenarios.
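The iterative training described above can be sketched as follows. This is a minimal single-machine illustration, not the Spark implementation: it assumes squared loss, a single numeric feature, and depth-1 trees (stumps). The names `fit_stump` and `fit_gbdt` and the default `n_trees` and `learning_rate` values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold split) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbdt(xs, ys, n_trees=50, learning_rate=0.1):
    """Boost stumps on the residuals (the negative gradient of squared loss)."""
    base = sum(ys) / len(ys)  # initial constant prediction
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


model = fit_gbdt([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
print(round(model(2), 2), round(model(5), 2))
```

Each round shrinks the remaining residual by the learning rate, which is why many small trees are combined rather than one large one.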
RF
Given sample data that consists of feature vectors and label values, the RF (random forest) algorithm trains multiple decision trees simultaneously to obtain a classification or regression model. When feature vectors are input, the resulting model predicts the label value with the highest probability.
SVM
SVM is a generalized linear classifier that performs binary classification on data in a supervised learning manner. Its decision boundary is the maximum-margin hyperplane fitted to the training samples. SVM is a sparse and robust classifier that uses the hinge loss function to calculate empirical risk and adds a regularization term to the optimization problem to reduce structural risk. The LinearSVC algorithm of Spark introduces two optimization policies: reducing the number of calls to the function f (the distributed computation of the loss and gradient of the objective function) through algorithmic optimization, and accelerating convergence by adding momentum to the parameter updates.
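As a rough illustration of hinge loss, regularization, and momentum working together, the sketch below trains a linear SVM by subgradient descent with classical momentum. It is an assumption-laden toy, not the LinearSVC optimizer: the function name `train_linear_svm`, the L2 penalty form, and all default parameter values are hypothetical.

```python
def train_linear_svm(points, labels, reg=0.01, lr=0.1, momentum=0.9, epochs=200):
    """Minimize mean hinge loss + (reg / 2) * ||w||^2 with momentum updates.

    Labels must be +1 or -1.
    """
    dim, n = len(points[0]), len(points)
    w, b = [0.0] * dim, 0.0
    vel_w, vel_b = [0.0] * dim, 0.0  # momentum (velocity) terms
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the regularization term
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only margin violators contribute to hinge loss
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        vel_w = [momentum * v - lr * g for v, g in zip(vel_w, grad_w)]
        vel_b = momentum * vel_b - lr * grad_b
        w = [wi + v for wi, v in zip(w, vel_w)]
        b += vel_b
    return w, b


points = [(2, 2), (3, 3), (-2, -2), (-3, -3)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
print(w, b)
```

The sign of `w · x + b` gives the predicted class for a new point `x`.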
K-means
The K-means algorithm is derived from a vector quantization method in signal processing and is now more widely used in data mining as a clustering analysis method. K-means clustering aims to divide n points (each of which may be an observation or a sample instance) into k clusters, so that each point belongs to the cluster whose mean value (that is, the cluster center) is closest to the point. This is equivalent to partitioning the data space into Voronoi cells.
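The assignment of points to the nearest cluster center can be sketched with the standard iterative refinement (Lloyd's algorithm). This is a minimal single-machine sketch; the function names, the fixed iteration count, and the caller-supplied initial centers are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points and recomputing cluster means."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # A center with no assigned points keeps its previous position.
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(points, [(0, 0), (10, 10)])
print(centers, labels)
```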
DecisionTree
The DecisionTree algorithm is widely used in fields such as machine learning and computer vision for classification and regression. Given sample data that consists of feature vectors and label values, the algorithm trains a binary tree to obtain a classification or regression model. When feature vectors are input, the resulting model predicts the label value with the highest probability.
LinearRegression
Regression algorithms are supervised learning algorithms used to find possible relationships between the independent variable X and the observable variable Y. If the observable variable is continuous, the task is called regression. In machine learning, LinearRegression uses a linear model to describe the relationship between the independent variable X and the observable variable Y. The unknown model parameters are estimated from the training data.
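For a single independent variable, estimating the model parameters from training data reduces to the closed-form least-squares solution. The sketch below shows that case only; the function name `fit_line` is hypothetical, and real workloads with many features would use a distributed solver instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```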
LogisticRegression
LogisticRegression is a classification method that models the relationship between the independent variable X and a categorical observable variable Y. A linear model of X is passed through the logistic (sigmoid) function to give the probability of a class. The unknown model parameters are estimated from the training data.
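A minimal sketch of parameter estimation for the binary case is shown below, assuming a single feature, labels 0 and 1, and plain gradient descent on the log loss. The names `sigmoid` and `fit_logistic` and the default learning rate and epoch count are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.5, epochs=500):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit_logistic([-2, -1, 1, 2], [0, 0, 1, 1])
print(sigmoid(w * 2 + b), sigmoid(w * -2 + b))
```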
PCA
PCA is a popular data analysis method used for dimension reduction, feature extraction, anomaly detection, and more. Performing PCA on the matrix A_m×n finds the first k principal components [v_1, v_2, ..., v_k] of the matrix and their weights [s_1, s_2, ..., s_k].
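To make the idea of a principal component concrete, the sketch below finds only the first component (k = 1) by power iteration on the sample covariance matrix. This is an illustrative method chosen for brevity, not necessarily how the library computes PCA. The returned variance (the top eigenvalue of the covariance matrix) is used here as a stand-in for the component's weight, which is an assumption. The sketch also assumes the data has nonzero variance.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the top principal direction and the variance along it."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (divides by n - 1).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    v = [1.0] * dim  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(dim))
                   for a in range(dim))
    return v, variance


direction, variance = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
print(direction, variance)
```

In this example all points lie on the line y = 2x, so the first component points along (1, 2) and carries all of the variance.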
SVD
SVD is an important matrix decomposition technique in linear algebra. It is commonly used to extract information in fields including bioinformatics, signal processing, finance, and statistics. In machine learning, SVD can be used for data compression, dimension reduction, recommendation systems, and natural language processing. Performing SVD on the matrix A_m×n decomposes it into A=USV^T, where U_m×k is the matrix of left singular vectors, V_n×k is the matrix of right singular vectors, and S_k×k is a diagonal matrix whose diagonal elements are the singular values, arranged in descending order. The columns of U and of V are orthonormal; when U and V are square, they are unitary matrices.
LDA
LDA (latent Dirichlet allocation) is a topic model that infers topics from a set of documents. It is also known as a three-level Bayesian model, the levels being documents, topics, and words. LDA is an unsupervised machine learning technique that uses distributed computing for training and inference in big data scenarios.
PrefixSpan
PrefixSpan is a typical algorithm for frequent pattern mining. It is used to mine frequent sequences that meet the minimum support level. PrefixSpan is efficient because no candidate sequences need to be generated, projected databases keep shrinking quickly, and the memory usage is stable during frequent sequential pattern mining.
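The pattern-growth idea (extend a prefix, then mine only the projected suffixes) can be sketched as below. This is a simplified illustration that assumes each sequence element is a single item rather than an itemset; the function name `prefixspan` and the (sequence index, start position) representation of the projected database are illustrative choices.

```python
def prefixspan(sequences, min_support):
    """Return (pattern, support) pairs for all frequent sequential patterns."""
    results = []

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs: the suffix
        # of each sequence that follows the current prefix.
        occurrences = {}
        for sid, start in projected:
            seen = set()
            for pos in range(start, len(sequences[sid])):
                item = sequences[sid][pos]
                if item not in seen:  # count each item once per sequence
                    seen.add(item)
                    occurrences.setdefault(item, []).append((sid, pos + 1))
        for item, new_projected in sorted(occurrences.items()):
            if len(new_projected) >= min_support:
                pattern = prefix + [item]
                results.append((pattern, len(new_projected)))
                mine(pattern, new_projected)  # grow the prefix, no candidates

    mine([], [(i, 0) for i in range(len(sequences))])
    return results


sequences = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
for pattern, support in prefixspan(sequences, min_support=2):
    print(pattern, support)
```

Because each recursive call only scans suffixes that already contain the prefix, the projected database shrinks at every level.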
ALS
ALS is a collaborative recommendation algorithm that uses alternating least squares to predict missing values.
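The alternating updates can be sketched for the simplest case, a rank-1 factorization where each user and each item has a single latent factor. This is a toy illustration under stated assumptions, not the distributed implementation: the function name `als_rank1`, the use of `None` for missing ratings, and the `reg` parameter are hypothetical, and every row and column is assumed to have at least one observed entry.

```python
def als_rank1(ratings, iterations=100, reg=0.0):
    """Approximate ratings[i][j] by u[i] * v[j]; None marks a missing entry."""
    n_rows, n_cols = len(ratings), len(ratings[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Fix v and solve the least-squares problem for each u[i].
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if ratings[i][j] is not None]
            u[i] = (sum(ratings[i][j] * v[j] for j in obs)
                    / (sum(v[j] ** 2 for j in obs) + reg))
        # Fix u and solve the least-squares problem for each v[j].
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if ratings[i][j] is not None]
            v[j] = (sum(ratings[i][j] * u[i] for i in obs)
                    / (sum(u[i] ** 2 for i in obs) + reg))
    return u, v


ratings = [[1, 2, 3],
           [2, 4, None]]
u, v = als_rank1(ratings)
print(u[1] * v[2])  # prediction for the missing entry
```

Only observed entries enter each least-squares update, which is what allows the fitted factors to predict the missing ones.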
KNN
KNN is a non-parametric algorithm that is used to find k samples closest to a given sample. It can be used for classification, regression, and information retrieval.
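A brute-force sketch of the neighbor search, with majority voting for classification, is shown below. It assumes squared Euclidean distance and a small in-memory sample list; the function names `k_nearest` and `classify` are hypothetical.

```python
from collections import Counter


def k_nearest(samples, query, k):
    """Return the k (point, label) samples closest to the query point."""
    def distance(sample):
        return sum((a - b) ** 2 for a, b in zip(sample[0], query))
    return sorted(samples, key=distance)[:k]


def classify(samples, query, k):
    """Predict the label that is most common among the k nearest samples."""
    votes = Counter(label for _, label in k_nearest(samples, query, k))
    return votes.most_common(1)[0][0]


samples = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b")]
print(classify(samples, (0.2, 0.2), k=3))
print(classify(samples, (5, 5.5), k=3))
```

For regression, the same neighbors would be averaged instead of voted on.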
Covariance
In probability theory and statistics, the Covariance algorithm measures the joint variability of two random variables, that is, the degree to which they change together. Variance is a special case of covariance: it is the covariance between a variable and itself.
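The relationship between covariance and variance can be checked with a short sketch. It assumes the sample covariance (dividing by n - 1); the function name `covariance` is illustrative.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys))  # joint variability of xs and ys
print(covariance(xs, xs))  # variance of xs as a special case
```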
DBSCAN
DBSCAN is a density-based spatial clustering algorithm. It requires that the number of objects contained in a given region of the clustering space be greater than or equal to a given threshold. DBSCAN handles noise effectively and can discover clusters of arbitrary shape.
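The density requirement can be sketched as follows: a point is a core point if at least `min_pts` points (including itself) lie within distance `eps`, clusters grow outward from core points, and points reachable from no core point are labeled as noise. This is a brute-force single-machine sketch; the parameter names and the use of -1 for noise are conventions assumed here.

```python
def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means not yet visited

    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps * eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_nbrs)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```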
Pearson
In statistics and the natural sciences, the Pearson correlation coefficient measures the linear correlation between two variables X and Y. The value ranges from -1 to +1: +1 indicates perfect positive correlation, 0 indicates no linear correlation, and -1 indicates perfect negative correlation.
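The three reference values can be reproduced with a direct implementation of the standard formula (covariance divided by the product of the standard deviations). The function name `pearson` is illustrative, and the sketch assumes neither input is constant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly increasing linear relation
print(pearson([1, 2, 3], [6, 4, 2]))  # perfectly decreasing linear relation
print(pearson([1, 2, 3], [1, 0, 1]))  # no linear relation
```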
Spearman
Spearman's rank correlation coefficient, denoted by the Greek letter ρ in statistics, is a non-parametric indicator that measures the dependency between two variables. It assesses how well the relationship between the two variables can be described by a monotonic function. If there are no duplicate values in the data and the two variables are perfectly monotonically related, the Spearman correlation coefficient is +1 or -1.
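One common way to compute the coefficient is to replace each value by its rank and then apply the Pearson formula to the ranks. The sketch below follows that approach and assigns tied values their average rank; the function names are illustrative, and neither input is assumed to be constant.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i + 1 .. j + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# A monotonic but non-linear relation still gives a perfect rank correlation.
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```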
XGBoost
XGBoost is a highly optimized distributed gradient boosting library that is efficient, flexible, and portable. The library implements machine learning algorithms in the gradient boosting framework and provides a parallel tree boosting algorithm that can quickly and accurately solve many data science problems.

Application Scenarios

Algorithm Classification	Algorithm Name	Application Industry: Carrier	Application Industry: Finance	Application Industry: Transportation
Machine learning algorithms	GBDT	Identification of high-value customers from other networks; Non-compliant sales of full-frequency and terminal devices	Customer credit assessment; Credit risk assessment; Debt risk rating and warning; Post-loan risk rating; Customer financial profiling; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises	Traffic accident detection; Vehicle identification
Machine learning algorithms	RF	High-value customer segmentation; Terminal life cycle analysis; Analysis of subscriber device change behaviors	Insurance fraud identification; Online transaction fraud detection; Credit risk assessment; Debt risk rating and warning	Street racing analysis; Ticket scalper analysis; Traffic signal timing optimization
Machine learning algorithms	SVM	Identification and attraction of high-value customers; Identification and escalation of customers for upsell	Price forecast for the international carbon financial market; Enterprise bankruptcy prediction; Vehicle insurance pricing	Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic flow prediction; Street racing analysis
Machine learning algorithms	K-means	Reactivation of inactive subscribers; Targeted tariff design; Subscriber package adaptation	Plan for financial IC card promotion in cities; Classification of de facto exchange rate systems; Insurance customer credit analysis; Analysis of consumers' willingness to buy insurance on Internet	Vehicle origin-destination (OD) analysis; Checkpoint data governance; High-risk area identification
Machine learning algorithms	DecisionTree	Warning of broadband subscriber churn; Warning of expired broadband subscribers	Customer classification for Internet finance precision marketing; Customer classification for commercial bank telemarketing; Quantitative investment strategy development; Credit card approval; Post-loan risk rating	Street racing analysis; Ticket scalper analysis; Traffic accident detection
Machine learning algorithms	LogisticRegression	Fraud warning; Risk evaluation; Intelligent energy consumption prediction	Credit risk analysis of Internet finance P2P services; Post-loan risk analysis; Identification of large-amount foreign exchange fund transactions; Customer credit assessment; Credit rating of listed companies; Warning of extreme risks in the financial market	Traffic flow prediction for road networks; Driving safety index modeling; Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction; Street racing analysis
Machine learning algorithms	LinearRegression	International toll call and roaming service analysis; Credit rating	Identification of financial report fraud of listed companies; Warning of commercial bank financial risk; Customer credit risk factor assessment; Small and medium-sized enterprise credit risk assessment; Supply chain financial risk assessment	Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic situation analysis
Machine learning algorithms	PCA	Extraction of key subscriber features; Subscriber identification; Subscriber credit characteristics; Data engineering of recommendation model; Data engineering of risk assessment model	Data engineering of motor vehicle insurance fraud identification; Data engineering of supply chain financial credit risk assessment model; Warning of overdue repayment	Traffic sign image recognition; Road safety prediction; Cause analysis of traffic accidents and association analysis; Urban traffic intersection correlation analysis
Machine learning algorithms	SVD	Abnormal order traffic detection; Network poisoning attack detection and location; Network cloud transmission data compression; Supplier selection; Supplier evaluation methods	Efficiency analysis of financial support for strategic emerging industries (data engineering) and commercial bank customer value segmentation (data engineering); Factor dimension reduction of quantitative investment stock selection; Equity portfolio recommendation	Traffic data preprocessing; Extraction of vehicle travel behavior characteristics; Traffic data compression; Periodic traffic characteristics extraction
Machine learning algorithms	LDA	Inappropriate information governance; Content recommendation	Stock clustering for financial knowledge services; Analysis of the relationship between financial and technology media sentiment and online loan market; Acquisition of financial decision-making support knowledge; Knowledge findings in corporate annual reports; Extraction of financial time information	Identification of traffic choke points; Digitalization of traffic law enforcement cases
Machine learning algorithms	PrefixSpan	Segmentation of mobile number portability (MNP) port-in subscribers; Port-out subscriber prediction; Intelligent O&M: fault detection and prediction; Intelligent energy consumption management: base station/server energy consumption prediction	Debt risk rating and warning; User consumption behavior prediction and risk analysis; Fund return forecast; Forecast of top holdings within a portfolio; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises	Traffic congestion analysis; Traffic signal timing optimization; Travel mode recommendation; Personal profiling/holographic archiving (analysis of residence, age, gender, consumption level, occupation, etc.)
Machine learning algorithms	ALS	Port-in customer product adaptation; Campus/Return-to-hometown marketing; Level-1 electronic channel precision marketing; Tourist services; Identification and escalation of customers for upsell; Service recommendation; Content recommendation	Intelligent app recommendation; Dividend life insurance pricing; Structural difference analysis of life insurance demands; Investor sentiment measurement; American option pricing simulation	Dangerous driving behavior detection; Similar route recommendation
Machine learning algorithms	KNN	Terminal app insight; Campus marketing; Resident compound identification	Financial data exception monitoring; Medical insurance review	Abnormal traffic scenario analysis; Accompanying person analysis
Machine learning algorithms	Covariance	User loyalty analysis; User preference analysis; User churn analysis; Illegal sales of voucher cards; Channel standby card	Stock correlation analysis; Investment portfolio analysis; Asset configuration analysis; Asset risk value model analysis	Road condition prediction; Congestion propagation analysis; Trajectory matching analysis; Intelligent order dispatching; Detection of abnormal traffic trajectory
Machine learning algorithms	DBSCAN	Customer family group identification; Identification and attraction of campus customers; Identification and attraction of customers from other networks; Customer group distribution	Segmentation of commercial bank customer values; Bank loan risk management; Insurance fraud monitoring; Identification of business risks among small- and medium-sized banks; CRM customer segmentation model for insurance industry	Thermal analysis of rail transportation sites; Thermal analysis of rail transportation groups; Analysis of commuting lines; Parking location analysis
Machine learning algorithms	Pearson	Mobile station location; Accompanying person analysis; Abnormal order traffic detection; Identification and attraction of migrated customers; User matching policy	Market risk management; Asset risk value model analysis; Insurance claim analysis	Road pass time prediction; Multi-sensor vehicle information convergence; Intelligent order dispatching; Detection of abnormal traffic trajectory
Machine learning algorithms	Spearman	User matching policy; Benefits-preferred users; User churn analysis; Mobile network driven by fixed network	Credit card registration recommendation; Customer benefits recommendation; Fraud gang analysis; Insurance customer profiling	Passenger flow prediction and analysis; Mining of congested urban areas; Detection of abnormal traffic trajectory; Intelligent order dispatching
Machine learning algorithms	XGBoost	Segmentation of mobile number portability (MNP) port-in subscribers; Port-out subscriber prediction; Intelligent O&M: fault detection and prediction; Intelligent energy consumption management: base station/server energy consumption prediction	Debt risk rating and warning; Online transaction fraud detection; User consumption behavior prediction and risk analysis; Fund return forecast; Forecast of top holdings within a portfolio; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises	Traffic congestion analysis; Traffic signal timing optimization; Travel mode recommendation; Vehicle surveillance deployment; Personal profiling/holographic archiving (analysis of residence, age, gender, consumption level, occupation, etc.); Object trajectory prediction
