Algorithm Overview

Algorithm Principles

GBDT
GBDT (gradient boosting decision tree) is a popular decision tree-based ensemble algorithm used for classification and regression tasks. It trains decision trees iteratively, with each new tree fitted to reduce the loss of the current ensemble. Spark GBDT supports binary classification and regression, handles both continuous and categorical features, and uses distributed computing for training and inference in big data scenarios.
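The iterative training described above can be illustrated with a minimal single-machine sketch. It is not the Spark GBDT implementation: it assumes a single numeric feature with at least two distinct values, squared-error loss (so the negative gradient equals the residual), and depth-1 trees. The names fit_stump, fit_gbdt, and predict are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on one feature that minimizes squared error."""
    best = None
    # Candidate thresholds exclude the largest value so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left value, right value)


def predict(model, x):
    """Sum the base value and the scaled outputs of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def fit_gbdt(xs, ys, rounds=20, learning_rate=0.5):
    """Boosting loop: each round fits a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        model = (base, learning_rate, stumps)
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return (base, learning_rate, stumps)


model = fit_gbdt([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
print(round(predict(model, 1), 3), round(predict(model, 4), 3))
```

Each round shrinks the remaining residual by the learning rate, which is why the ensemble approaches the training labels gradually instead of in one step.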
RF
The RF (random forest) algorithm trains multiple decision trees simultaneously on sample data consisting of feature vectors and label values to obtain a classification or regression model. Given an input feature vector, the output model predicts the most probable label value.
SVM
SVM (support vector machine) is a generalized linear classifier that performs binary classification on data through supervised learning. Its decision boundary is the maximum-margin hyperplane solved from the training samples. SVM is a sparse and robust classifier that uses the hinge loss function to calculate the empirical risk and adds a regularization term to the optimization problem to reduce the structural risk. The LinearSVC algorithm of Spark introduces two optimization policies: reducing the number of calls to the function f (the distributed computation of the loss and gradient of the objective function) through algorithm principle optimization, and accelerating convergence by adding momentum to parameter updates.
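As an illustration of the hinge loss, regularization term, and momentum update mentioned above, the following is a minimal single-machine sketch. It is not the Spark LinearSVC implementation: the function names, the L2 regularizer, and the classic heavy-ball momentum rule are assumptions chosen for illustration.

```python
def hinge_objective(w, b, samples, reg=0.01):
    """Return the L2-regularized hinge loss and a subgradient with respect to w and b.

    samples is a list of (feature list, label) pairs with labels in {-1, +1}.
    """
    n = len(samples)
    loss = 0.5 * reg * sum(wi * wi for wi in w)
    grad_w = [reg * wi for wi in w]
    grad_b = 0.0
    for x, y in samples:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1.0:  # only samples inside the margin contribute
            loss += (1.0 - margin) / n
            grad_w = [g - y * xi / n for g, xi in zip(grad_w, x)]
            grad_b -= y / n
    return loss, grad_w, grad_b


def train(samples, dim, steps=100, lr=0.1, momentum=0.9, reg=0.01):
    """Subgradient descent with heavy-ball momentum (an assumed update rule)."""
    w, b = [0.0] * dim, 0.0
    vel_w, vel_b = [0.0] * dim, 0.0
    for _ in range(steps):
        _, grad_w, grad_b = hinge_objective(w, b, samples, reg)
        # The velocity accumulates past gradients, which speeds up convergence.
        vel_w = [momentum * v - lr * g for v, g in zip(vel_w, grad_w)]
        vel_b = momentum * vel_b - lr * grad_b
        w = [wi + v for wi, v in zip(w, vel_w)]
        b += vel_b
    return w, b


samples = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
loss, grad_w, grad_b = hinge_objective([0.0, 0.0], 0.0, samples)
print(loss, grad_w, grad_b)
```

Every call to hinge_objective is one pass over the data, which corresponds to the costly step that a distributed implementation tries to invoke as few times as possible.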
K-means
The K-means algorithm is derived from a vector quantization method in signal processing and is now widely used in data mining as a clustering analysis method. K-means clustering divides n points (each point being an observation or a sample instance) into k clusters so that each point belongs to the cluster whose mean (that is, the cluster center) is closest to the point. The result partitions the data space into Voronoi cells.
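The assignment of points to the nearest cluster center can be sketched with Lloyd's iteration, the standard heuristic for K-means. This is a minimal single-machine sketch, not the Spark implementation. The function names and the fixed initial centers are assumptions for illustration.

```python
import math


def assign(points, centers):
    """Assignment step: group each point with its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centers, max_iterations=100):
    """Alternate assignment and mean update until the centers stop moving."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        # Update step: move each center to the mean of its members.
        # A center with no members is left where it is.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else center
            for center, members in zip(centers, clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```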
DecisionTree
The DecisionTree algorithm is widely used in fields such as machine learning and computer vision for classification and regression. It trains a binary tree on sample data consisting of feature vectors and label values to obtain a classification or regression model. Given an input feature vector, the output model predicts the most probable label value.
LinearRegression
Regression algorithms are supervised learning algorithms that model the relationship between the independent variable X and the observed variable Y. When the observed variable is continuous, the task is called regression. In machine learning, LinearRegression uses a linear model to describe this relationship and estimates the unknown model parameters from the training data.
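Estimating the parameters of a linear model from training data can be illustrated with ordinary least squares on a single feature. This is a minimal sketch under that assumption, not the Spark LinearRegression implementation. The function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution: covariance of x and y divided by variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)  # 2.0 1.0
```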
LogisticRegression
Although the LogisticRegression algorithm has "regression" in its name, it is actually a classification method. It uses a linear model to describe the relationship between the independent variable X and the observed variable Y, and estimates the unknown model parameters from the training data.
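The way a linear model is turned into a classifier can be sketched as follows: the linear score is passed through the logistic (sigmoid) function to obtain a class probability, and the parameters are fitted by gradient descent on the log loss. This is a minimal single-machine sketch, not the Spark implementation. The function names, step count, and learning rate are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probability that x belongs to class 1 under the linear model (w, b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit(samples, dim, steps=200, lr=0.5):
    """Batch gradient descent on the mean log loss; labels are 0 or 1."""
    w, b = [0.0] * dim, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in samples:
            error = predict_proba(w, b, x) - y
            grad_w = [g + error * xi / n for g, xi in zip(grad_w, x)]
            grad_b += error / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


samples = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
w, b = fit(samples, dim=1)
print(predict_proba(w, b, [2.0]) > 0.5, predict_proba(w, b, [-2.0]) < 0.5)
```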
PCA
PCA (principal component analysis) is a popular data analysis method used for dimension reduction, feature extraction, anomaly detection, and more. Performing PCA on an m×n matrix A means finding the first k principal components [v_1, v_2, ..., v_k] of A and their weights [s_1, s_2, ..., s_k].
SVD
SVD (singular value decomposition) is an important matrix decomposition technique in linear algebra. It is commonly used for extracting information in fields including bioinformatics, signal processing, finance, and statistics. In machine learning, SVD can be used for data compression, dimension reduction, recommendation systems, and natural language processing. Performing SVD on an m×n matrix A decomposes it into A = USV^T, where U_m×k contains the left singular vectors, V_n×k contains the right singular vectors, and S_k×k is a diagonal matrix whose diagonal elements are the singular values, arranged in descending order. The columns of U and of V are orthonormal; U and V are unitary matrices when they are square.
LDA
LDA (latent Dirichlet allocation) is a topic model that extracts topics from a set of documents. It is also known as a three-level Bayesian model, with the three levels being documents, topics, and words. LDA is an unsupervised machine learning technique that uses distributed computing for training and inference in big data scenarios.
PrefixSpan
PrefixSpan is a typical algorithm for frequent pattern mining. It is used to mine frequent sequences that meet the minimum support level. PrefixSpan is efficient because no candidate sequences need to be generated, projected databases keep shrinking quickly, and the memory usage is stable during frequent sequential pattern mining.
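The prefix-projection idea can be sketched in a few lines: a frequent prefix is extended one item at a time, and only the suffixes that follow the prefix (the projected database) are scanned further, so no candidate sequences are generated. This is a simplified sketch that assumes each sequence element is a single item instead of an itemset. It is not the Spark implementation, and the function name is hypothetical.

```python
def prefixspan(sequences, min_support):
    """Return {pattern tuple: support} for sequences of single items."""
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per projected suffix.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project: keep only what follows the first occurrence of the item.
            next_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, next_projected)

    mine((), [list(s) for s in sequences])
    return results


sequences = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
for pattern, support in prefixspan(sequences, min_support=2).items():
    print(pattern, support)
```

Because every recursion works on suffixes only, the projected databases shrink at each level, which matches the efficiency argument given above.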
ALS
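ALS (alternating least squares) is a collaborative filtering recommendation algorithm. It factorizes the sparse user-item rating matrix into user factors and item factors by alternately solving least squares problems, and uses the factors to predict the missing values.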
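The alternating scheme can be illustrated with a deliberately small sketch: a rank-1 factorization without regularization, in which each user factor and each item factor has a closed-form least squares solution when the other side is fixed. Production implementations use higher ranks and regularization. The function name and the toy ratings are assumptions for illustration.

```python
def als_rank1(ratings, n_users, n_items, iterations=50):
    """Fit rating(user, item) ~ u[user] * v[item] on observed entries only.

    ratings is a dict {(user, item): value} that omits the missing entries.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve each user factor by least squares.
        for user in range(n_users):
            num = sum(r * v[i] for (a, i), r in ratings.items() if a == user)
            den = sum(v[i] ** 2 for (a, i), _ in ratings.items() if a == user)
            if den > 0:
                u[user] = num / den
        # Fix the user factors and solve each item factor by least squares.
        for item in range(n_items):
            num = sum(r * u[a] for (a, i), r in ratings.items() if i == item)
            den = sum(u[a] ** 2 for (a, i), _ in ratings.items() if i == item)
            if den > 0:
                v[item] = num / den
    return u, v


# The rating of user 1 for item 1 is missing; a rank-1 model implies 4.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0}
u, v = als_rank1(ratings, n_users=2, n_items=2)
print(round(u[1] * v[1], 3))
```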
KNN
KNN is a non-parametric algorithm that is used to find k samples closest to a given sample. It can be used for classification, regression, and information retrieval.
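Finding the k closest samples and using them for classification can be sketched with a brute-force search and a majority vote. This is a minimal single-machine sketch that assumes Euclidean distance. The function names and sample data are hypothetical.

```python
import math
from collections import Counter


def k_nearest(samples, query, k):
    """Return the k (point, label) pairs closest to the query point."""
    return sorted(samples, key=lambda s: math.dist(s[0], query))[:k]


def classify(samples, query, k):
    """Predict the label by majority vote among the k nearest samples."""
    votes = Counter(label for _, label in k_nearest(samples, query, k))
    return votes.most_common(1)[0][0]


samples = [
    ((0.0, 0.0), "a"),
    ((0.0, 1.0), "a"),
    ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"),
    ((6.0, 5.0), "b"),
]
print(classify(samples, (0.5, 0.5), k=3))  # a
```

For regression, the vote would be replaced by the mean of the neighbors' values; for information retrieval, the neighbors themselves are the result.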

Application Scenarios

Algorithm classification: Machine learning algorithms
Application industries: Carrier, Finance, Transportation

GBDT
Carrier: Identification of high-value customers from other networks; Analysis on full-frequency dual-card terminals; Non-compliant terminal sales
Finance: Customer credit assessment; Credit risk assessment; Debt risk rating and warning; Post-loan risk rating; Customer financial profiling; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises
Transportation: Traffic accident detection; Vehicle identification

RF
Carrier: High-value customer segmentation; Terminal life cycle analysis; Analysis of subscriber device change behaviors
Finance: Insurance fraud identification; Online transaction fraud detection; Credit risk assessment; Debt risk rating and warning
Transportation: Street racing analysis; Ticket scalper analysis; Traffic signal timing optimization

SVM
Carrier: Identification and attraction of high-value customers; Identification and escalation of customers for upsell
Finance: Price forecast for the international carbon financial market; Enterprise bankruptcy prediction; Vehicle insurance pricing
Transportation: Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic flow prediction; Street racing analysis

K-means
Carrier: Reactivation of inactive subscribers; Targeted tariff design; Subscriber package adaptation
Finance: Plan for financial IC card promotion in cities; Classification of de facto exchange rate systems; Insurance customer credit analysis; Analysis of consumers' willingness to buy insurance on the Internet
Transportation: Vehicle origin-destination (OD) analysis; Checkpoint data governance; High-risk area identification

Decision Tree
Carrier: Warning of broadband subscriber churn; Warning of expired broadband subscribers
Finance: Customer classification for Internet finance precision marketing; Customer classification for commercial bank telemarketing; Quantitative investment strategy development; Credit card approval; Post-loan risk rating
Transportation: Street racing analysis; Ticket scalper analysis; Traffic accident detection

Logistic Regression
Carrier: Fraud warning; Risk evaluation; Intelligent energy consumption prediction
Finance: Credit risk analysis of Internet finance P2P services; Post-loan risk analysis; Identification of large-amount foreign exchange fund transactions; Customer credit assessment; Credit rating of listed companies; Warning of extreme risks in the financial market
Transportation: Traffic flow prediction for road networks; Driving safety index modeling; Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction; Street racing analysis

Linear Regression
Carrier: International toll call and roaming service analysis
Finance: Credit rating; Identification of financial report fraud of listed companies; Warning of commercial bank financial risk; Customer credit risk factor assessment; Small and medium-sized enterprise credit risk assessment; Supply chain financial risk assessment
Transportation: Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic situation analysis

PCA
Carrier: Extraction of key subscriber features; Subscriber identification; Subscriber credit characteristics
Finance: Data engineering of recommendation model; Data engineering of risk assessment model; Data engineering of motor vehicle insurance fraud identification; Data engineering of supply chain financial credit risk assessment model; Warning of overdue repayment
Transportation: Traffic sign image recognition; Road safety prediction; Cause analysis of traffic accidents and association analysis; Urban traffic intersection correlation analysis

SVD
Carrier: Abnormal order traffic detection; Network poisoning attack detection and location; Network cloud transmission data compression
Finance: Supplier selection; Supplier evaluation methods; Efficiency analysis of financial support for strategic emerging industries (data engineering) and commercial bank customer value segmentation (data engineering); Factor dimension reduction of quantitative investment stock selection; Equity portfolio recommendation
Transportation: Traffic data preprocessing; Extraction of vehicle travel behavior characteristics; Traffic data compression; Periodic traffic characteristics extraction

LDA
Carrier: Inappropriate information governance; Content recommendation
Finance: Stock clustering for financial knowledge services; Analysis of the relationship between financial and technology media sentiment and online loan market; Acquisition of financial decision-making support knowledge; Knowledge findings in corporate annual reports; Extraction of financial time information
Transportation: Identification of traffic choke points; Digitalization of traffic law enforcement cases

PrefixSpan
Carrier: Segmentation of mobile number portability (MNP) port-in subscribers; Port-out subscriber prediction; Intelligent O&M: fault detection and prediction; Intelligent energy consumption management: base station/server energy consumption prediction
Finance: Debt risk rating and warning; User consumption behavior prediction and risk analysis; Fund return forecast; Forecast of top holdings within a portfolio; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises
Transportation: Traffic congestion analysis; Traffic signal timing optimization; Travel mode recommendation; Personal profiling/holographic archiving (analysis of residence, age, gender, consumption level, occupation, etc.)

ALS
Carrier: Port-in customer product adaptation; Campus/Return-to-hometown marketing; Level-1 electronic channel precision marketing; Tourist services; Identification and escalation of customers for upsell; Business recommendation; Content recommendation; Intelligent app recommendation
Finance: Participating life insurance pricing; Structural difference analysis of life insurance demands; Investor sentiment measurement; American option pricing simulation
Transportation: Dangerous driving behavior detection; Similar route recommendation

KNN
Carrier: Terminal app insight; Campus marketing; Resident compound identification
Finance: Financial data exception monitoring; Medical insurance review
Transportation: Abnormal traffic scenario analysis; Accompanying person analysis
