Algorithm Overview
Algorithm Principles
- GBDT
GBDT (gradient boosting decision tree) is a popular decision tree–based ensemble algorithm for classification and regression tasks. It iteratively trains decision trees to minimize a loss function. Spark GBDT supports binary classification and regression, handles both continuous and categorical features, and uses distributed computing for training and inference in big data scenarios.
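The iterative idea can be sketched in a few lines. This is a minimal illustration, not the Spark implementation: it assumes a single numeric feature, squared-error loss, and depth-1 trees (stumps). The names `fit_stump`, `fit_gbdt`, and `predict_gbdt` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbdt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)
```

The sketch requires at least two distinct feature values. Production implementations add deeper trees, other loss functions, and categorical feature handling.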
- RF
The RF (random forest) algorithm trains multiple decision trees simultaneously on sample data consisting of feature vectors and label values, producing a classification or regression model. Given a feature vector, the trained model predicts the most probable label value.
- SVM
SVM is a generalized linear classifier that performs binary classification through supervised learning. Its decision boundary is the maximum-margin hyperplane fitted to the training samples. SVM is a sparse and robust classifier: it uses the hinge loss function to compute empirical risk and adds a regularization term to the optimization problem to reduce structural risk. The LinearSVC algorithm of Spark introduces two optimizations: reducing the number of calls to the function f (the distributed computation of the loss and gradient of the objective function) through algorithm-level optimization, and accelerating convergence by adding momentum to the parameter updates.
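The combination of hinge loss, regularization, and momentum can be illustrated with a single-machine sketch. It assumes plain subgradient descent with classical momentum and labels in {-1, +1}. It is not the optimizer used by Spark's LinearSVC, and `train_linear_svm` and `decision` are hypothetical names.

```python
def train_linear_svm(xs, ys, reg=0.01, lr=0.1, momentum=0.9, n_iter=200):
    """xs: list of feature lists; ys: labels in {-1, +1}."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    vw, vb = [0.0] * dim, 0.0  # momentum (velocity) terms
    for _ in range(n_iter):
        # Gradient of the L2 regularization term (reg / 2) * ||w||^2.
        gw = [reg * wi for wi in w]
        gb = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # hinge loss is active only inside the margin
                for d in range(dim):
                    gw[d] -= y * x[d] / n
                gb -= y / n
        # Momentum update: keep a fraction of the previous step.
        vw = [momentum * v - lr * g for v, g in zip(vw, gw)]
        vb = momentum * vb - lr * gb
        w = [wi + v for wi, v in zip(w, vw)]
        b += vb
    return w, b


def decision(w, b, x):
    """Signed distance-like score; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

Each pass over the data corresponds to one evaluation of the loss and gradient, which is the step a distributed implementation would try to call as few times as possible.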
- K-means
The K-means algorithm originated as a vector quantization method in signal processing and is now widely used in data mining for cluster analysis. K-means clustering partitions n points (each an observation or sample instance) into k clusters so that each point belongs to the cluster whose mean (the cluster center) is nearest to it. The result partitions the data space into Voronoi cells.
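A minimal sketch of the assign-then-update loop (Lloyd's algorithm) follows. It assumes that points are tuples of numbers and that the caller supplies the initial centers. The function names are illustrative only.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def kmeans(points, centers, n_iter=10):
    centers = list(centers)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for i in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign(points, centers)
```

For example, `kmeans(points, [(0, 0), (10, 10)])` returns the final centers together with one cluster index per point. Real implementations also choose initial centers (for example, randomly) and stop early once the assignments no longer change.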
- DecisionTree
The DecisionTree algorithm is widely used in fields such as machine learning and computer vision for classification and regression. It trains a binary tree on sample data consisting of feature vectors and label values, producing a classification or regression model. Given a feature vector, the trained model predicts the most probable label value.
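Each node of a binary tree is built by choosing one split. The sketch below shows that step for a single numeric feature, assuming Gini impurity as the split criterion (one common choice for classification). `gini` and `best_split` are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels are identical."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best binary split x <= t."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

A full tree applies this search recursively to the left and right subsets until a stopping condition, such as a maximum depth, is reached.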
- LinearRegression
Regression algorithms are supervised learning algorithms that model the relationship between the independent variable X and the observable variable Y. When the observable variable is continuous, the task is called regression. In machine learning, LinearRegression uses a linear model to describe this relationship, and the unknown model parameters are estimated from the training data.
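For a single independent variable, estimating the parameters reduces to the closed-form least-squares solution. The sketch below assumes that simple case. `fit_line` is an illustrative name, and distributed implementations typically use iterative or matrix-based solvers instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)  # must be non-zero (xs not all equal)
    slope = sxy / sxx
    return slope, my - slope * mx
```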
- LogisticRegression
LogisticRegression is a classification method that passes the output of a linear model through the logistic (sigmoid) function to model the probability of the observable variable Y given the independent variable X. The unknown model parameters are estimated from the training data.
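A minimal sketch of parameter estimation, assuming the standard formulation in which a linear score is mapped to a probability by the sigmoid function and fitted by batch gradient descent on the log loss. It uses a single feature and labels in {0, 1}. `fit_logistic` is a hypothetical name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, n_iter=500):
    """xs: numeric feature values; ys: labels in {0, 1}."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        gw = sum(e * x for e, x in zip(errors, xs)) / n
        gb = sum(errors) / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```

A sample is assigned to class 1 when `sigmoid(w * x + b)` exceeds 0.5.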
- PCA
PCA is a popular data analysis method used for dimension reduction, feature extraction, anomaly detection, and more. Performing PCA on an m×n matrix A finds the first k principal components [v_1, v_2, ..., v_k] of A and their weights [s_1, s_2, ..., s_k].
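To make the idea of a principal component concrete, the sketch below finds only the first component (k = 1) by running power iteration on the sample covariance matrix. Power iteration is a stand-in chosen for brevity and is not necessarily what the library uses. The name `top_principal_component` is hypothetical.

```python
import math


def top_principal_component(rows, n_iter=100):
    """rows: m samples with n features each. Returns (direction, variance)."""
    m, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(n)]
    centered = [[r[j] - means[j] for j in range(n)] for r in rows]
    # Sample covariance matrix (n x n).
    cov = [[sum(c[i] * c[j] for c in centered) / (m - 1) for j in range(n)]
           for i in range(n)]
    # Power iteration; assumes the start vector is not orthogonal to the
    # dominant eigenvector and that the data is not constant.
    v = [1.0] * n
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n))
                   for i in range(n))
    return v, variance
```

The returned variance is the weight of that component in the covariance formulation. Further components can be obtained by removing the found direction from the data and repeating.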
- SVD
SVD is an important matrix decomposition technique in linear algebra. It is commonly used for extracting information in fields including bioinformatics, signal processing, finance, and statistics. In machine learning, SVD can be used for data compression, dimension reduction, recommendation systems, and natural language processing. Performing SVD on an m×n matrix A decomposes it into A = U S V^T, where U (m×k) is the left singular matrix, V (n×k) is the right singular matrix, and S (k×k) is a diagonal matrix. The diagonal elements of S are the singular values, arranged in descending order. The columns of U and V are orthonormal (U and V are unitary when they are square).
- LDA
LDA (latent Dirichlet allocation) is a topic model that infers topics from a set of documents. It is also known as a three-level Bayesian model, with documents, topics, and words as its levels. LDA is an unsupervised machine learning technique that uses distributed computing for training and inference in big data scenarios.
- PrefixSpan
PrefixSpan is a typical algorithm for frequent pattern mining. It is used to mine frequent sequences that meet the minimum support level. PrefixSpan is efficient because no candidate sequences need to be generated, projected databases keep shrinking quickly, and the memory usage is stable during frequent sequential pattern mining.
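The prefix-projection idea can be sketched as follows. This is a simplified illustration: it assumes each sequence is a list of single items rather than itemsets, and `prefixspan` is a hypothetical function name.

```python
def prefixspan(sequences, min_support):
    """Return (pattern, support) pairs for all frequent sequential patterns."""
    results = []

    def mine(prefix, projected):
        # Count in how many projected sequences each item occurs.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, counts[item]))
            # Project: keep only the suffix after the first occurrence of item.
            suffixes = [seq[seq.index(item) + 1:]
                        for seq in projected if item in seq]
            mine(pattern, suffixes)

    mine([], sequences)
    return results
```

Patterns are grown directly from frequent items in the projected database, so no candidate sequences are generated, and each projection contains only suffixes, which is why the projected databases shrink.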
- ALS
ALS is a collaborative recommendation algorithm that uses alternating least squares to predict missing values.
- KNN
KNN is a non-parametric algorithm that is used to find k samples closest to a given sample. It can be used for classification, regression, and information retrieval.
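A brute-force sketch of the search, assuming Euclidean distance and points given as tuples. The classification helper uses a simple majority vote. Both function names are illustrative, and large-scale implementations typically use indexing or partitioning instead of a full sort.

```python
import math
from collections import Counter


def k_nearest(samples, query, k):
    """Return the k samples closest to query, nearest first."""
    return sorted(samples, key=lambda s: math.dist(s, query))[:k]


def knn_classify(samples, labels, query, k):
    """Predict the majority label among the k nearest samples."""
    ranked = sorted(zip(samples, labels), key=lambda sl: math.dist(sl[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```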
- Covariance
The Covariance algorithm measures the joint variability of two random variables in probability theory and statistics. Variance is a special case of covariance: it is the covariance of a variable with itself.
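A short sketch of the computation, assuming the sample covariance with an n - 1 denominator. The function names are illustrative.

```python
def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def variance(xs):
    """Variance is the covariance of a variable with itself."""
    return covariance(xs, xs)
```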
- DBSCAN
DBSCAN is a density-based spatial clustering algorithm. It requires that the number of objects within a given neighborhood in the clustering space be greater than or equal to a given threshold. DBSCAN handles noise effectively and can discover clusters of arbitrary shape.
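The density rule can be sketched as follows. It assumes Euclidean distance, a brute-force neighbor search, and a neighborhood count that includes the point itself. The parameter names `eps` (neighborhood radius) and `min_pts` (threshold) follow common usage and are assumptions here.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels
```

Points that never fall within `eps` of a core point keep the label -1, which is how noise is separated from clusters.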
- Pearson
The Pearson correlation coefficient measures the linear correlation between two variables X and Y and is widely used in statistics and the natural sciences. Its value ranges from -1 to +1: +1 indicates perfect positive correlation, 0 indicates no linear correlation, and -1 indicates perfect negative correlation.
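A direct sketch of the formula, that is, the covariance of X and Y divided by the product of their standard deviations. `pearson` is an illustrative name, and the sketch assumes neither input is constant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```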
- Spearman
Spearman's rank correlation coefficient, denoted by the Greek letter ρ in statistics, is a non-parametric measure of the dependency between two variables. It assesses how well the relationship between the two variables can be described by a monotonic function. If the data contains no duplicate values and the two variables are perfectly monotonically related, the Spearman correlation coefficient is +1 or -1.
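One common way to compute the coefficient is to replace each value with its rank and then apply the Pearson formula to the ranks. The sketch below assumes that approach, with tied values receiving the average of their ranks. The function names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks are compared, a relationship such as y = x^3 is nonlinear but monotonic and still yields a coefficient of +1.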
- XGBoost
XGBoost is a highly optimized distributed gradient boosting library that is efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting, which can solve many data science problems quickly and accurately.
Application Scenarios
| Algorithm Classification | Algorithm Name | Application Industry: Carrier | Application Industry: Finance | Application Industry: Transportation |
|---|---|---|---|---|
| Machine learning algorithms | GBDT | Identification of high-value customers from other networks; Non-compliant sales of full-frequency and terminal devices | Customer credit assessment; Credit risk assessment; Debt risk rating and warning; Post-loan risk rating; Customer financial profiling; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises | Traffic accident detection; Vehicle identification |
| Machine learning algorithms | RF | High-value customer segmentation; Terminal life cycle analysis; Analysis of subscriber device change behaviors | Insurance fraud identification; Online transaction fraud detection; Credit risk assessment; Debt risk rating and warning | Street racing analysis; Ticket scalper analysis; Traffic signal timing optimization |
| Machine learning algorithms | SVM | Identification and attraction of high-value customers; Identification and escalation of customers for upsell | Price forecast for the international carbon financial market; Enterprise bankruptcy prediction; Vehicle insurance pricing | Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic flow prediction; Street racing analysis |
| Machine learning algorithms | K-means | Reactivation of inactive subscribers; Targeted tariff design; Subscriber package adaptation | Plan for financial IC card promotion in cities; Classification of de facto exchange rate systems; Insurance customer credit analysis; Analysis of consumers' willingness to buy insurance on Internet | Vehicle origin-destination (OD) analysis; Checkpoint data governance; High-risk area identification |
| Machine learning algorithms | DecisionTree | Warning of broadband subscriber churn; Warning of expired broadband subscribers | Customer classification for Internet finance precision marketing; Customer classification for commercial bank telemarketing; Quantitative investment strategy development; Credit card approval; Post-loan risk rating | Street racing analysis; Ticket scalper analysis; Traffic accident detection |
| Machine learning algorithms | LogisticRegression | Fraud warning; Risk evaluation; Intelligent energy consumption prediction | Credit risk analysis of Internet finance P2P services; Post-loan risk analysis; Identification of large-amount foreign exchange fund transactions; Customer credit assessment; Credit rating of listed companies; Warning of extreme risks in the financial market | Traffic flow prediction for road networks; Driving safety index modeling; Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction; Street racing analysis |
| Machine learning algorithms | LinearRegression | International toll call and roaming service analysis; Credit rating | Identification of financial report fraud of listed companies; Warning of commercial bank financial risk; Customer credit risk factor assessment; Small and medium-sized enterprise credit risk assessment; Supply chain financial risk assessment | Road traffic capability evaluation; Recognition of vehicles with cloned or fake license plates; Traffic flow prediction for road networks; Traffic situation analysis |
| Machine learning algorithms | PCA | Extraction of key subscriber features; Subscriber identification; Subscriber credit characteristics; Data engineering of recommendation model; Data engineering of risk assessment model | Data engineering of motor vehicle insurance fraud identification; Data engineering of supply chain financial credit risk assessment model; Warning of overdue repayment | Traffic sign image recognition; Road safety prediction; Cause analysis of traffic accidents and association analysis; Urban traffic intersection correlation analysis |
| Machine learning algorithms | SVD | Abnormal order traffic detection; Network poisoning attack detection and location; Network cloud transmission data compression; Supplier selection; Supplier evaluation methods | Efficiency analysis of financial support for strategic emerging industries (data engineering) and commercial bank customer value segmentation (data engineering); Factor dimension reduction of quantitative investment stock selection; Equity portfolio recommendation | Traffic data preprocessing; Extraction of vehicle travel behavior characteristics; Traffic data compression; Periodic traffic characteristics extraction |
| Machine learning algorithms | LDA | Inappropriate information governance; Content recommendation | Stock clustering for financial knowledge services; Analysis of the relationship between financial and technology media sentiment and online loan market; Acquisition of financial decision-making support knowledge; Knowledge findings in corporate annual reports; Extraction of financial time information | Identification of traffic choke points; Digitalization of traffic law enforcement cases |
| Machine learning algorithms | PrefixSpan | Segmentation of mobile number portability (MNP) port-in subscribers; Port-out subscriber prediction; Intelligent O&M: fault detection and prediction; Intelligent energy consumption management: base station/server energy consumption prediction | Debt risk rating and warning; User consumption behavior prediction and risk analysis; Fund return forecast; Forecast of top holdings within a portfolio; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises | Traffic congestion analysis; Traffic signal timing optimization; Travel mode recommendation; Personal profiling/holographic archiving (analysis of residence, age, gender, consumption level, occupation, etc.) |
| Machine learning algorithms | ALS | Port-in customer product adaptation; Campus/Return-to-hometown marketing; Level-1 electronic channel precision marketing; Tourist services; Identification and escalation of customers for upsell; Service recommendation; Content recommendation | Intelligent app recommendation; Dividend life insurance pricing; Structural difference analysis of life insurance demands; Investor sentiment measurement; American option pricing simulation | Dangerous driving behavior detection; Similar route recommendation |
| Machine learning algorithms | KNN | Terminal app insight; Campus marketing; Resident compound identification | Financial data exception monitoring; Medical insurance review | Abnormal traffic scenario analysis; Accompanying person analysis |
| Machine learning algorithms | Covariance | User loyalty analysis; User preference analysis; User churn analysis; Illegal sales of voucher cards; Channel standby card | Stock correlation analysis; Investment portfolio analysis; Asset configuration analysis; Asset risk value model analysis | Road condition prediction; Congestion propagation analysis; Trajectory matching analysis; Intelligent order dispatching; Detection of abnormal traffic trajectory |
| Machine learning algorithms | DBSCAN | Customer family group identification; Identification and attraction of campus customers; Identification and attraction of customers from other networks; Customer group distribution | Segmentation of commercial bank customer values; Bank loan risk management; Insurance fraud monitoring; Identification of business risks among small- and medium-sized banks; CRM customer segmentation model for insurance industry | Thermal analysis of rail transportation sites; Thermal analysis of rail transportation groups; Analysis of commuting lines; Parking location analysis |
| Machine learning algorithms | Pearson | Mobile station location; Accompanying person analysis; Abnormal order traffic detection; Identification and attraction of migrated customers; User matching policy | Market risk management; Asset risk value model analysis; Insurance claim analysis | Road pass time prediction; Multi-sensor vehicle information convergence; Intelligent order dispatching; Detection of abnormal traffic trajectory |
| Machine learning algorithms | Spearman | User matching policy; Benefits-preferred users; User churn analysis; Mobile network driven by fixed network | Credit card registration recommendation; Customer benefits recommendation; Fraud gang analysis; Insurance customer profiling | Passenger flow prediction and analysis; Mining of congested urban areas; Detection of abnormal traffic trajectory; Intelligent order dispatching |
| Machine learning algorithms | XGBoost | Segmentation of mobile number portability (MNP) port-in subscribers; Port-out subscriber prediction; Intelligent O&M: fault detection and prediction; Intelligent energy consumption management: base station/server energy consumption prediction | Debt risk rating and warning; Online transaction fraud detection; User consumption behavior prediction and risk analysis; Fund return forecast; Forecast of top holdings within a portfolio; Insurance customer risk analysis; Insurance customer churn analysis; Marketing strategy development of insurance enterprises | Traffic congestion analysis; Traffic signal timing optimization; Travel mode recommendation; Vehicle surveillance deployment; Personal profiling/holographic archiving (analysis of residence, age, gender, consumption level, occupation, etc.); Object trajectory prediction |