Advantages

Comparison Between Popular Solutions

Popular industry solutions for data analysis and prediction include rule-based analysis and open source algorithm libraries. Table 1 compares these solutions with the Kunpeng BoostKit for Big Data algorithm library.

Table 1 Comparison between popular solutions

| Item | Rule-based Analysis | Open Source Algorithm Library | Kunpeng BoostKit for Big Data Algorithm Library |
|---|---|---|---|
| Usage | Relies on databases. ISVs customize SQL statements or SQL-like analysis technologies. | Based on single-node Python algorithm libraries or native Spark algorithms. | Built on improved Spark distributed algorithms, with more algorithms and better algorithm accuracy and performance. |
| Advantages | • Easy to interpret and understand • Easy to use, based on SQL technology | • Supports complex data analysis, such as classification prediction, clustering, and community mining • Distributed in-memory computing, with higher performance than SQL | • A wide range of distributed algorithms for all scenarios • High algorithm accuracy for better performance • Supports large-scale dataset analysis |
| Disadvantages | • Rules are customized manually, so accuracy is low • Long data analysis time • Does not support complex analysis such as trend prediction | • Single-node algorithms have limited computing power and cannot analyze large-scale datasets • Limited distributed algorithms and inadequate scenario coverage | N/A |
| Application Scenario | • Small volumes of data • Accurate rules available | • Medium volumes of data • Entry-level Spark use in scenarios with low performance requirements | • Massive volumes of data • High-precision and high-performance scenarios |
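To make the rule-based analysis column of Table 1 more concrete, the following is a minimal sketch of a hand-written SQL rule. It uses Python's built-in sqlite3 module purely for illustration. The table name, sample rows, and threshold are hypothetical and are not taken from any product.

```python
import sqlite3

# Hypothetical sample data: (transaction id, account, amount).
ROWS = [
    (1, "acct-a", 120.0),
    (2, "acct-a", 9800.0),
    (3, "acct-b", 45.5),
    (4, "acct-c", 15000.0),
]

# Assumed threshold that an analyst would choose and tune by hand.
THRESHOLD = 5000.0


def flag_large_transactions(rows, threshold):
    """Return the ids of transactions whose amount exceeds the threshold."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE tx (id INTEGER PRIMARY KEY, account TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO tx VALUES (?, ?, ?)", rows)
        # The whole "model" is this one manually written rule.
        cur = conn.execute(
            "SELECT id FROM tx WHERE amount > ? ORDER BY id", (threshold,)
        )
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


print(flag_large_transactions(ROWS, THRESHOLD))  # prints [2, 4]
```

The rule is easy to read and explain, but its accuracy depends entirely on the manually chosen threshold, which matches the trade-off listed for rule-based analysis.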

Product Competitiveness of the Algorithm Library

The Kunpeng BoostKit for Big Data algorithm library has the following advantages:

  1. High performance: Compared with open source algorithms, the algorithm library delivers several times higher performance and supports larger datasets.
    • The PCA algorithm delivers 10x higher performance and supports a 1,000x larger feature scale (from tens of thousands to tens of millions of features) than the open source algorithm. PCA supports tens of millions of samples and tens of millions of feature dimensions.
    • The DBSCAN algorithm delivers 24x higher performance and supports 5x more feature dimensions (from 2 dimensions to 10 dimensions) than the open source algorithm. DBSCAN supports samples with up to 20 dimensions.
  2. Full coverage: The algorithm library includes common algorithms for classification and regression, feature engineering, backbone analysis, clustering, and pattern mining.
  3. Easy to deploy: The algorithm library has the same class and interface definitions as the native Spark algorithms, so upper-layer applications require no modification.
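The feature-scale figures quoted for PCA can be put in perspective with some quick arithmetic. The sketch below assumes, for illustration only, a dense square matrix of 64-bit floats with one row and column per feature. It does not describe how either the open source algorithm or the algorithm library stores its data. The two feature counts are order-of-magnitude stand-ins for "tens of thousands" and "tens of millions".

```python
BYTES_PER_FLOAT64 = 8


def dense_square_matrix_bytes(num_features):
    """Bytes needed for a dense num_features x num_features float64 matrix."""
    return num_features * num_features * BYTES_PER_FLOAT64


small = 10_000       # order of magnitude for "tens of thousands" of features
large = 10_000_000   # order of magnitude for "tens of millions" of features

# Ratio between the two feature counts: the "1,000x" figure.
scale_factor = large // small

for d in (small, large):
    gigabytes = dense_square_matrix_bytes(d) / 1e9
    print(f"{d:>10,} features -> {gigabytes:,.1f} GB")
```

Under this assumption, a 1,000x increase in feature count means a 1,000,000x increase in dense matrix size, which is why feature scale is a meaningful limit to quote alongside raw speed.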