Clustering

Scenarios

Clustering is widely used. In business, for example, it helps market analysts identify distinct consumer groups in a consumer database and summarize the consumption patterns or habits of each group. Clustering can be computationally expensive, however. The K-means algorithm, for instance, repeatedly measures the distance between sample vectors, so a very high dimension means that a very large amount of data is involved in each computation, which consumes substantial computing resources. Building on the hardware advantages of the Kunpeng architecture, Kunpeng BoostKit exploits the characteristics of Kunpeng cache blocks and keeps memory access and computation contiguous, which improves the cache hit ratio and reduces latency. As a result, the performance of machine learning algorithms such as LDA, K-means, and KNN is improved by more than 50%.

Principles

LDA
Latent Dirichlet allocation (LDA) is a topic model that infers the topics present in a set of documents. It is also known as a three-level Bayesian model, the three levels being documents, topics, and words. LDA is an unsupervised machine learning technique. In big data scenarios, its training and inference are carried out using distributed computing.
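
The three levels can be illustrated with the standard LDA generative story: each document draws a topic mixture, each word position draws a topic from that mixture, and each topic draws a word from its own word distribution. The following is a minimal sketch in Python using only the standard library. The function names (sample_dirichlet, generate_document) and the toy parameters are assumptions made for illustration; the sketch covers neither training nor inference and is not the BoostKit implementation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate one document (a list of word ids) under the LDA generative story.

    topic_word[z][w] is the probability of word w under topic z (hypothetical
    toy values here); alpha is the Dirichlet prior over topics.
    """
    # Document level: per-document topic mixture.
    theta = sample_dirichlet(alpha, rng)
    vocab_size = len(topic_word[0])
    words = []
    for _ in range(length):
        # Topic level: choose a topic for this word position.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Word level: choose a word from the chosen topic.
        w = rng.choices(range(vocab_size), weights=topic_word[z])[0]
        words.append(w)
    return theta, words


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy topics over a four-word vocabulary (illustrative values only).
    topics = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
    theta, words = generate_document(topics, [0.5, 0.5], 10, rng)
    print(theta, words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions.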

K-means
The K-means clustering (K-means) algorithm originated as a vector quantization method in signal processing and is now more widely used in data mining as a cluster analysis method. K-means aims to partition n points (each of which may be an observation or a sample instance) into k clusters so that each point belongs to the cluster whose mean (that is, the cluster center) is closest to it. The resulting partition divides the data space into Voronoi cells.
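
The assignment and update steps described above can be sketched as follows. This is a minimal, single-machine version of Lloyd's iteration in Python using only the standard library. The function names, the random initialization, and the iteration limit are assumptions made for illustration; it does not reflect the distributed or cache-optimized BoostKit implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance; the cost grows linearly with the dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Partition points (tuples of floats) into k clusters.

    Returns (centers, assignment), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Assumption: initialize the centers with k distinct input points.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the iteration has converged.
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # Keep the old center if a cluster ends up empty.
                dim = len(members[0])
                centers[j] = tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                )
    return centers, assignment


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, assignment = kmeans(data, k=2)
    print(centers, assignment)
```

Every iteration computes n × k distances, each over all dimensions of the vectors, which is why high-dimensional data makes the distance computation the dominant cost.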
