Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications, 2020, Vol. 43, Issue (3): 118-124. doi: 10.13190/j.jbupt.2019-078

• Reports

A Fast Clustering Algorithm for Massive Data

HE Qian1, LI Shuang-fu1,2, HUANG Huan1, XU Hong1   

  1. State and Local Joint Engineering Research Center for Satellite Navigation and Location Service, Guilin University of Electronic Technology, Guilin 541004, China;
  2. Guangxi Jiaoke Group Company Limited, Nanning 530007, China
  • Received: 2019-05-11; Online: 2020-06-28; Published: 2020-06-24

Abstract: To meet the requirements of massive data processing, a grid-based K-means fast clustering algorithm (SPGK) is proposed. An algorithm for selecting the optimal initial clustering points and the number of clusters is presented. The different clusters are then meshed into grids, and the centroid of each grid is computed. These centroids are used as the sample points for K-means clustering, which reduces the number of Euclidean distance calculations that K-means performs. SPGK is parallelized on the Spark platform, which further improves its running efficiency. SPGK achieves good clustering quality while greatly reducing the number of Euclidean distance calculations, making it suitable for fast clustering of massive data. Experiments on 10 million data points show that SPGK clearly outperforms the existing K-means++ and recursive-partition-based K-means clustering algorithms.
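
The core idea in the abstract, replacing raw points with grid centroids before running K-means, can be illustrated with a minimal single-machine sketch. This is not the authors' SPGK implementation: the abstract does not specify the grid construction, the selection of initial points, or the Spark parallelization, so the sketch assumes 2-D points, a uniform square grid with a hypothetical `cell_size` parameter, caller-supplied initial centers, and centroids weighted by the number of points in each cell. The function names `grid_centroids` and `weighted_kmeans` are illustrative only.

```python
import math
from collections import defaultdict


def grid_centroids(points, cell_size):
    """Bucket 2-D points into square cells of side cell_size (an assumed
    uniform grid) and return one (centroid, point_count) pair per
    non-empty cell."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cell = sums[key]
        cell[0] += x
        cell[1] += y
        cell[2] += 1
    return [((sx / n, sy / n), n) for sx, sy, n in sums.values()]


def weighted_kmeans(samples, init_centers, iterations=20):
    """Lloyd-style K-means over (centroid, weight) samples.

    Distances are computed once per grid centroid rather than once per
    raw point, which is where the saving in distance calculations comes
    from. Weighting by cell population is an assumption of this sketch."""
    centers = list(init_centers)
    for _ in range(iterations):
        acc = [[0.0, 0.0, 0] for _ in centers]
        for (x, y), w in samples:
            # Assign the grid centroid to its nearest current center.
            j = min(
                range(len(centers)),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
            acc[j][0] += w * x
            acc[j][1] += w * y
            acc[j][2] += w
        # Keep the old center if no sample was assigned to it.
        centers = [
            (sx / n, sy / n) if n else c
            for (sx, sy, n), c in zip(acc, centers)
        ]
    return centers
```

With this layout, each K-means iteration performs one distance calculation per non-empty grid cell and center, rather than one per raw point and center, so the saving grows as more points fall into each cell.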

Key words: fast clustering, Spark, best initial clustering point, grid generation
