北京邮电大学学报

  • EI核心期刊

北京邮电大学学报 ›› 2020, Vol. 43 ›› Issue (3): 118-124.doi: 10.13190/j.jbupt.2019-078

• 研究报告 • 上一篇    下一篇

一种海量数据快速聚类算法

何倩1, 李双富1,2, 黄焕1, 徐红1   

  1. 1. 桂林电子科技大学 卫星导航定位与位置服务国家地方联合工程研究中心, 桂林 541004;
    2. 广西交科集团有限公司, 南宁 530007
  • 收稿日期:2019-05-11 出版日期:2020-06-28 发布日期:2020-06-24
  • 作者简介:何倩(1979-),男,教授,博士生导师,E-mail:heqian@guet.edu.cn.
  • 基金资助:
    国家自然科学基金项目(61661015,61967005);广西创新驱动重大专项项目(AA17202024);广西科技创新团队项目(2019GXNSFGA245004)

A Fast Clustering Algorithm for Massive Data

HE Qian1, LI Shuang-fu1,2, HUANG Huan1, XU Hong1   

  1. 1. State and Local Joint Engineering Research Center for Satellite Navigation and Location Service, Guilin University of Electronic Technology, Guilin 541004, China;
    2. Guangxi Jiaoke Group Company Limited, Nanning 530007, China
  • Received:2019-05-11 Online:2020-06-28 Published:2020-06-24
  • Supported by:
     

摘要: 为满足海量数据处理要求,提出了一种基于网格的K-means快速聚类算法(SPGK).设计基于网格质心的聚类簇个数选取算法,对数据进行网格划分得到每个网格的质心,将质心作为K-means聚类的样本点,从而减少K-means的欧氏距离计算次数.该算法基于Spark平台实现并行计算,进一步地提高了算法的运行效率.SPGK不但能够获得良好的聚类效果,而且缩减了欧氏距离计算次数,适用于海量数据的快速聚类.在千万级数据集上的实验结果表明,SPGK的性能明显优于现有的K-means++和基于K均值聚类的递归划分方法.

关键词: 快速聚类, Spark, 最佳聚类初始点, 网格划分

Abstract: To meet the requirements of massive data processing, a grid-based K-means fast clustering algorithm (SPGK) is proposed. Selection for optimal clustering initial point and the number of clusters algorithm is presented. The grids of different clusters are meshed to obtain the centroid of each grid. These centroid points are used as sample points for K-means clustering, thereby reducing the number of Euclidean distance calculations of K-means. SPGK realizes parallel computation based on Spark platform, which further improves the running efficiency of the algorithm. SPGK not only obtains good clustering effect but also greatly reduces the number of Euclidean distance calculations, which is suitable for fast clustering of mass data. With 10 millions of data, the experiments show that SPGK is superior to the existing K-means++ and recursive partition based K-means clustering algorithms obviously.

Key words: fast clustering, Spark, best initial clustering point, grid generation

中图分类号: