A Fast Clustering Algorithm for Massive Data

Journal of Beijing University of Posts and Telecommunications, 2020, Vol. 43, Issue (3): 118-124. doi: 10.13190/j.jbupt.2019-078

HE Qian¹, LI Shuang-fu¹,², HUANG Huan¹, XU Hong¹

1. State and Local Joint Engineering Research Center for Satellite Navigation and Location Service, Guilin University of Electronic Technology, Guilin 541004, China;
2. Guangxi Jiaoke Group Company Limited, Nanning 530007, China

Received: 2019-05-11 Online: 2020-06-28 Published: 2020-06-24
About the author: HE Qian (born 1979), male, professor, doctoral supervisor, E-mail: heqian@guet.edu.cn.
Supported by:
National Natural Science Foundation of China (61661015, 61967005); Guangxi Innovation-Driven Development Major Project (AA17202024); Guangxi Science and Technology Innovation Team Project (2019GXNSFGA245004)

Abstract

Abstract: To meet the demands of massive data processing, a grid-based fast K-means clustering algorithm (SPGK) is proposed. An algorithm based on grid centroids is presented for selecting the optimal initial clustering points and the number of clusters. The data are partitioned into grids and the centroid of each grid is computed. These centroids are then used as the sample points for K-means clustering, which reduces the number of Euclidean distance calculations in K-means. SPGK is implemented on the Spark platform for parallel computation, which further improves the running efficiency of the algorithm. SPGK not only achieves good clustering quality but also greatly reduces the number of Euclidean distance calculations, making it suitable for fast clustering of massive data. Experiments on data sets at the ten-million scale show that SPGK clearly outperforms the existing K-means++ and recursive partition based K-means clustering algorithms.

Key words: fast clustering, Spark, optimal initial clustering point, grid partitioning
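
The core idea in the abstract (partition the data into grid cells, then run K-means on the cell centroids instead of on every point) can be illustrated with a small single-machine sketch. This is not the authors' SPGK implementation: the function names `grid_centroids` and `weighted_kmeans`, the fixed `cell_size`, weighting each centroid by its cell's point count, and seeding with the k most populated cells are all assumptions made for illustration. The abstract does not specify the selection rule for the initial points or the number of clusters, and the Spark parallelization is omitted here.

```python
from collections import defaultdict
from math import dist, floor


def grid_centroids(points, cell_size):
    """Bucket points into square grid cells and return one
    (centroid, point_count) pair per non-empty cell."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(floor(x / cell_size) for x in p)
        cells[key].append(p)
    samples = []
    for members in cells.values():
        n = len(members)
        centroid = tuple(sum(coords) / n for coords in zip(*members))
        samples.append((centroid, n))
    return samples


def weighted_kmeans(samples, k, iterations=20):
    """Lloyd-style K-means over (point, weight) samples.

    Assumption: the k heaviest samples seed the centers; this is a
    stand-in for the paper's selection rule, which the abstract
    does not describe.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    centers = [p for p, _ in ranked[:k]]
    dims = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dims for _ in centers]
        weights = [0] * len(centers)
        for p, w in samples:
            # One distance per (sample, center) pair, not per raw point.
            j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            weights[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        new_centers = [
            tuple(s / weights[j] for s in sums[j]) if weights[j] else centers[j]
            for j in range(len(centers))
        ]
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0.2, 0.2), (0.8, 0.8), (1.2, 0.5), (10.2, 10.2), (10.6, 10.4)]
    cells = grid_centroids(data, cell_size=1.0)
    print(len(data), "points reduced to", len(cells), "grid centroids")
    print(weighted_kmeans(cells, k=2))
```

In this sketch each K-means iteration computes one distance per grid centroid and center rather than one per raw point and center, which is the saving the abstract attributes to the grid step. The per-cell aggregation is a group-by-key operation, so it is the kind of step that could plausibly be distributed on a platform such as Spark.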

CLC number:

TP311

HE Qian, LI Shuang-fu, HUANG Huan, XU Hong. A Fast Clustering Algorithm for Massive Data[J]. Journal of Beijing University of Posts and Telecommunications, 2020, 43(3): 118-124.

