基于PageRank的改进K均值聚类算法

北京邮电大学学报 ›› 2025, Vol. 48 ›› Issue (2): 18-27.

基于PageRank的改进K均值聚类算法

张冰涛, 魏丹, 沈瑜, 罗维薇

兰州交通大学电子与信息工程学院

收稿日期:2023-05-16 修回日期:2024-06-27 出版日期:2025-04-30 发布日期:2025-04-30
通讯作者: 张冰涛 E-mail:zhangbingtao321@163.com
基金资助:
国家自然科学基金项目

Improved K-means Clustering Algorithm based on PageRank

Received:2023-05-16 Revised:2024-06-27 Online:2025-04-30 Published:2025-04-30

摘要/Abstract

摘要： 针对K均值聚类算法(K-means)初始聚类中心选择的随机性问题,提出一种基于高斯赛德尔(Gauss-Seidel)迭代的网页排名(PageRank)重要性度量算法对K-means进行改进。首先,通过PageRank将聚类数据模拟为有向图上随机游走模型,即一阶马尔可夫链。随机游走遍历节点,建立聚类数据线性方程组。引入Gauss-Seidel,利用其在线性方程组求解过程持续更新未知变量特性,保证节点PageRank值的精准性,进而选择前k个节点作为初始聚类中心。接着,基于最小化损失函数反复迭代优化目标函数值,以精准划分各节点至最佳所属簇。最后,在加州大学欧文分校(UCI)数据集和经典图像上进行系列对比实验,结果表明,对于离散数据,改进K-means的兰德指数、调整兰德指数、标准互信息等性能评价指标均优于原始K-means;对于经典图像,改进K-means能够解决原始K-means边缘分割模糊问题,有效分割目标图像主体与背景。

关键词: K均值聚类, 聚类中心, 高斯赛德尔迭代, 分割图像

Abstract: To address the randomness problem in initial cluster center selection of K-means clustering algorithm, a PageRank importance measurement algorithm based on Gauss-Seidel iteration is proposed to improve K-means. By simulating the clustering data as a random walk model on a directed graph using PageRank, in other words, a first-order Markov chain. The clustering data is traversed randomly, and a linear equation system is established. Gauss-Seidel is introduced, and its continuous updating characteristic of unknown variables in the process of solving linear equation systems is used to ensure the accuracy of node PageRank values, and then the top k nodes are selected as the initial cluster centers. Then, the objective function value is optimized iteratively based on minimizing the loss function to accurately divide each node to the best cluster. Finally, a series of comparative experiments are performed on University of California Irvine (UCI) dataset and classic images, and experimental results show that the rand index, adjusted rand index, normalized mutual information and other performance evaluation indicators of improved K-means are better than the original K-means for discrete data, improving K-means can solve the edge segmentation fuzzy problem of the original K-means for classic images, effectively segmenting the target image subject and background.

Key words: K-means clustering, cluster center, Gaussian seidel iteration, segment images

中图分类号:

TP391

张冰涛, 魏丹, 沈瑜, 罗维薇. 基于PageRank的改进K均值聚类算法[J]. 北京邮电大学学报, 2025, 48(2): 18-27.