Journal of Beijing University of Posts and Telecommunications

  • EI-indexed core journal

Journal of Beijing University of Posts and Telecommunications ›› 2009, Vol. 32 ›› Issue (4): 122-127. doi: 10.13190/jbupt.200904.122.shenxy

• Research Reports •

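A New Parallelizable Algorithm for Clustering Chinese Co-Topic Words

Shen Xiaoyan (沈筱彦), Chen Junliang (陈俊亮), Meng Xiangwu (孟祥武), Zhang Yujie (张玉洁)

  1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
  • Received: 2008-11-20 Revised: 2009-06-01 Online: 2009-08-28 Published: 2009-08-28
  • Corresponding author: Shen Xiaoyan (沈筱彦)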

A Parallable Algorithm for Chinese CoTopic Words Clustering

Xiao-Yan Shen, Jun-Liang Chen, Xiang-Wu Meng, Yu-Jie Zhang

  • Received:2008-11-20 Revised:2009-06-01 Online:2009-08-28 Published:2009-08-28

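Chinese abstract:

An efficient algorithm for automatically clustering Chinese words by topic is proposed. The algorithm splits corpus sentences at the enumeration comma (、) to extract coordinate Chinese words, and builds a co-citation graph with the extracted words as nodes. It then generates several locality-sensitive hashing (LSH) signature combinations for each word node. Finally, any two Chinese words that share at least one identical LSH signature combination are labeled as belonging to the same topic class. Experiments show that the algorithm runs fast and is easy to parallelize; backed by a massive corpus, it executes efficiently and clusters well.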
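The extraction and graph-building step can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the corpus is already word-segmented into tokens, the words taken from a list are the tokens directly adjacent to each 、, and two words are linked in the co-citation graph when they occur in the same 、-separated list. The function names `parallel_groups` and `build_cocitation_graph` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

SEP = "、"  # the enumeration comma that separates coordinate words


def parallel_groups(tokens):
    """Return the lists of words joined by '、' in one tokenized sentence.

    Assumption: each list item is the single token next to a '、'.
    """
    groups = []
    current = []  # token indices of the list currently being collected
    for i, tok in enumerate(tokens):
        if tok != SEP or i == 0 or i + 1 >= len(tokens):
            continue
        left, right = i - 1, i + 1
        if tokens[left] == SEP or tokens[right] == SEP:
            continue  # skip malformed input such as '、、'
        if current and current[-1] == left:
            current.append(right)  # the running list continues
        else:
            if current:
                groups.append([tokens[j] for j in current])
            current = [left, right]  # a new list starts here
    if current:
        groups.append([tokens[j] for j in current])
    return groups


def build_cocitation_graph(sentences):
    """Map each extracted word to the set of words listed together with it."""
    graph = defaultdict(set)
    for tokens in sentences:
        for group in parallel_groups(tokens):
            for a, b in combinations(set(group), 2):
                graph[a].add(b)
                graph[b].add(a)
    return graph


# Toy corpus, pre-segmented by spaces (illustrative data only).
sentences = [
    "市场 上 有 苹果 、 香蕉 、 橘子 等 水果".split(),
    "他 去 过 北京 、 上海 和 广州".split(),
]
graph = build_cocitation_graph(sentences)
```

In this toy corpus, 广州 is joined by 和 rather than 、, so the sketch does not extract it. Each sentence is processed independently, which is one reason such a step would be straightforward to run in parallel.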

Chinese key words: Chinese word clustering, co-citation graph, LSH signature, connected component, parallelization

Abstract:

A simple but powerful algorithm for automatically clustering Chinese co-topic words is presented. The method first uses the punctuation mark ‘、’ to split sentences from a corpus and extract the paratactic Chinese words within them, then constructs a co-citation graph with those words as nodes. Second, the method generates several locality-sensitive hashing (LSH) signature combinations for each node in the co-citation graph. Nodes that share at least one LSH signature combination are grouped together, and most of them are likely to belong to the same topic. The main advantages of the algorithm are its fast computation and the ease with which it can be implemented in parallel. Experimental results indicate that the algorithm is efficient and clusters well.
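The signature and grouping steps can be sketched as follows. The abstract does not name the LSH family or say how signatures are combined, so this sketch swaps in a common choice as an assumption: MinHash over each node's neighbourhood (plus the node itself), with the signature cut into bands so that each band is one "signature combination". Nodes sharing any band are merged with union-find, which yields the connected components. All function names and the `bands`/`rows` parameters are hypothetical.

```python
import hashlib
from collections import defaultdict


def _hash(word, seed):
    """Deterministic 64-bit hash of a word under a given seed."""
    h = hashlib.blake2b(word.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def signature_combinations(items, bands, rows):
    """MinHash the item set, then cut the signature into `bands` tuples."""
    sig = [min(_hash(x, seed) for x in items) for seed in range(bands * rows)]
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]


def cluster(graph, bands=8, rows=2):
    """Group nodes that share at least one signature combination."""
    parent = {w: w for w in graph}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    buckets = {}  # signature combination -> first word seen with it
    for word, neighbours in graph.items():
        items = set(neighbours) | {word}
        for key in signature_combinations(items, bands, rows):
            if key in buckets:
                parent[find(word)] = find(buckets[key])  # merge components
            else:
                buckets[key] = word

    clusters = defaultdict(set)
    for w in graph:
        clusters[find(w)].add(w)
    return list(clusters.values())


# Toy co-citation graph with two disjoint groups (illustrative data only).
toy_graph = {
    "苹果": {"香蕉", "橘子"},
    "香蕉": {"苹果", "橘子"},
    "橘子": {"苹果", "香蕉"},
    "北京": {"上海", "广州"},
    "上海": {"北京", "广州"},
    "广州": {"北京", "上海"},
}
clusters = cluster(toy_graph)
```

On the toy graph the three fruit words have identical neighbourhoods and therefore identical signatures, so they fall into one cluster, and likewise the three city words. Because each node's signatures depend only on that node's own neighbourhood, they could be computed in a map step and grouped by key in a reduce step, which fits the parallelism the abstract claims.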

Key words: Chinese word clustering, co-citation graph, connected component, LSH signature, parallelization