Journal of Beijing University of Posts and Telecommunications

  • EI-indexed core journal

Journal of Beijing University of Posts and Telecommunications ›› 2014, Vol. 37 ›› Issue (3): 32-37. doi: 10.13190/j.jbupt.2014.03.007

• Paper

Hierarchical News Topic Detection Using an Improved LSH Algorithm

LU Mei-lian, WANG Zi, LI Jia-shan

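  1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2013-08-08 Online: 2014-06-28 Published: 2014-06-08
  • About the author: LU Mei-lian (born 1967), female, associate professor, E-mail: mllu@bupt.edu.cn.
  • Supported by:

    National Natural Science Foundation of China (61002011); National Science and Technology Major Project (2012ZX03005010-003); National High Technology Research and Development Program of China (2014AA01A706)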

Hierarchical News Topic Detection Using Improved LSH

LU Mei-lian, WANG Zi, LI Jia-shan   

  1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2013-08-08 Online: 2014-06-28 Published: 2014-06-08

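Abstract (translated from the Chinese):

To address the poor timeliness of retrospective topic detection methods, an improved locality-sensitive hashing (LSH) algorithm is proposed and applied to hierarchical topic detection for Internet news. In addition to mining the content features of news, the latent Dirichlet allocation topic model is used to mine its semantic features. The content feature vectors and topic feature vectors, which lie in non-binary spaces, are converted into a binary feature space, and LSH is then applied in turn to cluster the news texts by content features and by topic features, yielding topics with a "topic-content" hierarchy. Experimental results show that, by mining both content and topic features, the method represents news content more accurately and completely. Converting the content and topic features into a unified binary space effectively reduces the time complexity of clustering and improves the efficiency of topic detection, while maintaining detection accuracy and the extensibility of topics at the semantic level.

Key words: topic detection, hierarchical clustering, topic model, locality-sensitive hashing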
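The conversion of real-valued feature vectors into a binary feature space can be illustrated with a minimal sketch. The abstract does not state the conversion rule used in the paper, so the mean-threshold rule below, the function names `binarize` and `hamming`, and the sample vectors are all assumptions chosen only for illustration.

```python
from statistics import mean


def binarize(vector, threshold=None):
    """Map a real-valued feature vector to a 0/1 vector.

    Assumption: a component becomes 1 when it exceeds the threshold
    (default: the mean of the vector). The paper's actual rule is not
    given in the abstract.
    """
    if threshold is None:
        threshold = mean(vector)
    return [1 if x > threshold else 0 for x in vector]


def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical inputs: term weights for one article and its
# LDA topic proportions (illustrative values only).
content_weights = [0.0, 0.42, 0.05, 0.31, 0.0, 0.02]
topic_proportions = [0.70, 0.05, 0.20, 0.05]

content_bits = binarize(content_weights)
topic_bits = binarize(topic_proportions)
```

Once both kinds of features are bit vectors, they can be compared with the same Hamming distance, which is what makes a single hashing scheme applicable to both.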

Abstract:

To address the poor timeliness of retrospective topic detection, an improved locality-sensitive hashing (LSH) algorithm is proposed and applied to constructing a hierarchical topic model for web news. First, the content features of the news are extracted, and the topic features are extracted using the latent Dirichlet allocation model. Then the non-binary content feature vectors and topic feature vectors are converted to a binary feature space. Finally, news articles are clustered by LSH, using the binary content feature vectors and the binary topic feature vectors in turn, to generate a hierarchical topic-content news topic model. Experiments show that extracting both content and topic features represents the news more accurately and completely, and that converting the content and topic feature vectors to a unified binary space reduces the time complexity of clustering, thereby increasing the efficiency of topic detection while preserving accuracy and semantic extensibility.

Key words: topic detection, hierarchical clustering, topic model, locality-sensitive hashing
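The two-stage clustering described in the abstract can be sketched with bit-sampling LSH, a standard LSH family for Hamming space. This is not the paper's improved algorithm, whose details the abstract does not give. It is a simplified illustration that uses a single hash table per level and assumes topic buckets form the upper level with content buckets nested below. All function names and sample data are hypothetical.

```python
import random
from collections import defaultdict


def make_bit_sampler(dim, k, seed):
    """Return a bit-sampling hash: project a bit vector onto k random positions."""
    positions = sorted(random.Random(seed).sample(range(dim), k))
    return lambda bits: tuple(bits[p] for p in positions)


def lsh_buckets(items, key_fn):
    """Group item ids whose hash keys collide."""
    buckets = defaultdict(list)
    for item_id, bits in items.items():
        buckets[key_fn(bits)].append(item_id)
    return dict(buckets)


def hierarchical_topics(topic_bits, content_bits, topic_hash, content_hash):
    """Bucket by topic bits first, then by content bits inside each topic bucket."""
    hierarchy = {}
    for topic_key, members in lsh_buckets(topic_bits, topic_hash).items():
        subset = {m: content_bits[m] for m in members}
        hierarchy[topic_key] = lsh_buckets(subset, content_hash)
    return hierarchy


# Hypothetical binary features for three articles.
topic_bits = {"a": [1, 0, 1, 0], "b": [1, 0, 1, 0], "c": [0, 1, 0, 1]}
content_bits = {
    "a": [1, 1, 0, 0, 1, 0],
    "b": [0, 0, 1, 1, 0, 1],
    "c": [1, 0, 1, 0, 1, 0],
}

hierarchy = hierarchical_topics(
    topic_bits,
    content_bits,
    topic_hash=make_bit_sampler(dim=4, k=2, seed=0),
    content_hash=make_bit_sampler(dim=6, k=3, seed=1),
)
```

In this data, articles `a` and `b` share a topic bucket but are split at the content level, while `c` falls in a separate topic bucket. Each article is hashed once per level, so bucketing costs time linear in the number of articles instead of requiring pairwise comparisons. A practical system would typically combine several hash tables to reduce missed neighbours.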

CLC number: