Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications, 2001, Vol. 24, Issue (1): 42-46.


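Research on Automatic Text Categorization Based on K-Nearest Neighbor

SUN Jian, WANG Wei, ZHONG Yi-xin

  1. Information Engineering School, Beijing University of Posts and Telecommunications, Beijing 100876
  • Received: 2000-09-14 Published: 2001-01-10
  • About the author: SUN Jian (born 1974), male, from Linyi, Shandong, PhD candidate.
  • Funding:
    National Natural Science Foundation of China (69982001)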

Automatic Text Categorization Based on K-Nearest Neighbor

SUN Jian, WANG Wei, ZHONG Yi-xin   

  1. Information Engineering School, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2000-09-14 Online: 2001-01-10

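Abstract: A feature selection method that combines statistical word-frequency information with linguistic information is proposed and implemented. Feature weights take into account not only word frequency but also each feature's degree of concentration and degree of dispersion. Through training and statistics, a feature weight vector is formed for each text category, and the test set is then classified with the K-nearest-neighbor method. Test results on English texts show that the algorithm improves the accuracy of text categorization.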

Keywords: natural language understanding, vector space model, K-nearest neighbor, automatic text categorization

Abstract: A feature selection method that combines linguistic information with statistical word-frequency information from the training corpus is proposed and implemented. The weight of each feature is computed from three quantities: word frequency, degree of concentration, and degree of dispersion. Training yields a feature weight vector for each text category, and the category of an input text is then decided by the K-nearest-neighbor method. Test results on English texts show that the method improves categorization accuracy.

Key words: natural language understanding, vector space model, K-nearest-neighbor, automatic text categorization

CLC number:
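
The pipeline outlined in the abstract (weight features by frequency, concentration, and dispersion, then classify by K-nearest neighbor in a vector space model) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption rather than the authors' method: concentration is taken as the share of a term's occurrences that fall in one category, dispersion as the share of that category's documents containing the term, and the neighbor vote is weighted by cosine similarity. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def feature_weights(docs, labels):
    """Global weight per term: best concentration * dispersion over classes.

    Both definitions are assumptions; the abstract does not give formulas.
    """
    tf = defaultdict(Counter)   # class -> term -> occurrences within the class
    df = defaultdict(Counter)   # class -> term -> documents of the class containing the term
    n_docs = Counter(labels)    # class -> number of training documents
    for tokens, label in zip(docs, labels):
        tf[label].update(tokens)
        df[label].update(set(tokens))
    total = Counter()           # term -> occurrences over the whole training set
    for counts in tf.values():
        total.update(counts)
    weights = {}
    for label in tf:
        for term, count in tf[label].items():
            concentration = count / total[term]           # share of the term's occurrences in this class
            dispersion = df[label][term] / n_docs[label]  # share of the class's documents using the term
            weights[term] = max(weights.get(term, 0.0), concentration * dispersion)
    return weights


def vectorize(tokens, weights):
    """Weighted term-frequency vector; terms unseen in training are dropped."""
    return {t: c * weights[t] for t, c in Counter(tokens).items() if t in weights}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def knn_classify(tokens, docs, labels, weights, k=3):
    """Label a document by similarity-weighted vote among its k nearest training documents."""
    query = vectorize(tokens, weights)
    scored = sorted(
        ((cosine(query, vectorize(d, weights)), label) for d, label in zip(docs, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
train_docs = [
    "stock market shares fall".split(),
    "market investors buy shares".split(),
    "team wins football match".split(),
    "football team loses match".split(),
]
train_labels = ["finance", "finance", "sport", "sport"]
weights = feature_weights(train_docs, train_labels)
print(knn_classify("investors sell shares".split(), train_docs, train_labels, weights, k=3))
```

On this toy corpus the query shares terms only with the two finance documents, so the script prints `finance`. A real system would add the linguistic feature selection step the abstract mentions, which this sketch leaves out.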