Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications, 2008, Vol. 31, Issue (1): 26-29. doi: 10.13190/jbupt.200801.26.067

• Papers

New Word Identification on the Internet

李 钝1,2,曹元大2,万月亮2   

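  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China;
    2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2006-03-26 Revised: 1900-01-01 Published: 2008-02-28 Online: 2008-02-28
  • Corresponding author: LI Dun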

Internet-Oriented New Words Identification

LI Dun1,2, CAO Yuan-da2, WAN Yue-liang2   

  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China;
    2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received:2006-03-26 Revised:1900-01-01 Online:2008-02-28 Published:2008-02-28
  • Contact: LI Dun

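Abstract (translated from the Chinese):

New words keep appearing on the Internet and are hard to identify promptly and effectively. To address this problem, the way such words appear is analyzed, and candidate new-word strings are determined from the co-occurrence frequency of single characters and the temporal pattern of their occurrence. Because the characters in a candidate string occur adjacently, in order, and frequently, an improved association rule mining algorithm is proposed for identifying new words. Experiments show that the method distinguishes new words from common single-character combinations according to the occurrence patterns of the strings, relieves the rigidity that traditional methods incur from fixed n-gram pattern matching, solves the "short word contained in a long word" problem, and improves the precision of new word identification.

Key words: new word identification, association rules, time function, segmentation fragments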

Abstract:

With new words constantly emerging on the Internet, identifying them promptly has become an unavoidable problem. Based on an analysis of how new words appear, candidate strings are extracted according to the co-occurrence frequency of scattered single characters and the temporal pattern of their occurrence. An improved association rule mining algorithm is then proposed to identify new words among the candidate strings, exploiting three characteristics of the characters in a new word: they are adjacent, ordered, and frequent. Experiments show that the method distinguishes new words from common single-character combinations, reduces the rigidity of traditional fixed n-gram pattern matching, and solves the problem of short words contained in long words, thereby improving the precision of new word identification.

Key words: new words identification, association rules, timeliness function, segmentation fragment
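
The abstract does not give the details of the improved algorithm, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes the input is already segmented text in which unrecognized words show up as runs of single-character tokens (segmentation fragments). It counts adjacent, ordered runs of such fragments per document (support) and drops a short candidate when a longer frequent candidate contains it with high confidence. The function names, the thresholds `min_support` and `min_conf`, and the sample data are all hypothetical, and the time-based filtering mentioned in the abstract is omitted.

```python
from collections import Counter


def candidate_strings(docs, max_len=4):
    """Count adjacent, ordered runs of single-character tokens.

    docs: list of token lists from a word segmenter (assumed input format).
    Returns a Counter mapping each run of 2..max_len fragments to the
    number of documents containing it (its support).
    """
    support = Counter()
    for tokens in docs:
        seen = set()
        run = []
        for tok in tokens + [""]:  # empty sentinel flushes the final run
            if len(tok) == 1:
                run.append(tok)
                continue
            # every contiguous sub-run of length >= 2 is a candidate
            for i in range(len(run)):
                for j in range(i + 2, min(i + max_len, len(run)) + 1):
                    seen.add("".join(run[i:j]))
            run = []
        support.update(seen)
    return support


def identify_new_words(docs, min_support=2, min_conf=0.8, max_len=4):
    """Keep frequent candidates; drop short ones explained by a longer one.

    A short string is dropped when support(long) / support(short) >= min_conf,
    i.e. the rule "short => long" holds with high confidence (an assumption
    standing in for the paper's improved association rule step).
    """
    support = candidate_strings(docs, max_len)
    frequent = {s: c for s, c in support.items() if c >= min_support}
    result = dict(frequent)
    for long_s, long_c in frequent.items():
        for i in range(len(long_s)):
            for j in range(i + 2, len(long_s) + 1):
                short = long_s[i:j]
                if short != long_s and short in result:
                    if long_c / frequent[short] >= min_conf:
                        del result[short]
    return result


# Hypothetical segmented documents: unknown words appear as single characters.
docs = [
    ["今天", "微", "博", "客", "更新"],
    ["关注", "微", "博", "客", "用户"],
    ["这个", "微", "博", "客", "很好"],
    ["阅读", "博", "客", "文章"],
]
print(identify_new_words(docs))
```

In this toy data, "微博" never occurs without "客", so it is absorbed into the longer candidate "微博客", while "博客" also occurs on its own and is kept as a separate candidate. Candidate length varies with the fragment run instead of being fixed to a single n.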

CLC number: