Internet-Oriented New Words Identification

Journal of Beijing University of Posts and Telecommunications, 2008, Vol. 31, Issue (1): 26-29. doi: 10.13190/jbupt.200801.26.067

LI Dun1,2, CAO Yuan-da2, WAN Yue-liang2

1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China;
2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China

Received: 2006-03-26 Revised: 1900-01-01 Online: 2008-02-28 Published: 2008-02-28
Contact: LI Dun

Abstract:

New words appear on the Internet continually and are hard to identify promptly and effectively. Based on an analysis of how new words appear, candidate new-word strings are determined from the co-occurrence frequency of single characters and the temporal pattern of their occurrence. Because the characters in a candidate string occur adjacently, in order, and frequently, an improved association rule mining algorithm is proposed to identify new words. Experiments show that the method distinguishes new words from common single-character combinations according to the occurrence patterns of the strings, reduces the rigidity that fixed n-gram pattern matching causes in traditional methods, solves the problem of short words contained in long words, and improves the precision of new word identification.

Key words: new words identification, association rules, timeliness function, segmentation fragment
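
The abstract does not give the details of the improved association rule algorithm, so the following is only a minimal sketch of the general idea it describes, not the authors' method. It assumes the input is a list of runs of leftover single characters from a word segmenter (a hypothetical format). It counts contiguous, ordered character strings, keeps those that meet a minimum support, and drops a shorter string when a longer frequent string contains it with the same count, which is one simple way to handle short words contained in long words. The function names, the `min_support` and `max_len` parameters, and the toy data are all illustrative assumptions. The timeliness function mentioned in the keywords is not modeled here.

```python
from collections import Counter


def frequent_adjacent_strings(fragments, min_support=2, max_len=4):
    """Count contiguous, ordered character strings of length 2..max_len.

    `fragments` is a list of strings; each string is one run of adjacent
    single-character fragments (assumed input format, not from the paper).
    Only strings seen at least `min_support` times are returned.
    """
    counts = Counter()
    for run in fragments:
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def drop_subsumed(frequent):
    """Drop a string if a longer frequent string contains it with equal count."""
    kept = {}
    for s, c in frequent.items():
        subsumed = any(
            len(t) > len(s) and s in t and frequent[t] == c
            for t in frequent
        )
        if not subsumed:
            kept[s] = c
    return kept


# Toy data for illustration only.
fragments = ["人肉搜索", "人肉搜索", "被人肉搜索", "山寨", "山寨"]
candidates = drop_subsumed(frequent_adjacent_strings(fragments))
print(candidates)
```

On this toy input, the substrings of "人肉搜索" all occur exactly as often as the full string, so only the full string and "山寨" remain. A real system would also need a dictionary check and some form of time-based filtering, which this sketch leaves out.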

CLC number: TP311

Cite this article: LI Dun, CAO Yuan-da, WAN Yue-liang. Internet-Oriented New Words Identification[J]. Journal of Beijing University of Posts and Telecommunications, 2008, 31(1): 26-29.
