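Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications, 2023, Vol. 46, Issue (4): 91-96,122.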

• Research Report •

Chinese Spelling Correction Model Based on Gated Feature Fusion

ZHOU Yuhao, SUN Zhe, WU Xiaofei, YU Ke

  1. School of Artificial Intelligence, Beijing University of Posts and Telecommunications
  • Received: 2022-06-28 Revised: 2022-09-26 Online: 2023-08-28 Published: 2023-08-24
  • Corresponding author: YU Ke E-mail: yuke@bupt.edu.cn
  • Supported by:
    National Natural Science Foundation of China

Abstract: In Chinese spelling correction, models that fuse the semantic, phonetic and glyph information of Chinese characters with equal weight can be degraded by incorrect phonetic or glyph information. To address this problem, a Chinese spelling correction model based on gated feature fusion is proposed. It uses adaptive gates to selectively fuse semantic, phonetic and glyph information, which improves model performance and enhances interpretability. In addition, an improved four corner code is used to encode Chinese characters, effectively extracting their glyph features, and on this basis the glyph-similarity confusion set used in pre-training is expanded. A pre-training masking strategy based on confusion-set replacement enables the model to learn knowledge of textual errors effectively. On the public SIGHAN13, SIGHAN14 and SIGHAN15 datasets, the proposed model achieves correction F1 scores of 78.7%, 67.8% and 77.7%, respectively, which are 1.5%, 1.5% and 1.0% higher than the best baseline model.
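The idea of selective (gated) fusion in the abstract can be illustrated with a minimal sketch. The abstract does not give the gate equations, so everything below is an assumption made for illustration: one scalar sigmoid gate per auxiliary feature, computed from the concatenation of the semantic vector with that feature, and a residual sum as the fused output. The names `gated_fusion`, `w_phon`, `w_glyph`, `b_phon` and `b_glyph` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def gated_fusion(semantic, phonetic, glyph, w_phon, w_glyph,
                 b_phon=0.0, b_glyph=0.0):
    """Fuse three d-dimensional feature vectors with two scalar gates.

    Assumed form (not the paper's exact equations):
        gate_phon  = sigmoid(w_phon  . [semantic; phonetic] + b_phon)
        gate_glyph = sigmoid(w_glyph . [semantic; glyph]    + b_glyph)
        fused      = semantic + gate_phon * phonetic + gate_glyph * glyph
    w_phon and w_glyph therefore have length 2 * d.
    """
    # List addition concatenates the two feature vectors.
    gate_phon = sigmoid(dot(w_phon, semantic + phonetic) + b_phon)
    gate_glyph = sigmoid(dot(w_glyph, semantic + glyph) + b_glyph)
    fused = [s + gate_phon * p + gate_glyph * g
             for s, p, g in zip(semantic, phonetic, glyph)]
    return fused, gate_phon, gate_glyph


# Toy usage: zero weights give both gates the neutral value 0.5,
# while a strongly negative bias nearly shuts the phonetic channel.
fused, gate_phon, gate_glyph = gated_fusion(
    [1.0, 2.0], [2.0, 0.0], [0.0, 4.0], [0.0] * 4, [0.0] * 4)
fused_closed, gate_closed, _ = gated_fusion(
    [1.0, 2.0], [2.0, 0.0], [0.0, 4.0], [0.0] * 4, [0.0] * 4, b_phon=-50.0)
```

In this sketch a gate close to 0 suppresses a misleading phonetic or glyph feature, and the gate values themselves can be inspected per character, which is one way such a design can support interpretability.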

Key words: Chinese spelling correction, pre-training, gated feature fusion, four corner code
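The pre-training keyword refers to the masking strategy based on confusion-set replacement described in the abstract. A rough sketch of that kind of corruption step follows. The replacement rate, the function name `confusion_mask` and the two-entry toy confusion set are assumptions for illustration only; the paper's actual confusion set, which is expanded using four corner codes, and its masking ratios are not given in the abstract.

```python
import random


def confusion_mask(chars, confusion_set, rate, rng):
    """Corrupt a character sequence using a confusion set.

    Each character that has an entry in `confusion_set` is replaced,
    with probability `rate`, by a randomly chosen confusable character.
    Returns (corrupted, labels), where `labels` holds the original
    characters that a model would be trained to recover.
    """
    corrupted, labels = [], []
    for ch in chars:
        candidates = confusion_set.get(ch)
        if candidates and rng.random() < rate:
            corrupted.append(rng.choice(candidates))
        else:
            corrupted.append(ch)
        labels.append(ch)
    return corrupted, labels


# Toy confusion set (hypothetical): a homophone pair and a look-alike group.
toy_confusion_set = {
    "在": ["再"],        # same pronunciation (zai)
    "己": ["已", "巳"],  # similar glyphs
}

rng = random.Random(0)  # fixed seed so the corruption is reproducible
corrupted, labels = confusion_mask("我在自己家", toy_confusion_set, 0.5, rng)
```

Compared with replacing characters by a generic mask token, sampling from a confusion set exposes the model to errors that resemble real misspellings, which is the motivation the abstract gives for this strategy.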

CLC Number: