Traditional Chinese Medicine Symptom Normalization Approach Based on Pre-trained Language Models

Journal of Beijing University of Posts and Telecommunications, 2022, Vol. 45, Issue 4: 14-20.

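XIE Yonghong¹,², TAO Hu¹,², JIA Qi¹, YANG Shibing¹, HAN Xinliang²

1. School of Computer and Communication Engineering, University of Science and Technology Beijing
2. Beijing Key Laboratory of Knowledge Engineering for Materials Science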

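Received: 2021-09-01 Revised: 2021-11-17 Online: 2022-08-28 Published: 2022-06-26
Corresponding author: XIE Yonghong, E-mail: xieyh@ustb.edu.cn
Funding:
National Key R&D Program of China, Cloud Computing and Big Data special project, "Big-Data-Driven Intelligent Auxiliary Diagnosis Service System for Traditional Chinese Medicine"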

Abstract

Abstract: Symptom normalization plays a vital role in mining traditional Chinese medicine (TCM) knowledge and in promoting the modernization of TCM. The task is difficult because symptom descriptions exhibit synonymy (one symptom expressed with different wordings) and one-to-many correspondences between descriptions and standard symptom terms. To address this problem, a two-stage framework based on pre-trained language models is proposed. In the first stage, guided by the definition and classification of TCM symptom terms, a multi-label text classification model semantically partitions the original symptom description and yields the candidate standard symptom terms under each predicted semantic label. In the second stage, a symptom term matching model scores and ranks the candidates, and the highest-scoring candidate under each semantic label is taken as the normalization result. Finally, several strategies are designed to perform a second recall over the results and improve performance. Experiments on a purpose-built symptom normalization dataset compare the results obtained with different pre-trained models. They show that the proposed method and strategies handle symptom normalization effectively, and that the ERNIE-based model performs best, with an F1 score of 0.894.
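The two-stage flow described in the abstract can be illustrated with a small toy sketch. Everything in it is a hypothetical stand-in chosen for illustration: the label set, the English term lists, the keyword-based `classify` function, and the word-overlap `match_score` function are not from the paper, which uses pre-trained language models for both stages. The second-recall strategies are not specified in the abstract and are omitted here.

```python
import re

# Hypothetical semantic labels, each with its own standard symptom terms.
STANDARD_TERMS = {
    "head": ["head pain", "dizziness"],
    "sleep": ["difficulty sleeping", "excessive dreaming"],
}

# Hypothetical trigger words standing in for a multi-label classifier.
LABEL_KEYWORDS = {
    "head": ["head", "dizzy"],
    "sleep": ["sleep", "dream"],
}


def classify(description):
    """Stage 1 (stand-in): return every label whose keywords occur in the text."""
    text = description.lower()
    return [label for label, words in LABEL_KEYWORDS.items()
            if any(w in text for w in words)]


def match_score(description, term):
    """Stage 2 scorer (stand-in): Jaccard overlap of lowercase word sets."""
    a = set(re.findall(r"[a-z]+", description.lower()))
    b = set(re.findall(r"[a-z]+", term.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0


def normalize(description):
    """Keep the highest-scoring candidate term under each predicted label."""
    result = {}
    for label in classify(description):
        candidates = STANDARD_TERMS[label]
        result[label] = max(candidates, key=lambda t: match_score(description, t))
    return result


print(normalize("pain in the head, trouble sleeping at night"))
# {'head': 'head pain', 'sleep': 'difficulty sleeping'}
```

One description yields one standard term per predicted label, which is how a single free-text description can map to several standard symptoms at once. In a realistic setting both stand-ins would be replaced by fine-tuned models, and the scorer would compare the description with each candidate semantically rather than by shared words.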

Keywords: traditional Chinese medicine, symptom normalization, entity matching, semantic classification, pre-trained language models

CLC number: TP391.1

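XIE Yonghong, TAO Hu, JIA Qi, YANG Shibing, HAN Xinliang. Traditional Chinese Medicine Symptom Normalization Approach Based on Pre-trained Language Models[J]. Journal of Beijing University of Posts and Telecommunications, 2022, 45(4): 14-20.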