Journal of Beijing University of Posts and Telecommunications (北京邮电大学学报)

  • EI-indexed core journal

Journal of Beijing University of Posts and Telecommunications, 2025, Vol. 48, Issue (2): 8-17.

• Paper

A Long Text Classification Method for Evaluating the Decision-Making Impact of Scientific Datasets

Gao Yuwei (高瑜蔚), Wang Zhaojun (王兆钧), Hu Lianglin (胡良霖), You Xindong (游新冬), Zhou Jianshe (周建设)

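  1. Department of Big Data Application Technology and Development, Computer Network Information Center, Chinese Academy of Sciences
  2. China Language Intelligence Center, Capital Normal University
  3. National Basic Science Data Center, Ministry of Science and Technology
  4. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University
  • Received: 2024-02-29 Revised: 2024-05-09 Published: 2025-04-30 Released online: 2025-04-30
  • Corresponding author: Zhou Jianshe (周建设), E-mail: 1019285127@qq.com
  • Funding:
    National Key Research and Development Program of China; Informatization Special Project of the Chinese Academy of Sciences, "Capacity Building of the Basic Science Public Scientific Data Center of the Chinese Academy of Sciences"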

Long Text Categorization for Decision Impact Evaluation of Scientific Dataset

  • Received: 2024-02-29 Revised: 2024-05-09 Online: 2025-04-30 Published: 2025-04-30

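Abstract (摘要): In the era of data as a factor of production, influence on decision-making has become an important dimension for measuring the quality and value of open scientific datasets, and automatically and objectively extracting and evaluating that influence from government documents has become a feasible approach. Government announcement documents are usually long texts, and judging their relevance to a scientific dataset, a key step in the process, can be cast as a long text classification task. To address this problem in the setting of evaluating the decision-making impact of scientific datasets, a long text classification method based on part-of-speech feature fusion (LTC-WF) is proposed. First, the words in the text are classified by part of speech, and phrases with the same part of speech are merged. Next, the grouped phrases and the original text are embedded separately. Finally, to verify the effect of fusing part-of-speech information, a concatenation fusion unit and a gated fusion unit are designed; the gated fusion unit assigns different weights to the phrase-group embedding vector and the original-text embedding vector, aggregates them, and produces the final embedding of the text for classification. Experiments on a purpose-built scientific data policy dataset show that the method outperforms existing mainstream methods and offers an effective technical solution for evaluating the decision-making impact of scientific datasets.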

Keywords (关键词): scientific dataset, gating mechanism, part-of-speech features, long text, text classification
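The first step named in the abstract, grouping words by part of speech and merging each group, can be illustrated with a minimal sketch. It assumes the text has already been segmented and tagged by some external tagger; the tag names (`NOUN`, `VERB`, `ADJ`), the helper names `group_by_pos` and `groups_to_text`, and the sample sentence are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def group_by_pos(tagged_tokens):
    """Group words by part-of-speech tag, keeping the order in which they appear."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)


def groups_to_text(groups, tag_order=("NOUN", "VERB", "ADJ")):
    """Join each tag's words into one phrase group; tags not in tag_order are dropped."""
    return [" ".join(groups[tag]) for tag in tag_order if tag in groups]


# Hypothetical pre-tagged input (assumption: tagging was done upstream).
tagged = [
    ("ministry", "NOUN"), ("releases", "VERB"), ("open", "ADJ"),
    ("climate", "NOUN"), ("dataset", "NOUN"), ("supports", "VERB"),
    ("policy", "NOUN"),
]

groups = group_by_pos(tagged)
phrase_groups = groups_to_text(groups)
print(phrase_groups)
# ['ministry climate dataset policy', 'releases supports', 'open']
```

Each phrase group would then be embedded alongside the original text; which tags are kept and how groups are ordered are design choices the abstract does not specify.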

Abstract: Open scientific datasets enable policymakers to make more targeted decisions, so the automatic and objective evaluation of their decision-making impact has attracted attention at home and abroad. To evaluate this impact effectively, it is crucial to determine how relevant a government announcement document is to a given dataset. Such documents are usually long texts, so deciding whether a document was influenced by a dataset becomes a long text classification task. Accordingly, this paper proposes a long text classification method based on part-of-speech feature fusion (LTC-WF) for evaluating the decision-making impact of scientific datasets, that is, for determining whether a policy is influenced by the corresponding dataset. First, the words in the text are categorized by part of speech, and the phrases that share a part of speech are merged. Then, the categorized phrase groups and the original text are embedded separately. Finally, to verify the effect of integrating part-of-speech information, a concatenation fusion unit and a gated fusion unit are designed; the gated fusion unit aggregates the phrase-group embedding vectors and the original-text embedding vectors by assigning each a different weight, generating the final embedding of the text for classification. Experimental results on the constructed scientific data policy dataset show that the model outperforms existing mainstream methods, and the method provides an effective technical solution for evaluating the decision-making impact of scientific datasets.

Key words: scientific dataset, gating mechanisms, part-of-speech features, long text, text classification
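The two fusion units mentioned in the abstract can be contrasted with a small sketch. The abstract only states that the gated unit assigns different weights to the two embeddings; the specific gate used here, g = sigmoid(W [h_text; h_pos] + b) with output g * h_text + (1 - g) * h_pos, is a common formulation adopted as an assumption, not the paper's confirmed design. The names `concat_fusion`, `gated_fusion`, `W`, and `b`, and the random toy vectors, are illustrative only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(h_text, h_pos):
    """Concatenation fusion: stack the two embeddings into one longer vector."""
    return h_text + h_pos


def gated_fusion(h_text, h_pos, W, b):
    """Gated fusion (assumed form): per-dimension gate over the joint vector.

    g_i = sigmoid(W[i] . [h_text; h_pos] + b[i])
    out_i = g_i * h_text[i] + (1 - g_i) * h_pos[i]
    """
    joint = h_text + h_pos
    fused = []
    for i in range(len(h_text)):
        z = sum(w * x for w, x in zip(W[i], joint)) + b[i]
        g = sigmoid(z)
        fused.append(g * h_text[i] + (1.0 - g) * h_pos[i])
    return fused


# Toy embeddings standing in for encoder outputs (assumption: same dimension d).
random.seed(0)
d = 4
h_text = [random.uniform(-1, 1) for _ in range(d)]
h_pos = [random.uniform(-1, 1) for _ in range(d)]
W = [[random.gauss(0, 0.1) for _ in range(2 * d)] for _ in range(d)]
b = [0.0] * d

concat = concat_fusion(h_text, h_pos)
gated = gated_fusion(h_text, h_pos, W, b)
print(len(concat), len(gated))  # 8 4
```

Concatenation doubles the dimension and leaves the weighting to the downstream classifier, whereas the gate keeps the dimension fixed and learns, per dimension, how much to trust the original-text embedding versus the phrase-group embedding.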

CLC number: