Journal of Beijing University of Posts and Telecommunications

  • EI Core Journal

Journal of Beijing University of Posts and Telecommunications ›› 2025, Vol. 48 ›› Issue (2): 8-17.


Long Text Categorization for Decision Impact Evaluation of Scientific Dataset

  

  • Received:2024-02-29 Revised:2024-05-09 Online:2025-04-30 Published:2025-04-30

Abstract: Open scientific datasets enable policymakers to make more targeted decisions, so the automatic and objective evaluation of their decision-making impact has attracted attention at home and abroad. To evaluate this impact effectively, it is crucial to determine from government announcement documents whether a given dataset is relevant to them. Government announcements, however, are usually long texts, so determining whether a policy was influenced by a dataset becomes a long text classification task. Accordingly, this paper proposes a long text classification method based on word feature fusion (LTC-WF) for evaluating the decision-making impact of scientific datasets, that is, for determining whether a policy is influenced by the corresponding dataset. First, the words in the text are categorized by their lexical properties, and words with the same lexical properties are integrated into phrase groups. Then, the categorized phrase groups and the original text are embedded separately. Finally, to verify the effect of integrating lexical information, two fusion units are designed: a splicing (concatenation) fusion unit and a gated fusion unit. The gated fusion unit aggregates the phrase group embedding vectors and the original text embedding vectors by assigning them different weights, producing the final embedding representation of the text used for classification. Experimental results on the constructed scientific data policy dataset show that the model outperforms existing mainstream methods, and that the method provides an effective technical solution for evaluating the decision-making impact of scientific datasets.
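
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes that "lexical properties" are part-of-speech tags supplied by an external tagger, and that the gate takes the commonly used form g = sigmoid(W[t; p] + b) with output g*t + (1 - g)*p, where t is the original text embedding and p is the phrase group embedding. The names group_by_pos, concat_fusion, gated_fusion, weights, and bias are hypothetical.

```python
import math
from collections import defaultdict


def group_by_pos(tagged_tokens):
    """Collect words that share a tag, keeping their original order.

    tagged_tokens is a list of (word, tag) pairs; the tags are assumed
    to come from an external part-of-speech tagger.
    """
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word)
    return dict(groups)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(text_vec, phrase_vec):
    """Splicing fusion: place the two embeddings end to end."""
    return list(text_vec) + list(phrase_vec)


def gated_fusion(text_vec, phrase_vec, weights, bias):
    """Gated fusion (assumed form): weight the two embeddings per dimension.

    weights has one row per output dimension, each row as long as the
    concatenated input; bias has one entry per output dimension.
    """
    joint = concat_fusion(text_vec, phrase_vec)
    fused = []
    for i, (t, p) in enumerate(zip(text_vec, phrase_vec)):
        gate = sigmoid(sum(w * x for w, x in zip(weights[i], joint)) + bias[i])
        fused.append(gate * t + (1.0 - gate) * p)
    return fused
```

In this sketch the splicing unit doubles the embedding width, while the gated unit keeps the original width and lets learned parameters decide, dimension by dimension, how much the lexical phrase groups contribute relative to the original text. In a trained model the weights and bias would be learned together with the classifier.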

Key words: scientific dataset, gating mechanisms, lexical features, long text, text classification
