Journal of Beijing University of Posts and Telecommunications

  • EI-indexed core journal

Journal of Beijing University of Posts and Telecommunications ›› 2017, Vol. 40 ›› Issue (s1): 85-88. doi: 10.13190/j.jbupt.2017.s.019

• Papers

An Algorithm for Analyzing Reference Records in HTML Pages

ZENG Qing-tao1,2, XIE Kai1, LI Ye-li1, WANG Xin-gang3, YE Yu-shan1, MA Shao-ping2

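  1. School of Information Engineering, Beijing Institute of Graphic Communication, Beijing 102600;
    2. Postdoctoral Research Station in Computer Science and Technology, Tsinghua University, Beijing 100084;
    3. Broadcast and Television Direct Broadcasting Satellite Management Center, The State Administration of Press, Publication, Radio, Film and Television, Beijing 100045
  • Received: 2016-05-26 Online: 2017-09-28 Published: 2017-09-28
  • About the author: ZENG Qing-tao (born 1982), male, lecturer, E-mail: jiakechongbeijing@163.com.
  • Funding:
    Beijing Municipal Education Commission Science and Technology Innovation Service Capacity Building Project (PXM2016_014223_000025); Beijing Institute of Graphic Communication University-Level Key Project (ea201507); Beijing Institute of Graphic Communication Faculty Development: Doctoral Start-up Fund Project (27170116005/062); Beijing Institute of Graphic Communication Research Project: Publication Data Asset Evaluation Laboratory Construction Project (20190116005/006).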

Analysis Algorithm of Reference Record in HTML Page

ZENG Qing-tao1,2, XIE Kai1, LI Ye-li1, WANG Xin-gang3, YE Yu-shan1, MA Shao-ping2   

  1. School of Information Engineering, Beijing Institute of Graphic Communication, Beijing 102600, China;
    2. Postdoctoral Research Station in Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    3. Broadcast and Television Direct Broadcasting Satellite Management Center, The State Administration of Press, Publication, Radio, Film and Television, Beijing 100045, China
  • Received: 2016-05-26 Online: 2017-09-28 Published: 2017-09-28

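Abstract: To enable publishing organizations to promptly find the references they need among large numbers of web pages, an algorithm is needed that automatically extracts reference information from Hypertext Markup Language (HTML) pages. To this end, a reference record analysis algorithm based on conditional random fields is designed. First, a segmentation algorithm for the document object tree is designed, which uses split markers to divide the page data into independent parts; these data blocks consist of sequences of tags and text. Next, these sequences are used as the feature vectors of a conditional random field model to build a reference information labeling model. Finally, a heuristic algorithm is designed to extract reference information data from the labeling model, and its effectiveness is verified by experiments.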

Key words: digital publishing, conditional random field, reference record analysis

Abstract: With the rapid development of the Internet, web pages have become a main source of information. To help publishing agencies find the references they need among a large number of pages in a timely manner, an algorithm is needed that extracts useful reference information from Hypertext Markup Language (HTML) pages. A reference analysis algorithm based on conditional random fields is proposed. First, a segmentation algorithm for the document object tree is designed: a classifier divides the web page data into separate parts, and the resulting data blocks consist of sequences of tags and text. Next, these sequences are taken as the feature vectors of a conditional random field model to establish a reference information labeling model. Finally, a heuristic algorithm is presented to extract reference information data from the labeling model, and the validity of the algorithm is verified by experiments.

Key words: digital publishing, conditional random field, reference analysis

CLC number:
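The segmentation step described in the abstract, in which a page is split into blocks made of tag and text sequences, can be illustrated with a minimal sketch. This is not the authors' implementation: the set of split tags, the names `BlockSegmenter`, `segment`, and `token_features`, and the `is_year` feature are all assumptions made for illustration, and the sketch uses Python's streaming `html.parser` rather than an explicit document object tree. The per-token feature dictionaries only show the general shape of input that a conditional random field toolkit typically expects.

```python
from html.parser import HTMLParser

# Hypothetical split markers; the abstract does not list the tags actually used.
SPLIT_TAGS = {"li", "p", "tr", "br", "div"}


class BlockSegmenter(HTMLParser):
    """Split an HTML page into blocks, each a sequence of tag and text tokens."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks
        self._current = []  # tokens of the block being built

    def _flush(self):
        # Close the current block, keeping it only if it carries some text.
        if any(kind == "TEXT" for kind, _ in self._current):
            self.blocks.append(self._current)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SPLIT_TAGS:
            self._flush()  # a split marker starts a new block
        else:
            self._current.append(("TAG", tag))

    def handle_endtag(self, tag):
        if tag in SPLIT_TAGS:
            self._flush()  # a closing split marker ends the block

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._current.append(("TEXT", text))

    def close(self):
        super().close()
        self._flush()  # emit whatever is left at end of input


def segment(html):
    """Return the list of blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def token_features(block):
    """Map one block to per-token feature dicts (assumed feature set)."""
    feats = []
    for position, (kind, value) in enumerate(block):
        core = value.strip("().,")
        feats.append({
            "kind": kind,
            "tag": value if kind == "TAG" else "",
            # Crude, assumed cue for a publication year: a four-digit text token.
            "is_year": kind == "TEXT" and core.isdigit() and len(core) == 4,
            "position": position,
        })
    return feats
```

Calling `segment` on a list page where each `<li>` holds one reference yields one block per reference. A CRF labeler trained on such feature sequences would then assign field labels such as author, title, or year to each token; that training step is outside this sketch.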