Journal of Beijing University of Posts and Telecommunications

  • EI Core Journal

JOURNAL OF BEIJING UNIVERSITY OF POSTS AND TELECOM ›› 2017, Vol. 40 ›› Issue (s1): 85-88. doi: 10.13190/j.jbupt.2017.s.019


Analysis Algorithm of Reference Record in HTML Page

ZENG Qing-tao1,2, XIE Kai1, LI Ye-li1, WANG Xin-gang3, YE Yu-shan1, MA Shao-ping2   

    1. School of Information Engineering, Beijing Institute of Graphic Communication, Beijing 102600, China;
    2. Postdoctoral Research Station in Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    3. Broadcast and Television Direct Broadcasting Satellite Management Center, The State Administration of Press, Publication, Radio, Film and Television, Beijing 100045, China
  • Received: 2016-05-26 Online: 2017-09-28 Published: 2017-09-28

Abstract: With the rapid development of the Internet, web pages have become the main sources of information. To help publishing agencies promptly find the references they need among a large number of pages, a reference information extraction algorithm is required to obtain useful reference information from Hypertext Markup Language (HTML) pages. A reference analysis algorithm based on conditional random fields was proposed. First, a document object tree segmentation algorithm was designed: a classifier divided the web page data into separate blocks, each composed of tag and text sequences. Next, these sequences were taken as feature vectors of a conditional random field model to establish a reference information labeling model. Finally, a heuristic algorithm was presented to extract reference data from the labeling model, and the validity of the algorithm was verified by experiments.
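To make the first step more concrete, the following is a minimal Python sketch of splitting an HTML page into blocks of tag and text tokens, which is the kind of sequence the abstract says is passed to the conditional random field model. It is not the authors' algorithm. The abstract does not describe the classifier used for segmentation, so a fixed set of block-level tags (`BLOCK_TAGS`) stands in for it here. The names `BlockSegmenter` and `segment` are likewise hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate reference blocks. The paper uses a
# classifier for this decision; a fixed tag set is only a stand-in.
BLOCK_TAGS = {"li", "p", "div", "tr"}


class BlockSegmenter(HTMLParser):
    """Split a page into blocks, each a sequence of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks
        self._current = []  # tokens of the block being built

    def flush(self):
        # Keep a block only if it contains at least one text token.
        if any(kind == "text" for kind, _ in self._current):
            self.blocks.append(self._current)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()  # a block-level tag starts a new block
        else:
            self._current.append(("tag", tag))  # inline tags become tokens

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()  # a closing block-level tag ends the block

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._current.append(("text", text))


def segment(html):
    """Return the list of tag/text token sequences found in `html`."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing block
    return parser.blocks


if __name__ == "__main__":
    page = "<ul><li><i>Smith J.</i> A title. 2001.</li><li>Doe A. <b>Another</b></li></ul>"
    for block in segment(page):
        print(block)
```

Each resulting block could then be converted into per-token features and labeled by a sequence model. That labeling step and the final heuristic extraction are not sketched here, because the abstract gives no detail on the features or heuristics used.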

Key words: digital publishing, conditional random field, reference analysis
