Analysis Algorithm of Reference Record in HTML Page

Journal of Beijing University of Posts and Telecommunications, 2017, Vol. 40, Issue (s1): 85-88. doi: 10.13190/j.jbupt.2017.s.019

ZENG Qing-tao¹,², XIE Kai¹, LI Ye-li¹, WANG Xin-gang³, YE Yu-shan¹, MA Shao-ping²

1. School of Information Engineering, Beijing Institute of Graphic Communication, Beijing 102600, China;
2. Postdoctoral Research Station in Computer Science and Technology, Tsinghua University, Beijing 100084, China;
3. Broadcast and Television Direct Broadcasting Satellite Management Center, The State Administration of Press, Publication, Radio, Film and Television, Beijing 100045, China

Received: 2016-05-26 Online: 2017-09-28 Published: 2017-09-28
About the author: ZENG Qing-tao (1982-), male, lecturer, E-mail: jiakechongbeijing@163.com.
Funding:
Beijing Municipal Education Commission Science and Technology Innovation Service Capacity Building Project (PXM2016_014223_000025); Beijing Institute of Graphic Communication Institute-Level Key Project (ea201507); Beijing Institute of Graphic Communication Faculty Development: Doctoral Start-up Fund Project (27170116005/062); Beijing Institute of Graphic Communication Research Project: Publication Data Asset Evaluation Laboratory Construction Project (20190116005/006).

Abstract: With the rapid development of the Internet, web pages have become a main source of information. To help publishing institutions promptly find the references they need among large numbers of web pages, an algorithm is needed that automatically extracts reference information from Hypertext Markup Language (HTML) pages. To this end, a reference record analysis algorithm based on conditional random fields (CRF) is proposed. First, a segmentation algorithm for the document object tree is designed; using segmentation markers, it divides the page data into independent blocks, each consisting of a sequence of tags and text. Next, this sequence is used as the feature vector of a conditional random field model to build a reference information labeling model. Finally, a heuristic algorithm is designed to extract reference information data from the labeling model, and its effectiveness is verified by experiments.
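The segmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say which segmentation markers are used, so the `SPLIT_TAGS` set, the `BlockSegmenter` class, and the `segment` function are all assumptions made for illustration. The sketch walks the page with the standard-library `html.parser` and cuts the token stream into blocks, each a sequence of tag and text tokens.

```python
from html.parser import HTMLParser

# Hypothetical segmentation markers: tags assumed to delimit one record.
SPLIT_TAGS = {"li", "p", "tr", "div"}


class BlockSegmenter(HTMLParser):
    """Collects (kind, value) tokens and cuts them into blocks at SPLIT_TAGS."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks
        self.current = []  # tokens of the block being built

    def flush(self):
        # Keep a block only if it contains at least one text token.
        if any(kind == "text" for kind, _ in self.current):
            self.blocks.append(self.current)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SPLIT_TAGS:
            self.flush()                      # a marker closes the open block
        else:
            self.current.append(("tag", tag))  # other tags become tokens

    def handle_endtag(self, tag):
        if tag in SPLIT_TAGS:
            self.flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.append(("text", text))


def segment(html):
    """Return a list of blocks; each block is a list of tag/text tokens."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit whatever is left after the last marker
    return parser.blocks
```

For example, `segment("<ul><li><b>Zhang Li</b>. 2005</li><li>Lin Lan. 2010</li></ul>")` yields two blocks, one per `li` element. Each block could then serve as one observation sequence for a labeling model.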

Key words: digital publishing, conditional random field, reference analysis
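To make the conditional random field labeling step and the heuristic extraction step more concrete, the following Python sketch decodes a linear-chain model with the Viterbi algorithm and then groups the labeled tokens into fields. Everything specific here is an assumption: the label set, the hand-set `START`, `TRANSITION`, and `emission` scores, and the `group_fields` heuristic are illustrative stand-ins, not the paper's trained model or its actual heuristic. A real CRF would learn the weights from annotated records.

```python
import re

LABELS = ["AUTHOR", "TITLE", "YEAR"]

# Hand-set scores, purely illustrative; a trained CRF would learn them.
START = {"AUTHOR": 1.0, "TITLE": 0.0, "YEAR": -1.0}
TRANSITION = {
    ("AUTHOR", "AUTHOR"): 1.0, ("AUTHOR", "TITLE"): 0.5, ("AUTHOR", "YEAR"): -1.0,
    ("TITLE", "AUTHOR"): -3.0, ("TITLE", "TITLE"): 1.0, ("TITLE", "YEAR"): 0.5,
    ("YEAR", "AUTHOR"): -3.0, ("YEAR", "TITLE"): -3.0, ("YEAR", "YEAR"): -3.0,
}


def emission(tokens, i, label):
    """Score label for token i; features may look at neighboring tokens."""
    tok = tokens[i]
    is_year = re.fullmatch(r"(19|20)\d{2}\.?", tok) is not None
    after_period = i > 0 and tokens[i - 1].endswith(".")
    if label == "YEAR":
        return 4.0 if is_year else -4.0
    if is_year:
        return -4.0
    if label == "AUTHOR":
        return 1.0 if tok[0].isupper() else -1.0
    return 2.0 if after_period else 0.5  # TITLE


def viterbi(tokens):
    """Return the highest-scoring label sequence for the tokens."""
    if not tokens:
        return []
    score = [{lab: START[lab] + emission(tokens, 0, lab) for lab in LABELS}]
    back = []  # back[k][lab] = best previous label for position k + 1
    for i in range(1, len(tokens)):
        row, ptr = {}, {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: score[-1][p] + TRANSITION[(p, lab)])
            row[lab] = score[-1][prev] + TRANSITION[(prev, lab)] + emission(tokens, i, lab)
            ptr[lab] = prev
        score.append(row)
        back.append(ptr)
    path = [max(LABELS, key=lambda lab: score[-1][lab])]
    for ptr in reversed(back):  # follow back-pointers from the last token
        path.append(ptr[path[-1]])
    return path[::-1]


def group_fields(tokens, labels):
    """Hypothetical heuristic: join tokens per label, trim trailing periods."""
    fields = {}
    for tok, lab in zip(tokens, labels):
        fields.setdefault(lab, []).append(tok)
    return {lab: " ".join(toks).rstrip(".") for lab, toks in fields.items()}
```

With the toy input `"Zhang Li. Evolution of literature retrieval. 2005".split()`, `viterbi` labels the first two tokens `AUTHOR`, the next four `TITLE`, and the last `YEAR`, and `group_fields` returns the three fields as plain strings. The transition scores matter at the word "Evolution": it is capitalized like an author name, and only the combined path score places it in the title.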

CLC number: TP393

ZENG Qing-tao, XIE Kai, LI Ye-li, WANG Xin-gang, YE Yu-shan, MA Shao-ping. Analysis Algorithm of Reference Record in HTML Page[J]. Journal of Beijing University of Posts and Telecommunications, 2017, 40(s1): 85-88.

