Journal of Beijing University of Posts and Telecommunications

  • EI Core Journal

Journal of Beijing University of Posts and Telecommunications ›› 2023, Vol. 46 ›› Issue (5): 112-117.

• Paper •

Cross-modal Retrieval Algorithm for Image and Text Based on Pre-trained Models and Encoders

CHEN Xi1, PENG Jiao1, ZHANG Pengfei1, LUO Zhongli2, OU Zhonghong2

  1. State Grid Hebei Information and Telecommunication Company
    2. Beijing University of Posts and Telecommunications
  • Received: 2023-07-15 Revised: 2023-08-13 Online: 2023-10-28 Published: 2023-11-03
  • Corresponding author: OU Zhonghong E-mail: zhonghong.ou@bupt.edu.cn
  • Supported by:
    Science and Technology Project of State Grid Hebei Information and Telecommunication Company; National Natural Science Foundation of China

Abstract: With the advent of the big data era, the amount of data on the Internet has grown explosively, and retrieving the information of interest accurately and efficiently from such massive data has become a pressing problem. Current mainstream image-text cross-modal retrieval architectures are based mainly on either dual encoders or fusion encoders. The former encode images and texts separately and then compute the similarity between the resulting image and text vectors; retrieval is efficient, but accuracy is insufficient. The latter jointly encode image-text pairs to obtain a similarity score; retrieval accuracy is high, but efficiency is low. To address these problems, this paper proposes an image-text cross-modal retrieval algorithm based on pre-trained models and encoders. First, a recall-then-rank strategy is proposed, in which a dual encoder performs coarse recall and a fusion encoder then performs precise ranking. Second, a method is proposed for building both the dual encoder and the fusion encoder from a multi-channel Transformer pre-trained model, achieving high-quality semantic alignment between images and texts and improving retrieval performance. Experiments on two public datasets, MSCOCO and Flickr30k, demonstrate the effectiveness of the proposed algorithm.
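
The recall-then-rank pipeline described above can be sketched in a few lines. The Python example below is a minimal illustration, not the authors' implementation: the dual and fusion encoders are replaced by fixed random projections, and the function names (dual_encode_text, dual_encode_image, fusion_score) and feature/embedding sizes are assumptions made for the sketch. It does, however, follow the control flow the abstract describes: cheap dual-encoder similarities over the whole gallery give a coarse recall, and the more expensive fusion encoder then scores only the top-k recalled candidates for precise ranking.

# Minimal sketch of the recall-then-rank strategy (illustrative only; in the
# paper both encoders are built from a multi-channel Transformer pre-trained model).
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, EMB_DIM = 512, 256                       # toy feature / embedding sizes (assumed)
W_TXT = rng.standard_normal((FEAT_DIM, EMB_DIM))   # stand-in for the dual encoder's text branch
W_IMG = rng.standard_normal((FEAT_DIM, EMB_DIM))   # stand-in for the dual encoder's image branch
W_FUSE = rng.standard_normal((FEAT_DIM, FEAT_DIM)) # stand-in for the fusion encoder

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def dual_encode_text(text_feats: np.ndarray) -> np.ndarray:
    """Encode texts independently of images: one vector per text."""
    return l2_normalize(text_feats @ W_TXT)

def dual_encode_image(img_feats: np.ndarray) -> np.ndarray:
    """Encode images independently of texts: one vector per image."""
    return l2_normalize(img_feats @ W_IMG)

def fusion_score(text_feat: np.ndarray, img_feat: np.ndarray) -> float:
    """Jointly score a single image-text pair (expensive, run only on candidates)."""
    return float(text_feat @ W_FUSE @ img_feat)

def retrieve(query_text: np.ndarray, gallery_imgs: np.ndarray, k: int = 8) -> list:
    # Stage 1: coarse recall -- dot products between independently computed embeddings.
    t = dual_encode_text(query_text[None, :])[0]
    g = dual_encode_image(gallery_imgs)
    coarse = g @ t                                  # cosine similarity (embeddings are unit-norm)
    candidates = np.argsort(-coarse)[:k]            # keep the k most similar images
    # Stage 2: precise ranking -- the fusion encoder scores only the k recalled candidates.
    return sorted(candidates, key=lambda i: fusion_score(query_text, gallery_imgs[i]), reverse=True)

# Toy usage: 1000 pre-extracted image features, one text query.
gallery = rng.standard_normal((1000, FEAT_DIM))
query = rng.standard_normal(FEAT_DIM)
print(retrieve(query, gallery, k=8))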

Key words: cross-modal retrieval, pre-trained model, dual encoder, fusion encoder

CLC number: