Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications ›› 2023, Vol. 46 ›› Issue (5): 112-117.


Cross-modal Retrieval Algorithm for Image and Text Based on Pre-trained Models and Encoders

  

  • Received: 2023-07-15 Revised: 2023-08-13 Online: 2023-10-28 Published: 2023-11-03

Abstract: With the advent of the Internet era, the volume of image and text data on the web has grown exponentially, and retrieving the information people need from such massive data efficiently and accurately has become a pressing problem. Current mainstream image-text cross-modal retrieval architectures are based on either dual encoders or fusion encoders. A dual encoder encodes the image and the text separately and then computes the similarity between the resulting image and text vectors; retrieval is efficient, but accuracy is insufficient. A fusion encoder jointly encodes the image and the text to obtain a similarity score; accuracy is high, but retrieval is inefficient. To address the shortcomings of both architectures, this paper proposes an image-text cross-modal retrieval algorithm based on pre-trained models and encoders. First, a recall-and-ranking strategy is proposed, in which a dual encoder performs coarse recall and a fusion encoder performs precise ranking. Second, a method for building the dual encoder and the fusion encoder on a multi-channel Transformer pre-trained model is proposed, which achieves high-quality semantic alignment between text and images and improves retrieval performance. Experiments on two public datasets, MSCOCO and Flickr30k, demonstrate the effectiveness of the proposed algorithm.
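
The two-stage strategy described in the abstract (coarse recall with a dual encoder, then precise ranking with a fusion encoder) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `cosine` and `recall_then_rank`, the `top_k` parameter, the use of cosine similarity, and the toy vectors and scores are all assumptions made for illustration. The dual encoder is represented only by precomputed vectors, and the fusion encoder by a hypothetical `fusion_score` callable.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_then_rank(
    query: str,
    query_vec: Vector,
    items: List[str],
    item_vecs: List[Vector],
    fusion_score: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Hypothetical two-stage retrieval: cheap recall, then expensive reranking."""
    # Stage 1 (dual-encoder style): item vectors are computed independently of
    # the query, so recall only needs one vector comparison per item.
    by_similarity = sorted(
        range(len(items)),
        key=lambda i: cosine(query_vec, item_vecs[i]),
        reverse=True,
    )
    shortlist = by_similarity[:top_k]

    # Stage 2 (fusion-encoder style): the joint scorer sees query and item
    # together, so it is applied only to the top_k shortlisted candidates.
    scored = [(items[i], fusion_score(query, items[i])) for i in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed): three candidate images and stand-in joint scores.
toy_items = ["img_a", "img_b", "img_c"]
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_joint = {"img_a": 0.2, "img_b": 0.9, "img_c": 1.0}

result = recall_then_rank(
    query="a dog on the beach",
    query_vec=[1.0, 0.0],
    items=toy_items,
    item_vecs=toy_vecs,
    fusion_score=lambda q, item: toy_joint[item],
    top_k=2,
)
print(result)  # [('img_b', 0.9), ('img_a', 0.2)]
```

In this toy run, `img_c` never reaches the second stage because it falls outside the recalled shortlist, even though its stand-in joint score is the highest. This is the trade-off the strategy accepts: the expensive scorer runs on `top_k` candidates instead of the whole collection, so final accuracy depends on the recall stage keeping the relevant items.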

Key words: Cross-modal retrieval algorithm, pre-trained model, dual encoders, fusion encoders.
