Journal of Beijing University of Posts and Telecommunications (北京邮电大学学报)

  • EI core journal

Journal of Beijing University of Posts and Telecommunications ›› 2024, Vol. 47 ›› Issue (2): 11-17.


A Few-Shot Image Classification Method Based on Visual-Language Prompt Learning

李宝安 (LI Baoan), 王欣宇 (WANG Xinyu), 滕尚志 (TENG Shangzhi), 吕学强 (LYU Xueqiang)

  1. Beijing Information Science and Technology University
  • Received: 2023-01-17; Revised: 2023-05-23; Online: 2024-04-28; Published: 2024-01-24
  • Corresponding author: TENG Shangzhi (滕尚志), E-mail: tengshangzhi@bistu.edu.cn
  • Funding:
    National Natural Science Foundation of China (two projects); Beijing Natural Science Foundation; Research Project of the State Language Commission; General Science and Technology Program of the Beijing Municipal Education Commission

Visual Language Prompt Learning for Few-Shot Image Classification


Abstract (Chinese): To improve the performance and generalization ability of few-shot image classification, a method is proposed that efficiently handles few-shot classification by making full use of large-scale visual-language pre-trained models. First, in the text-encoding stage, multiple learnable text prompts are integrated to fully explore how the position of the image class label within the prompt sentence affects the model's generalization performance. Second, in the image-encoding stage, learnable visual prompts are introduced so that the pre-trained image parameters can better represent few-shot images. Finally, feature adapters are appended after the image and text feature encoders, and the network is fine-tuned on image classification datasets to improve its performance on few-shot image classification benchmarks. Extensive experiments on 10 public datasets show that, compared with existing methods, the proposed method improves the average accuracy of one-shot classification by 2.9%.

Keywords (Chinese): prompt learning, visual-language model, few-shot learning, image classification, pre-trained model

Abstract: This paper proposes a method that efficiently handles few-shot image classification by making full use of a large-scale visual-language pre-trained model. First, in the text-encoding part, multiple learnable text prompts are integrated in order to fully explore how the position of the image class label within the prompt sentence affects the generalization performance of the model. Second, a learnable visual prompt is added in the image-encoding part so that the pre-trained image parameters better represent few-shot images. Finally, a feature adapter is added after the image and text feature encoders, and the network is fine-tuned on image classification datasets so that it achieves better performance on few-shot image classification benchmarks. Extensive experimental results on 10 public datasets show that the proposed method yields a significant improvement over existing methods; for example, the average accuracy of one-shot classification is increased by 2.9%.
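The pipeline the abstract describes (learnable context tokens around the class-name token, a learnable visual prompt token, and a residual feature adapter over frozen encoders) can be sketched in a toy NumPy form. This is a minimal illustration, not the paper's implementation: the frozen encoders are stood in for by fixed random projections, and the sizes, pooling, and blend ratio `ALPHA` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB = 32          # joint embedding width (toy size; real models use 512+)
N_CTX = 4         # number of learnable context (text-prompt) tokens
CLASSES = 3       # few-shot classes

# Frozen "pre-trained" encoders, stood in for by fixed random projections.
W_text = rng.normal(size=(EMB, EMB)) / np.sqrt(EMB)
W_img = rng.normal(size=(EMB, EMB)) / np.sqrt(EMB)

# Learnable text prompt: shared context vectors placed around the class token,
# so the class token's position in the prompt can vary (as the paper probes).
ctx = rng.normal(scale=0.02, size=(N_CTX, EMB))   # learnable context tokens
class_tok = rng.normal(size=(CLASSES, EMB))       # frozen class-name embeddings

def encode_text(position="end"):
    """Pool [context + class token] into one unit-norm text feature per class."""
    feats = []
    for c in class_tok:
        if position == "end":                     # class token after the context
            seq = np.vstack([ctx, c])
        else:                                     # class token mid-context
            half = N_CTX // 2
            seq = np.vstack([ctx[:half], c[None], ctx[half:]])
        feats.append(seq.mean(axis=0) @ W_text)   # mean-pool, then frozen encoder
    f = np.stack(feats)
    return f / np.linalg.norm(f, axis=1, keepdims=True)

# Learnable visual prompt: one extra token prepended to the patch sequence.
vis_prompt = rng.normal(scale=0.02, size=(1, EMB))

def encode_image(patches):
    seq = np.vstack([vis_prompt, patches])
    f = seq.mean(axis=0) @ W_img
    return f / np.linalg.norm(f)

# Feature adapter: a small bottleneck MLP blended residually with the
# frozen feature, so fine-tuning only touches a few parameters.
W1 = rng.normal(size=(EMB, EMB // 4)) / np.sqrt(EMB)
W2 = rng.normal(size=(EMB // 4, EMB)) / np.sqrt(EMB // 4)
ALPHA = 0.2                                       # residual blend ratio

def adapt(f):
    a = np.maximum(f @ W1, 0) @ W2                # ReLU bottleneck
    out = ALPHA * a + (1 - ALPHA) * f
    return out / np.linalg.norm(out)

# Classification: cosine similarity between adapted image and text features.
patches = rng.normal(size=(9, EMB))               # a toy 3x3-patch "image"
img = adapt(encode_image(patches))
txt = encode_text(position="end")
logits = 100.0 * (txt @ img)                      # CLIP-style temperature scale
pred = int(np.argmax(logits))
```

In training, `ctx`, `vis_prompt`, `W1`, and `W2` would be the only updated parameters, while `W_text` and `W_img` stay frozen; that is what keeps the method practical in the few-shot regime.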

Key words: prompt learning, visual-language model, few-shot learning, image classification, pre-trained model
