Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications, 2024, Vol. 47, Issue (4): 130-135.

• Papers

Fine-Grained Image Classification Based on Multi-Modal Features and Enhanced Alignment

HAN Jing (韩晶), ZHANG Tianpeng (张天鹏), LYU Xueqiang (吕学强)

  1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University
  • Received: 2023-07-04  Revised: 2023-10-12  Online: 2024-08-28  Published: 2024-08-26
  • Corresponding author: LYU Xueqiang  E-mail: icddtxyx@163.com
  • Funding:
    National Natural Science Foundation of China; Beijing Natural Science Foundation; General Science and Technology Project of the Beijing Municipal Education Commission Scientific Research Program

Abstract: Existing models for multimodal information processing suffer from inadequate feature extraction and insufficient information interaction. To address these problems, a fine-grained image classification model based on multimodal features and enhanced alignment is proposed. First, a hierarchical feature adaptive fusion module performs multi-level adaptive fusion of multimodal features, making full use of the feature information in the intermediate convolutional layers and strengthening the model's perception of local image details. Second, an enhanced alignment feature fusion module raises the interaction dimension between multimodal features and mines the mapping relationships between different modalities. Experimental results show that the proposed model achieves good recognition performance on several public datasets and outperforms previous multimodal feature fusion models. Ablation experiments further show that each of the two modules, used on its own, outperforms the original model, which supports the effectiveness of the proposed model.

Key words: deep learning, fine-grained image classification, multimodal, adaptive feature fusion, attention mechanism

CLC number:
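
The abstract describes adaptive fusion of features taken from several levels of a network, but gives no equations or implementation details. The following is only a minimal sketch of one common way to do such fusion, assuming each level has already been reduced to a vector of the same length and that each level has one scalar score (learned in a real model, fixed here). The names `softmax`, `adaptive_fuse`, `level_features`, and `level_scores` are hypothetical and do not come from the paper; this is not the authors' module.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(level_features, level_scores):
    """Weighted sum of same-length feature vectors, one vector per level.

    Assumption: in a trained model the scores would be learned parameters
    or the output of a small gating network; here they are plain inputs.
    """
    if not level_features or len(level_features) != len(level_scores):
        raise ValueError("need one score per feature level")
    dim = len(level_features[0])
    if any(len(f) != dim for f in level_features):
        raise ValueError("all levels must have the same feature length")

    weights = softmax(level_scores)
    fused = [
        sum(w * f[i] for w, f in zip(weights, level_features))
        for i in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    shallow = [1.0, 2.0]  # stand-in for an intermediate-layer feature
    deep = [3.0, 4.0]     # stand-in for a final-layer feature
    fused, weights = adaptive_fuse([shallow, deep], [0.0, 0.0])
    print(fused, weights)
```

With equal scores the fusion reduces to a plain average of the levels; raising one level's score shifts the fused vector toward that level, which is the sense in which the weighting is adaptive.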