[1] Vinyals O, Toshev A, Bengio S, et al. Show and tell: a neural image caption generator[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2015: 3156-3164.
[2] Lu Jiasen, Xiong Caiming, Parikh D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2017: 375-383.
[3] Mao Yuzhao, Zhou Chang, Wang Xiaojie, et al. Show and tell more: topic-oriented multi-sentence image captioning[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 2018: 4258-4264.
[4] Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: JMLR.org, 2015: 2048-2057.
[5] You Quanzeng, Jin Hailin, Wang Zhaowen, et al. Image captioning with semantic attention[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2016: 4651-4659.
[6] Karpathy A, Li Feifei. Deep visual-semantic alignments for generating image descriptions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2015: 3128-3137.
[7] Anderson P, He Xiaodong, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2018: 6077-6086.
[8] Krause J, Johnson J, Krishna R, et al. A hierarchical approach for generating descriptive image paragraphs[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2017: 317-325.
[9] Liang Xiaodan, Hu Zhiting, Zhang Hao, et al. Recurrent topic-transition GAN for visual paragraph generation[C]//2017 IEEE International Conference on Computer Vision (ICCV). New York: IEEE Press, 2017: 3362-3371.
[10] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]//Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014: 2672-2680.
[11] Chatterjee M, Schwing A G. Diverse and coherent paragraph generation from images[M]//Computer Vision - ECCV 2018. Cham: Springer International Publishing, 2018: 747-763.
[12] Wang Z, Luo Y, Li Y, et al. Look deeper see richer: depth-aware image paragraph captioning[C]//2018 ACM Multimedia Conference (MM '18). New York: ACM Press, 2018: 672-680.
[13] Che Wenbin, Fan Xiaopeng, Xiong Ruiqin, et al. Paragraph generation network with visual relationship detection[C]//2018 ACM Multimedia Conference (MM '18). New York: ACM Press, 2018: 1435-1443.
[14] Dauphin Y N, Fan A, Auli M, et al. Language modeling with gated convolutional networks[C]//Proceedings of the 34th International Conference on Machine Learning - Volume 70. Sydney, Australia: JMLR.org, 2017: 933-941.
[15] Krishna R, Zhu Yuke, Groth O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73.
[16] Chen X, Fang H, Lin T Y, et al. Microsoft COCO captions: data collection and evaluation server[J]. arXiv preprint arXiv:1504.00325, 2015.
[17] Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Stroudsburg, PA, USA: ACL, 2002: 311-318.
[18] Vedantam R, Zitnick C L, Parikh D. CIDEr: consensus-based image description evaluation[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE Press, 2015: 4566-4575.