
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The Evolution of Image Captioning Models: Trends, Techniques, and Future Challenges
Corresponding Author(s): Abrar Wahid Abrar
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 10, No. 4, November 2025
Abstract
This study presents a comprehensive systematic literature review (SLR) of the evolution of image captioning models from 2017 to 2025, with particular emphasis on major architectural developments, methodological enhancements, and open challenges. Motivated by the growing demand for accurate and contextually aware image descriptions, the review follows the PRISMA methodology and selects 36 relevant papers from reputable scientific databases. The results indicate a clear transition from traditional CNN-RNN models to Transformer-based architectures, yielding improved semantic coherence and contextual understanding. Recent techniques such as prompt engineering and GAN-based augmentation have further improved generalization and caption diversity, while multimodal fusion approaches that combine attention mechanisms with knowledge integration have raised caption quality. Significant concerns remain around data bias, fairness in model evaluation, and support for low-resource languages. The review also highlights that modern vision-language models such as Flamingo, GIT, and LLaVA achieve strong domain generalization through cross-modal learning and joint embeddings, and that advances in pretraining procedures and lightweight models improve computational efficiency in resource-constrained environments. The study contributes by identifying future research directions, analyzing technical trade-offs, and outlining research trends, particularly in sectors such as healthcare, construction, and inclusive AI. The findings suggest that, to maximize their effectiveness in real-world applications, future image captioning models must prioritize resource efficiency, fairness, and multilingual capability.
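To make the architectural shift described above concrete, the sketch below pairs a CNN encoder with a Transformer decoder, the pattern the review identifies as superseding CNN-RNN pipelines. It is a minimal PyTorch illustration, not the implementation of any reviewed model: the ResNet-50 backbone, the layer sizes, and the hypothetical CnnTransformerCaptioner class are assumptions made purely for clarity.

```python
# Minimal sketch (assumptions, not a reviewed model): a CNN encoder produces a
# grid of visual features, and a Transformer decoder cross-attends over that
# grid while autoregressively generating caption tokens.
import torch
import torch.nn as nn
import torchvision.models as models

class CnnTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        # CNN backbone; weights=None keeps the sketch self-contained
        # (use models.ResNet50_Weights.DEFAULT for pretrained features).
        backbone = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.proj = nn.Linear(2048, d_model)         # map CNN channels to decoder width
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W); tokens: (B, T) caption token ids (teacher forcing).
        feats = self.encoder(images)                  # (B, 2048, h, w)
        feats = feats.flatten(2).transpose(1, 2)      # (B, h*w, 2048)
        memory = self.proj(feats)                     # visual tokens for cross-attention
        tgt = self.embed(tokens)                      # (B, T, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                         # (B, T, vocab_size) logits

# Hypothetical usage with random data:
model = CnnTransformerCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In a CNN-RNN pipeline the decoder would instead be an LSTM consuming a single pooled image vector; the design choice illustrated here is that the Transformer decoder attends over the full spatial grid at every step, which is what the reviewed literature credits for the gains in semantic coherence.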
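The review's observation that pretrained vision-language models such as GIT generalize across domains can be explored directly with off-the-shelf checkpoints. The following is a hedged usage sketch assuming the Hugging Face transformers library and the publicly released microsoft/git-base-coco checkpoint; the image path is a placeholder, and this pipeline is offered as one possible example rather than the procedure used in the review.

```python
# Hedged sketch: caption an image with a pretrained vision-language model
# (GIT) instead of training an encoder-decoder from scratch. Checkpoint name
# and image path are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("example.jpg")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The decoder attends to visual tokens produced by the image encoder;
# cross-modal fusion was learned during large-scale pretraining.
generated_ids = model.generate(pixel_values=pixel_values, max_length=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```

Comparable pipelines exist for BLIP- and LLaVA-style models; the point of the example is that pretrained cross-modal weights handle the vision-language fusion, which is why such models transfer well to new domains with little or no task-specific training.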