
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The Evolution of Image Captioning Models: Trends, Techniques, and Future Challenges
Corresponding Author(s): Abrar Wahid Abrar
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 10, No. 4, November 2025
Abstract
This study presents a comprehensive systematic literature review (SLR) of the evolution of image captioning models from 2017 to 2025, with particular emphasis on major architectural developments, methodological enhancements, and open challenges. Motivated by the growing demand for accurate and contextually aware image descriptions, the review follows the PRISMA methodology and selects 36 relevant papers from reputable scientific databases. The results indicate a clear transition from traditional CNN-RNN models to Transformer-based architectures, yielding improved semantic coherence and contextual understanding. Recent techniques such as prompt engineering and GAN-based augmentation have further improved generalization and caption diversity, while multimodal fusion approaches that combine attention mechanisms with knowledge integration have raised caption quality. Significant concerns remain around data bias, fairness in model evaluation, and support for low-resource languages. The review also highlights that modern vision-language models such as Flamingo, GIT, and LLaVA achieve strong domain generalization through cross-modal learning and joint embeddings, and that advances in pretraining procedures and lightweight models improve computational efficiency in resource-constrained environments. The study contributes by identifying future research directions, analyzing technical trade-offs, and outlining research trends, particularly in sectors such as healthcare, construction, and inclusive AI. The findings suggest that, to maximize their effectiveness in real-world applications, future image captioning models must prioritize resource efficiency, fairness, and multilingual capability.
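To make the architectural shift described above concrete, the sketch below pairs a CNN encoder with a Transformer decoder, the pattern the review identifies as superseding CNN-RNN pipelines. It is a minimal PyTorch illustration, not the implementation of any reviewed model: the ResNet-50 backbone, the layer sizes, and the hypothetical CnnTransformerCaptioner class are assumptions made purely for clarity.

```python
# Minimal sketch (assumptions, not a reviewed model): a CNN encoder produces a
# grid of visual features, and a Transformer decoder cross-attends over that
# grid while autoregressively generating caption tokens.
import torch
import torch.nn as nn
import torchvision.models as models

class CnnTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        # CNN backbone; weights=None keeps the sketch self-contained
        # (use models.ResNet50_Weights.DEFAULT for pretrained features).
        backbone = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.proj = nn.Linear(2048, d_model)         # map CNN channels to decoder width
        self.embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W); tokens: (B, T) caption token ids (teacher forcing).
        feats = self.encoder(images)                  # (B, 2048, h, w)
        feats = feats.flatten(2).transpose(1, 2)      # (B, h*w, 2048)
        memory = self.proj(feats)                     # visual tokens for cross-attention
        tgt = self.embed(tokens)                      # (B, T, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                         # (B, T, vocab_size) logits

# Hypothetical usage with random data:
model = CnnTransformerCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In a CNN-RNN pipeline the decoder would instead be an LSTM consuming a single pooled image vector; the design choice illustrated here is that the Transformer decoder attends over the full spatial grid at every step, which is what the reviewed literature credits for the gains in semantic coherence.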
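The review's observation that pretrained vision-language models such as GIT generalize across domains can be explored directly with off-the-shelf checkpoints. The following is a hedged usage sketch assuming the Hugging Face transformers library and the publicly released microsoft/git-base-coco checkpoint; the image path is a placeholder, and this pipeline is offered as one possible example rather than the procedure used in the review.

```python
# Hedged sketch: caption an image with a pretrained vision-language model
# (GIT) instead of training an encoder-decoder from scratch. Checkpoint name
# and image path are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("example.jpg")  # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The decoder attends to visual tokens produced by the image encoder;
# cross-modal fusion was learned during large-scale pretraining.
generated_ids = model.generate(pixel_values=pixel_values, max_length=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```

Comparable pipelines exist for BLIP- and LLaVA-style models; the point of the example is that pretrained cross-modal weights handle the vision-language fusion, which is why such models transfer well to new domains with little or no task-specific training.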