
Issue

Vol. 10, No. 4, November 2025

Issue Published: Nov 1, 2025

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The Evolution of Image Captioning Models: Trends, Techniques, and Future Challenges

https://doi.org/10.22219/kinetik.v10i4.2305
Ade Bastian
Majalengka University
Abrar Wahid
Majalengka University
Zacky Hafsari
Majalengka University
Ardi Mardiana
Majalengka University

Corresponding Author(s): Abrar Wahid

221410088@unma.ac.id

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, Vol. 10, No. 4, November 2025
Article Published: Nov 1, 2025


Abstract

This study presents a comprehensive systematic literature review (SLR) of the evolution of image captioning models from 2017 to 2025, with particular emphasis on emerging challenges, methodological enhancements, and major architectural developments. Motivated by the growing demand for precise and contextually aware image descriptions, the review follows the PRISMA methodology and selects 36 relevant papers from reputable scientific databases. The results indicate a marked transition from traditional CNN-RNN models to Transformer-based architectures, leading to improved semantic coherence and contextual understanding. Recent techniques such as prompt engineering and GAN-based augmentation have further strengthened generalization and caption diversity, while multimodal fusion approaches that incorporate attention mechanisms and knowledge integration have improved caption quality. Remaining concerns include data bias, fairness in model evaluation, and support for low-resource languages. The study also shows that modern vision-language models such as Flamingo, GIT, and LLaVA offer robust domain generalization through cross-modal learning and joint embeddings, and that pretraining procedures and lightweight models improve computational efficiency in resource-constrained environments. The review contributes by identifying future directions, analyzing technical trade-offs, and delineating research trends, particularly in sectors such as healthcare, construction, and inclusive AI. The findings suggest that, to maximize their effectiveness in real-world applications, future image captioning models must prioritize resource efficiency, fairness, and multilingual capability.
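To make the architectural shift described above concrete, the following is a minimal, illustrative sketch of a CNN visual encoder feeding a Transformer decoder, the pattern the review identifies as succeeding CNN-RNN pipelines. It assumes PyTorch and torchvision; the hyperparameters (d_model, vocab_size, max_len) are placeholders, and it is not the implementation of any surveyed paper.

```python
# Minimal, illustrative CNN-encoder + Transformer-decoder captioner
# (assumed PyTorch/torchvision; not any surveyed paper's exact model).
import torch
import torch.nn as nn
import torchvision.models as models


class CNNTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3, max_len=40):
        super().__init__()
        # Visual encoder: pretrained ResNet-50 with pooling/classifier removed,
        # so the decoder can cross-attend over a 7x7 grid of region features.
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_model)
        # Language side: token/position embeddings plus a Transformer decoder.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids (teacher forcing)
        feats = self.encoder(images)                            # (B, 2048, 7, 7)
        memory = self.proj(feats.flatten(2).transpose(1, 2))    # (B, 49, d_model)
        T = captions.size(1)
        pos = torch.arange(T, device=captions.device)
        tgt = self.token_emb(captions) + self.pos_emb(pos)
        causal = torch.triu(                                    # mask future tokens
            torch.full((T, T), float("-inf"), device=captions.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                                # (B, T, vocab) logits
```

With a tokenizer and paired image-caption data, such a model would be trained with token-level cross-entropy against the shifted caption tokens. For the pretrained vision-language route the abstract highlights (e.g., GIT), caption inference can plausibly be run zero-shot through the Hugging Face transformers wrappers; the checkpoint name and generation settings below are assumptions for illustration, not part of the reviewed study.

```python
# Illustrative zero-shot captioning with an assumed pretrained GIT checkpoint
# via Hugging Face transformers; not the authors' pipeline.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("example.jpg")  # any RGB photo
pixel_values = processor(images=image, return_tensors="pt").pixel_values
ids = model.generate(pixel_values=pixel_values, max_length=30)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```

The second sketch needs no task-specific training, which reflects the few-/zero-shot generalization the review attributes to models such as GIT and Flamingo.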

Keywords

Computational Efficiency; Image Captioning; Knowledge Integration; Systematic Literature Review; Vision-Language Models
Bastian, A., Wahid, A., Hafsari, Z., & Mardiana, A. (2025). The Evolution of Image Captioning Models: Trends, Techniques, and Future Challenges. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 10(4). https://doi.org/10.22219/kinetik.v10i4.2305
References
  1. H. T. Ho et al., “A Review on Vision-Language-Based Approaches: Challenges and Applications,” Comput. Mater. Contin., vol. 82, no. 2, pp. 1733–1756, 2025. https://doi.org/10.32604/cmc.2025.060363
  2. N. M. Khassaf and N. H. M. Ali, “Improving Pre-trained CNN-LSTM Models for Image Captioning with Hyper-Parameter Optimization,” Eng. Technol. Appl. Sci. Res., vol. 14, no. 5, pp. 17337–17343, 2024. https://doi.org/10.48084/etasr.8455
  3. S. Tyagi et al., “Novel Advance Image Caption Generation Utilizing Vision Transformer and Generative Adversarial Networks,” Computers, vol. 13, no. 12, 2024. https://doi.org/10.3390/computers13120305
  4. H. B. Duy et al., “A dental intraoral image dataset of gingivitis for image captioning,” Data Br., vol. 57, p. 110960, 2024. https://doi.org/10.1016/j.dib.2024.110960
  5. Y. Li, X. Zhang, T. Zhang, G. Wang, X. Wang, and S. Li, “A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning,” Remote Sens., vol. 16, no. 21, pp. 1–20, 2024. https://doi.org/10.3390/rs16213987
  6. K. Cheng, E. Cambria, J. Liu, Y. Chen, and Z. Wu, “KE-RSIC: Remote Sensing Image Captioning Based on Knowledge Embedding,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 18, pp. 4286–4304, 2024. https://doi.org/10.1109/JSTARS.2024.3523944
  7. S. Das and R. Sharma, “A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning,” IEEE Geosci. Remote Sens. Lett., pp. 1–6, 2024. https://doi.org/10.1109/LGRS.2024.3523134
  8. Q. Lin, S. Wang, X. Ye, R. Wang, R. Yang, and L. Jiao, “CLIP-based Grid Features and Masking for Remote Sensing Image Captioning,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 18, pp. 2631–2642, 2024. https://doi.org/10.1109/JSTARS.2024.3510414
  9. Y. Yang, T. Liu, Y. Pu, L. Liu, Q. Zhao, and Q. Wan, “Multi-Attentive Network with Diffusion Model,” Remote Sens., vol. 16, no. 21, pp. 1–18, 2024. https://doi.org/10.3390/rs16214083
  10. X. Zhang, J. Shen, Y. Wang, J. Xiao, and J. Li, “Zero-Shot Image Caption Inference System Based on Pretrained Models,” Electron., vol. 13, no. 19, 2024. https://doi.org/10.3390/electronics13193854
  11. P. S. Sherly and P. Velvizhy, “‘Idol talks!’ AI-driven image to text to speech: illustrated by an application to images of deities,” Herit. Sci., vol. 12, no. 1, pp. 1–21, 2024. https://doi.org/10.1186/s40494-024-01490-0
  12. L. Yu, M. Nikandrou, J. Jin, and V. Rieser, “Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment,” IJCAI Int. Jt. Conf. Artif. Intell., vol. 2023-August, pp. 6281–6289, 2023. https://doi.org/10.24963/ijcai.2023/697
  13. Y. Li, Y. Ma, Y. Zhou, and X. Yu, “Semantic-Guided Selective Representation for Image Captioning,” IEEE Access, vol. 11, no. December 2022, pp. 14500–14510, 2023. https://doi.org/10.1109/ACCESS.2023.3243952
  14. M. Alansari, K. Alnuaimi, S. Alansari, and S. Javed, “ELTrack: Events-Language Description for Visual Object Tracking,” IEEE Access, vol. 13, no. December 2024, pp. 31351–31367, 2025. https://doi.org/10.1109/ACCESS.2025.3540445
  15. F. Kalantari, K. Faez, H. Amindavar, and S. Nazari, “Improved image reconstruction from brain activity through automatic image captioning,” Sci. Rep., vol. 15, no. 1, pp. 1–17, 2025. https://doi.org/10.1038/s41598-025-89242-3
  16. Y. Qin, S. Ding, and H. Xie, “Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook,” IEEE Access, vol. PP, p. 1, 2025. https://doi.org/10.1109/ACCESS.2025.3541194
  17. A. Masud, M. B. Hosen, M. Habibullah, M. Anannya, and M. S. Kaiser, “Image captioning in Bengali language using visual attention,” PLoS One, vol. 20, no. 2 February, pp. 1–15, 2025. https://doi.org/10.1371/journal.pone.0309364
  18. B. Patra and D. R. Kisku, “Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders,” Turkish J. Eng., vol. 9, no. 1, pp. 64–78, 2025. https://doi.org/10.31127/tuje.1507442
  19. Y. Tang, Y. Yuan, F. Tao, and M. Tang, “Cross-modal Augmented Transformer for Automated Medical Report Generation,” IEEE J. Transl. Eng. Heal. Med., vol. 13, no. December 2024, pp. 33–48, 2025. https://doi.org/10.1109/JTEHM.2025.3536441
  20. Y. Zhang, J. Tong, and H. Liu, “SCAP: enhancing image captioning through lightweight feature sifting and hierarchical decoding,” Vis. Comput., pp. 0–26, 2025. https://doi.org/10.1007/s00371-025-03824-w
  21. F. Zhao, Z. Yu, T. Wang, and Y. Lv, “Image Captioning Based on Semantic Scenes,” Entropy, vol. 26, no. 10, pp. 1–20, 2024. https://doi.org/10.3390/e26100876
  22. N. Shetty and Y. Li, “Detailed Image Captioning and Hashtag Generation,” Futur. Internet, vol. 16, no. 12, 2024. https://doi.org/10.3390/fi16120444
  23. A. A. E. Osman, M. A. W. Shalaby, M. M. Soliman, and K. M. Elsayed, “Novel concept-based image captioning models using LSTM and multi-encoder transformer architecture,” Sci. Rep., vol. 14, no. 1, pp. 1–15, 2024. https://doi.org/10.1038/s41598-024-69664-1
  24. A. Zheng, S. Zheng, C. Bai, and D. Chen, “Triple-level relationship enhanced transformer for image captioning,” Multimed. Syst., vol. 29, no. 4, pp. 1955–1966, 2023. https://doi.org/10.1007/s00530-023-01073-2
  25. Y. Pan, T. Yao, Y. Li, and T. Mei, “X-Linear Attention Networks for Image Captioning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 10968–10977, 2020. https://doi.org/10.1109/CVPR42600.2020.01098
  26. Y. Jung, I. Cho, S. H. Hsu, and M. Golparvar-Fard, “VISUALSITEDIARY: A detector-free Vision-Language Transformer model for captioning photologs for daily construction reporting and image retrievals,” Autom. Constr., vol. 165, no. May, p. 105483, 2024. https://doi.org/10.1016/j.autcon.2024.105483
  27. J. Chen, H. Guo, K. Yi, B. Li, and M. Elhoseiny, “VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2022-June, pp. 18009–18019, 2022. https://doi.org/10.1109/CVPR52688.2022.01750
  28. Y. Zhou, Y. Zhang, Z. Hu, and M. Wang, “Semi-Autoregressive Transformer for Image Captioning,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2021-October, pp. 3132–3136, 2021. https://doi.org/10.1109/ICCVW54120.2021.00350
  29. J. H. Wang, M. Norouzi, and S. M. Tsai, “Augmenting Multimodal Content Representation with Transformers for Misinformation Detection †,” Big Data Cogn. Comput., vol. 8, no. 10, 2024. https://doi.org/10.3390/bdcc8100134
  30. S. Gautam et al., “Kvasir-VQA: A Text-Image Pair GI Tract Dataset,” arXiv Prepr. arXiv2409.01437, 2024. https://doi.org/10.1145/3689096.3689458
  31. Z. Li, D. Liu, H. Wang, C. Zhang, and W. Cai, “Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation,” 2023. https://doi.org/10.1145/3696409.3700223
  32. K. Y. Cheng, M. Lange-Hegermann, J. B. Hövener, and B. Schreiweis, “Instance-level medical image classification for text-based retrieval in a medical data integration center,” Comput. Struct. Biotechnol. J., vol. 24, no. February, pp. 434–450, 2024. https://doi.org/10.1016/j.csbj.2024.06.006
  33. X. Guo, X. Di Liu, and J. Jiang, “A Scene Graph Generation Method for Historical District Street-view Imagery: A Case Study in Beijing, China,” Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. - ISPRS Arch., vol. 48, no. 3, pp. 209–216, 2024. https://doi.org/10.5194/isprs-archives-XLVIII-3-2024-209-2024
  34. E. K. Holden and K. Korovin, “Graph sequence learning for premise selection,” J. Symb. Comput., vol. 128, p. 102376, 2025. https://doi.org/10.1016/j.jsc.2024.102376
  35. S. Fayou, H. C. Ngo, Y. W. Sek, and Z. Meng, “Clustering swap prediction for image-text pre-training,” Sci. Rep., vol. 14, no. 1, pp. 1–16, 2024. https://doi.org/10.1038/s41598-024-60832-x
  36. A. Sebaq and M. ElHelw, “RSDiff: remote sensing image generation from text using diffusion model,” Neural Comput. Appl., vol. 36, no. 36, pp. 23103–23111, 2024. https://doi.org/10.1007/s00521-024-10363-3
  37. H. Senior, G. Slabaugh, S. Yuan, and L. Rossi, “Graph neural networks in vision-language image understanding: a survey,” Vis. Comput., vol. 41, no. 1, pp. 491–516, 2024. https://doi.org/10.1007/s00371-024-03343-0
  38. W. Hu, F. Zhang, and Y. Zhao, “Thangka image captioning model with Salient Attention and Local Interaction Aggregator,” Herit. Sci., vol. 12, no. 1, pp. 1–21, 2024. https://doi.org/10.1186/s40494-024-01518-5
  39. F. Zhao, Z. Yu, T. Wang, and H. Zhao, “Meshed Context-Aware Beam Search for Image Captioning,” Entropy, vol. 26, no. 10, pp. 1–22, 2024. https://doi.org/10.3390/e26100866
  40. P. Sloan, P. Clatworthy, E. Simpson, and M. Mirmehdi, “Automated Radiology Report Generation: A Review of Recent Advances,” IEEE Rev. Biomed. Eng., vol. XX, no. Xx, pp. 1–24, 2024. https://doi.org/10.1109/RBME.2024.3408456
  41. M. J. Page et al., “The PRISMA 2020 statement: An updated guideline for reporting systematic reviews,” BMJ, vol. 372, 2021. https://doi.org/10.1136/bmj.n71
  42. M. L. Rethlefsen et al., “PRISMA-S: an extension to the PRISMA Statement for Reporting Literature Searches in Systematic Reviews,” Syst. Rev., vol. 10, no. 1, pp. 1–19, 2021. https://doi.org/10.1186/s13643-020-01542-z
  43. N. R. Haddaway, M. J. Page, C. C. Pritchard, and L. A. McGuinness, “PRISMA2020: An R package and Shiny app for producing PRISMA 2020-compliant flow diagrams, with interactivity for optimised digital transparency and Open Synthesis,” Campbell Syst. Rev., vol. 18, no. 2, pp. 1–12, 2022. https://doi.org/10.1002/cl2.1230
  44. K. Ghosal, A. Rana, and A. Smolic, “Aesthetic image captioning from weakly-labelled photographs,” Proc. 2019 Int. Conf. Comput. Vis. Workshops (ICCVW 2019), pp. 4550–4560, 2019. https://doi.org/10.1109/ICCVW.2019.00556
  45. M. Zhang, Y. Yang, H. Zhang, Y. Ji, H. T. Shen, and T.-S. Chua, “More is better: Precise and detailed image captioning using online positive recall and missing concepts mining,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 32–44, 2019. https://doi.org/10.1109/TIP.2018.2855415
  46. L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2017), pp. 6298–6306, 2017. https://doi.org/10.1109/CVPR.2017.667
  47. N. Hu, Y. Ming, C. Fan, F. Feng, et al., “TSFNet: Triple-Steam Image Captioning,” IEEE Trans. Multimed., vol. 25, pp. 6904–6916, 2023. https://doi.org/10.1109/TMM.2022.3215861
  48. N. Xu, H. Zhang, A.-A. Liu, W. Nie, Y. Su, J. Nie, et al., “Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning,” IEEE Trans. Multimed., vol. 22, no. 5, pp. 1372–1383, 2020. https://doi.org/10.1109/TMM.2019.2941820
  49. S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2017), pp. 1179–1195, 2017. https://doi.org/10.1109/CVPR.2017.131
  50. S. Cao, G. An, Z. Zheng, and Z. Wang, “Vision-Enhanced and Consensus-Aware Transformer for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 10, pp. 7005–7018, 2022. https://doi.org/10.1109/TCSVT.2022.3178844
  51. J. Zhang, Y. Xie, W. Ding, and Z. Wang, “Cross on Cross Attention: Deep Fusion Transformer for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 4257–4268, 2023. https://doi.org/10.1109/TCSVT.2023.3243725
  52. Z. Song, Z. Hu, Y. Zhou, Y. Zhao, R. Hong, and M. Wang, “Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning,” IEEE Trans. Multimed., vol. 26, pp. 9008–9020, 2024. https://doi.org/10.1109/TMM.2024.3384678
  53. A.-A. Liu, Q. Wu, N. Xu, H. Tian, et al., “Enriched Image Captioning based on Knowledge Divergence and Focus,” IEEE Trans. Circuits Syst. Video Technol., 2025. https://doi.org/10.1109/TCSVT.2024.3525158
  54. J. B. Alayrac et al., “Flamingo: a Visual Language Model for Few-Shot Learning,” Adv. Neural Inf. Process. Syst., vol. 35, no. NeurIPS, 2022.
  55. J. Wang et al., “GIT: A Generative Image-to-text Transformer for Vision and Language,” vol. 2, pp. 1–49, 2022, [Online]. Available: http://arxiv.org/abs/2205.14100
  56. H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” Adv. Neural Inf. Process. Syst., vol. 36, no. NeurIPS, pp. 1–25, 2023.
  57. E. J. Bassey, J. H. Cheng, and D. W. Sun, “Enhancing infrared drying of red dragon fruit by novel and innovative thermoultrasound and microwave-mediated freeze-thaw pretreatments,” LWT, vol. 202, p. 116225, 2024. https://doi.org/10.1016/j.lwt.2024.116225


