
Issue

Vol. 10, No. 4, November 2025
Issue Published: Oct 16, 2025

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The Evolution of Image Captioning Models: Trends, Techniques, and Future Challenges

https://doi.org/10.22219/kinetik.v10i4.2305
Abrar Wahid Abrar (Majalengka University)
Ade Bastian (Majalengka University)
Zacky Hafsari (Majalengka University)
Ardi Mardiana (Majalengka University)

Corresponding Author: Abrar Wahid Abrar (221410088@unma.ac.id)

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, Vol. 10, No. 4, November 2025
Article Published: Oct 16, 2025


Abstract

This study provides a comprehensive systematic literature review (SLR) of the evolution of image captioning models from 2017 to 2025, with particular emphasis on pressing challenges, methodological enhancements, and significant architectural developments. Motivated by the increasing demand for precise and contextually aware image descriptions, the review adheres to the PRISMA methodology and selects 36 relevant papers from reputable scientific databases. The results indicate a significant transition from traditional CNN-RNN models to Transformer-based architectures, which improves semantic coherence and contextual comprehension. Recent methodologies, such as prompt engineering and GAN-based augmentation, have further facilitated generalization and diversity, while multimodal fusion solutions that incorporate attention mechanisms and knowledge integration have improved caption quality. Significant areas of concern remain, including data bias, equity in model assessment, and support for low-resource languages. The study underscores that modern vision-language models, such as Flamingo, GIT, and LLaVA, offer robust domain generalization through cross-modal learning and joint embedding, and that computational efficiency in resource-constrained environments is improved by advances in pretraining procedures and lightweight models. This study contributes by identifying future prospects, analyzing technical trade-offs, and delineating research trends, particularly in sectors such as healthcare, construction, and inclusive AI. The results suggest that, to maximize their efficacy in real-world applications, future image captioning models must prioritize resource efficiency, impartiality, and multilingual capabilities.
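
The shift this abstract highlights, from CNN-RNN pipelines to Transformer-based decoders, can be made concrete with a small sketch. The following PyTorch snippet is purely illustrative and is not code from the reviewed papers; the class name, backbone choice, and hyper-parameters are hypothetical placeholders for a CNN grid-feature encoder feeding an attention-based Transformer decoder.

# Illustrative sketch only (not from the reviewed article): a minimal
# CNN-encoder + Transformer-decoder captioner of the kind the review contrasts
# with earlier CNN-RNN pipelines. Names and hyper-parameters are placeholders.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3, max_len=30):
        super().__init__()
        # CNN encoder: a ResNet backbone yields a 7x7 grid of visual features
        # (load pretrained weights in practice; None keeps the sketch offline).
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])      # (B, 2048, 7, 7)
        self.proj = nn.Linear(2048, d_model)
        # Transformer decoder attends over the grid features to emit caption tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))      # learned positions
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images)                                       # (B, 2048, 7, 7)
        memory = self.proj(feats.flatten(2).transpose(1, 2))           # (B, 49, d_model)
        tgt = self.embed(captions) + self.pos[:, :captions.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)              # causal cross-modal decoding
        return self.out(hidden)                                        # (B, T, vocab_size) logits

# Example: two images and 12-token caption prefixes from a 10,000-word vocabulary.
model = CaptionTransformer(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))

A CNN-RNN baseline of the kind the review treats as the older paradigm would instead condition an LSTM on a single pooled feature vector; replacing that recurrence with cross-attention over grid features is the architectural change associated with the gains in semantic coherence described above.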

Keywords

Computational Efficiency; Image Captioning; Knowledge Integration; Systematic Literature Review; Vision-Language Models
Cite

Abrar, A. W., Bastian, A., Hafsari, Z., & Mardiana, A. (2025). The Evolution of Image Captioning Models: Trends, Techniques, and Future Challenges. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 10(4). https://doi.org/10.22219/kinetik.v10i4.2305

References
  1. H. T. Ho et al., “A Review on Vision-Language-Based Approaches: Challenges and Applications,” Comput. Mater. Contin., vol. 82, no. 2, pp. 1733–1756, 2025. https://doi.org/10.32604/cmc.2025.060363
  2. N. M. Khassaf and N. H. M. Ali, “Improving Pre-trained CNN-LSTM Models for Image Captioning with Hyper-Parameter Optimization,” Eng. Technol. Appl. Sci. Res., vol. 14, no. 5, pp. 17337–17343, 2024. https://doi.org/10.48084/etasr.8455
  3. S. Tyagi et al., “Novel Advance Image Caption Generation Utilizing Vision Transformer and Generative Adversarial Networks,” Computers, vol. 13, no. 12, 2024. https://doi.org/10.3390/computers13120305
  4. H. B. Duy et al., “A dental intraoral image dataset of gingivitis for image captioning,” Data Br., vol. 57, p. 110960, 2024. https://doi.org/10.1016/j.dib.2024.110960
  5. Y. Li, X. Zhang, T. Zhang, G. Wang, X. Wang, and S. Li, “A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning,” Remote Sens., vol. 16, no. 21, pp. 1–20, 2024. https://doi.org/10.3390/rs16213987
  6. K. Cheng, E. Cambria, J. Liu, Y. Chen, and Z. Wu, “KE-RSIC: Remote Sensing Image Captioning Based on Knowledge Embedding,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 18, pp. 4286–4304, 2024. https://doi.org/10.1109/JSTARS.2024.3523944
  7. S. Das and R. Sharma, “A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning,” IEEE Geosci. Remote Sens. Lett., pp. 1–6, 2024. https://doi.org/10.1109/LGRS.2024.3523134
  8. Q. Lin, S. Wang, X. Ye, R. Wang, R. Yang, and L. Jiao, “CLIP-based Grid Features and Masking for Remote Sensing Image Captioning,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 18, pp. 2631–2642, 2024. https://doi.org/10.1109/JSTARS.2024.3510414
  9. Y. Yang, T. Liu, Y. Pu, L. Liu, Q. Zhao, and Q. Wan, “Multi-Attentive Network with Diffusion Model,” Remote Sens., vol. 16, no. 21, pp. 1–18, 2024. https://doi.org/10.3390/rs16214083
  10. X. Zhang, J. Shen, Y. Wang, J. Xiao, and J. Li, “Zero-Shot Image Caption Inference System Based on Pretrained Models,” Electron., vol. 13, no. 19, 2024. https://doi.org/10.3390/electronics13193854
  11. P. S. Sherly and P. Velvizhy, “‘Idol talks!’ AI-driven image to text to speech: illustrated by an application to images of deities,” Herit. Sci., vol. 12, no. 1, pp. 1–21, 2024. https://doi.org/10.1186/s40494-024-01490-0
  12. L. Yu, M. Nikandrou, J. Jin, and V. Rieser, “Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment,” IJCAI Int. Jt. Conf. Artif. Intell., vol. 2023-August, pp. 6281–6289, 2023. https://doi.org/10.24963/ijcai.2023/697
  13. Y. Li, Y. Ma, Y. Zhou, and X. Yu, “Semantic-Guided Selective Representation for Image Captioning,” IEEE Access, vol. 11, no. December 2022, pp. 14500–14510, 2023. https://doi.org/10.1109/ACCESS.2023.3243952
  14. M. Alansari, K. Alnuaimi, S. Alansari, and S. Javed, “ELTrack: Events-Language Description for Visual Object Tracking,” IEEE Access, vol. 13, no. December 2024, pp. 31351–31367, 2025. https://doi.org/10.1109/ACCESS.2025.3540445
  15. F. Kalantari, K. Faez, H. Amindavar, and S. Nazari, “Improved image reconstruction from brain activity through automatic image captioning,” Sci. Rep., vol. 15, no. 1, pp. 1–17, 2025. https://doi.org/10.1038/s41598-025-89242-3
  16. Y. Qin, S. Ding, and H. Xie, “Advancements in Large-Scale Image and Text Representation Learning: A Comprehensive Review and Outlook,” IEEE Access, vol. PP, p. 1, 2025. https://doi.org/10.1109/ACCESS.2025.3541194
  17. A. Masud, M. B. Hosen, M. Habibullah, M. Anannya, and M. S. Kaiser, “Image captioning in Bengali language using visual attention,” PLoS One, vol. 20, no. 2 February, pp. 1–15, 2025. https://doi.org/10.1371/journal.pone.0309364
  18. B. Patra and D. R. Kisku, “Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders,” Turkish J. Eng., vol. 9, no. 1, pp. 64–78, 2025. https://doi.org/10.31127/tuje.1507442
  19. Y. Tang, Y. Yuan, F. Tao, and M. Tang, “Cross-modal Augmented Transformer for Automated Medical Report Generation,” IEEE J. Transl. Eng. Heal. Med., vol. 13, no. December 2024, pp. 33–48, 2025. https://doi.org/10.1109/JTEHM.2025.3536441
  20. Y. Zhang, J. Tong, and H. Liu, “SCAP: enhancing image captioning through lightweight feature sifting and hierarchical decoding,” Vis. Comput., pp. 0–26, 2025. https://doi.org/10.1007/s00371-025-03824-w
  21. F. Zhao, Z. Yu, T. Wang, and Y. Lv, “Image Captioning Based on Semantic Scenes,” Entropy, vol. 26, no. 10, pp. 1–20, 2024. https://doi.org/10.3390/e26100876
  22. N. Shetty and Y. Li, “Detailed Image Captioning and Hashtag Generation,” Futur. Internet, vol. 16, no. 12, 2024. https://doi.org/10.3390/fi16120444
  23. A. A. E. Osman, M. A. W. Shalaby, M. M. Soliman, and K. M. Elsayed, “Novel concept-based image captioning models using LSTM and multi-encoder transformer architecture,” Sci. Rep., vol. 14, no. 1, pp. 1–15, 2024. https://doi.org/10.1038/s41598-024-69664-1
  24. A. Zheng, S. Zheng, C. Bai, and D. Chen, “Triple-level relationship enhanced transformer for image captioning,” Multimed. Syst., vol. 29, no. 4, pp. 1955–1966, 2023. https://doi.org/10.1007/s00530-023-01073-2
  25. Y. Pan, T. Yao, Y. Li, and T. Mei, “X-Linear Attention Networks for Image Captioning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 10968–10977, 2020. https://doi.org/10.1109/CVPR42600.2020.01098
  26. Y. Jung, I. Cho, S. H. Hsu, and M. Golparvar-Fard, “VISUALSITEDIARY: A detector-free Vision-Language Transformer model for captioning photologs for daily construction reporting and image retrievals,” Autom. Constr., vol. 165, no. May, p. 105483, 2024. https://doi.org/10.1016/j.autcon.2024.105483
  27. J. Chen, H. Guo, K. Yi, B. Li, and M. Elhoseiny, “VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2022-June, pp. 18009–18019, 2022. https://doi.org/10.1109/CVPR52688.2022.01750
  28. Y. Zhou, Y. Zhang, Z. Hu, and M. Wang, “Semi-Autoregressive Transformer for Image Captioning,” Proc. IEEE Int. Conf. Comput. Vis., vol. 2021-October, pp. 3132–3136, 2021. https://doi.org/10.1109/ICCVW54120.2021.00350
  29. J. H. Wang, M. Norouzi, and S. M. Tsai, “Augmenting Multimodal Content Representation with Transformers for Misinformation Detection †,” Big Data Cogn. Comput., vol. 8, no. 10, 2024. https://doi.org/10.3390/bdcc8100134
  30. S. Gautam et al., “Kvasir-VQA: A Text-Image Pair GI Tract Dataset,” arXiv Prepr. arXiv2409.01437, 2024. https://doi.org/10.1145/3689096.3689458
  31. Z. Li, D. Liu, H. Wang, C. Zhang, and W. Cai, “Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation,” 2023. https://doi.org/10.1145/3696409.3700223
  32. K. Y. Cheng, M. Lange-Hegermann, J. B. Hövener, and B. Schreiweis, “Instance-level medical image classification for text-based retrieval in a medical data integration center,” Comput. Struct. Biotechnol. J., vol. 24, no. February, pp. 434–450, 2024. https://doi.org/10.1016/j.csbj.2024.06.006
  33. X. Guo, X. Di Liu, and J. Jiang, “A Scene Graph Generation Method for Historical District Street-view Imagery: A Case Study in Beijing, China,” Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. - ISPRS Arch., vol. 48, no. 3, pp. 209–216, 2024. https://doi.org/10.5194/isprs-archives-XLVIII-3-2024-209-2024
  34. E. K. Holden and K. Korovin, “Graph sequence learning for premise selection,” J. Symb. Comput., vol. 128, p. 102376, 2025. https://doi.org/10.1016/j.jsc.2024.102376
  35. S. Fayou, H. C. Ngo, Y. W. Sek, and Z. Meng, “Clustering swap prediction for image-text pre-training,” Sci. Rep., vol. 14, no. 1, pp. 1–16, 2024. https://doi.org/10.1038/s41598-024-60832-x
  36. A. Sebaq and M. ElHelw, “RSDiff: remote sensing image generation from text using diffusion model,” Neural Comput. Appl., vol. 36, no. 36, pp. 23103–23111, 2024. https://doi.org/10.1007/s00521-024-10363-3
  37. H. Senior, G. Slabaugh, S. Yuan, and L. Rossi, “Graph neural networks in vision-language image understanding: a survey,” Vis. Comput., vol. 41, no. 1, pp. 491–516, 2024. https://doi.org/10.1007/s00371-024-03343-0
  38. W. Hu, F. Zhang, and Y. Zhao, “Thangka image captioning model with Salient Attention and Local Interaction Aggregator,” Herit. Sci., vol. 12, no. 1, pp. 1–21, 2024. https://doi.org/10.1186/s40494-024-01518-5
  39. F. Zhao, Z. Yu, T. Wang, and H. Zhao, “Meshed Context-Aware Beam Search for Image Captioning,” Entropy, vol. 26, no. 10, pp. 1–22, 2024. https://doi.org/10.3390/e26100866
  40. P. Sloan, P. Clatworthy, E. Simpson, and M. Mirmehdi, “Automated Radiology Report Generation: A Review of Recent Advances,” IEEE Rev. Biomed. Eng., vol. XX, no. Xx, pp. 1–24, 2024. https://doi.org/10.1109/RBME.2024.3408456
  41. M. J. Page et al., “The PRISMA 2020 statement: An updated guideline for reporting systematic reviews,” BMJ, vol. 372, 2021. https://doi.org/10.1136/bmj.n71
  42. M. L. Rethlefsen et al., “PRISMA-S: an extension to the PRISMA Statement for Reporting Literature Searches in Systematic Reviews,” Syst. Rev., vol. 10, no. 1, pp. 1–19, 2021. https://doi.org/10.1186/s13643-020-01542-z
  43. N. R. Haddaway, M. J. Page, C. C. Pritchard, and L. A. McGuinness, “PRISMA2020: An R package and Shiny app for producing PRISMA 2020-compliant flow diagrams, with interactivity for optimised digital transparency and Open Synthesis,” Campbell Syst. Rev., vol. 18, no. 2, pp. 1–12, 2022. https://doi.org/10.1002/cl2.1230
  44. W. Z. Zhang. J, Zhang. K, Xie. Y, “Deep Reciprocal Learning for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., 2025. https://doi.org/10.1109/TCSVT.2025.3539344
  45. W. L. Liu A. A, Wu Q, Xu N, Tian H, “Enriched Image Captioning based on Knowledge Divergence and Focus,” IEEE Trans. Circuits Syst. Video Technol., 2025. https://doi.org/10.1109/TCSVT.2024.3525158
  46. W. M. Song. Z, Hu. Z, Zhou. Y, Zhao. Y, Hong. R, “Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning,” IEEE Trans. Multimed., vol. 26, pp. 9008–9020, 2024. https://doi.org/10.1109/TMM.2024.3384678
  47. M. Z. Li. J, Zhang. L, Zhang. K, Hu. B, Xie. H, “Cascade Semantic Prompt Alignment Network for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 7, pp. 5266–5281, 2024. https://doi.org/10.1109/TCSVT.2023.3343520
  48. C. Z. Shi. Y, Xia. J, Zhou. M, “A Dual-Feature-Based Adaptive Shared Transformer Network for Image Captioning,” IEEE Trans. Instrum. Meas., vol. 73, pp. 1–13, 2024. https://doi.org/10.1109/TIM.2024.3353830
  49. W. Z. Cao. S, An. G, Zheng. Z, “Vision-Enhanced and Consensus-Aware Transformer for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 10, pp. 7005–7018, 2022. https://doi.org/10.1109/TCSVT.2022.3178844
  50. L. B. Hu. N, Ming. Y, Fan. C, Feng. F, “TSFNet: Triple-Steam Image Captioning,” IEEE Trans. Multimed., vol. 25, pp. 6904–6916, 2023. https://doi.org/10.1109/TMM.2022.3215861
  51. W. M. Yuan. J, Zhu. S, Huang. S, Zhang. H, Xiao. Y, Li. Z, “Discriminative Style Learning for Cross-Domain Image Captioning,” IEEE Trans. Image Process., vol. 31, pp. 1723–1736, 2022. https://doi.org/10.1109/TIP.2022.3145158
  52. L. J. Zhao. W, Wu. X, “Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation,” IEEE Trans. Image Process., vol. 30, pp. 1180–1192, 2021. https://doi.org/10.1109/TIP.2020.3042086
  53. Z. J. Yu. N, Hu. X, Song. B, Yang. J, “Topic-Oriented Image Captioning Based on Order-Embedding,” IEEE Trans. Image Process., vol. 28, no. 6, pp. 2743–2754, 2019. https://doi.org/10.1109/TIP.2018.2889922
  54. W. Z. Zhang. J, Xie, Y, Ding. W, “Cross on Cross Attention: Deep Fusion Transformer for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 8, pp. 4257–4268, 2023. https://doi.org/10.1109/TCSVT.2023.3243725
  55. C. A. B. Wang. J, Xu. W, Wang. Q, “On Distinctive Image Captioning via Comparing and Reweighting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2088–2103, 2023. https://doi.org/10.1109/TPAMI.2022.3159811
  56. H. H. Jiang. W, Zhou. W, “Double-Stream Position Learning Transformer Network for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 11, pp. 7706–7718, 2022. https://doi.org/10.1109/TCSVT.2022.3181490
  57. L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 6298–6306, 2017. https://doi.org/10.1109/CVPR.2017.667
  58. Z. Y. Xu. N, Zhang. H, Liu. A.-A, Nie. W, Su. Y, Nie. J, “Multi-Level Policy and Reward-Based Deep Reinforcement Learning Framework for Image Captioning,” IEEE Trans. Multimed., vol. 22, no. 5, pp. 1372–1383, 2020. https://doi.org/10.1109/TMM.2019.2941820
  59. T. Yao, Y. Pan, Y. Li, and T. Mei, “Incorporating copying mechanism in image captioning for learning novel objects,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 5263–5271, 2017. https://doi.org/10.1109/CVPR.2017.559
  60. S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 1179–1195, 2017. https://doi.org/10.1109/CVPR.2017.131
  61. W. M. Wang. D, Hu. Z, Zhou. Y, Hong. R, “A Text-Guided Generation and Refinement Model for Image Captioning,” IEEE Trans. Multimed., vol. 25, pp. 2966–2977, 2023. https://doi.org/10.1109/TMM.2022.3154149
  62. A. S. . Al-Qatf. M, Wang. X, Hawbani. Am, Abdussalam. A, “Image Captioning With Novel Topics Guidance and Retrieval-Based Topics Re-Weighting,” IEEE Trans. Multimed., vol. 25, pp. 5984–5999, 2023. https://doi.org/10.1109/TMM.2022.3202690
  63. L. C. Yang. M, Liu. J, Shen. Y, Zhao. Z, Chen. X, Wu. Q, “An Ensemble of Generation-and Retrieval-Based Image Captioning with Dual Generator Generative Adversarial Network,” IEEE Trans. Image Process., vol. 29, pp. 9627–9640, 2020. https://doi.org/10.1109/TIP.2020.3028651
  64. X. Y. Huang. Y, Chen. J, Ouyang. W, Wan. W, “Image Captioning with End-to-End Attribute Detection and Subsequent Attributes Prediction,” IEEE Trans. Image Process., vol. 29, pp. 4013–4026, 2020. https://doi.org/10.1109/TIP.2020.2969330
  65. F. W. Zhou. L, Zhang. Y, Jiang. Y.-G, Zhang. T, “Re-Caption: Saliency-Enhanced Image Captioning through Two-Phase Learning,” IEEE Trans. Image Process., vol. 29, pp. 694–709, 2020. https://doi.org/10.1109/TIP.2019.2928144
  66. M. H. Xian. T, Li. Z, Tang. Z, “Adaptive Path Selection for Dynamic Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 9, pp. 5762–5775, 2022. https://doi.org/10.1109/TCSVT.2022.3155795
  67. W. Q. Wang. L, Li. H, Hu. W, Zhang. X, Qiu. H, Meng. F, “What Happens in Crowd Scenes: A New Dataset About Crowd Scenes for Image Captioning,” IEEE Trans. Multimed., vol. 25, pp. 5400–5412, 2023. https://doi.org/10.1109/TMM.2022.3192729
  68. L. Y. Jiang. W, Zhu. M, Fang. Y, Shi. G, Zhao. X, “Visual Cluster Grounding for Image Captioning,” IEEE Trans. Image Process., vol. 31, pp. 3920–3934, 2022. https://doi.org/10.1109/TIP.2022.3177318
  69. Z. Y. Wang. Y, Xu. N, Liu. A.-A, Li. W, “High-Order Interaction Learning for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 7, pp. 4417–4430, 2022. https://doi.org/10.1109/TCSVT.2021.3121062
  70. J. R. Ji. J, Ma. Y, Sun. X, Zhou. Y, Wu. Y, “Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning,” IEEE Trans. Image Process., vol. 31, pp. 4321–4335, 2022. https://doi.org/10.1109/TIP.2022.3183434
  71. H. Q. Yu. J, Li. J, Yu. Z, “Multimodal Transformer with Multi-View Visual Representation for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4467–4480, 2020. https://doi.org/10.1109/TCSVT.2019.2947482
  72. S. A. Ghosal. K, Rana. A, “Aesthetic image captioning from weakly-labelled photographs,” Proc. - 2019 Int. Conf. Comput. Vis. Work. ICCVW 2019, pp. 4550–4560, 2019. https://doi.org/10.1109/ICCVW.2019.00556
  73. C. T.-S. Zhang. M, Yang. Y, Zhang. H, Ji. Y, Shen. H.T, “More is better: Precise and detailed image captioning using online positive recall and missing concepts mining,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 32–44, 2019. https://doi.org/10.1109/TIP.2018.2855415
  74. Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, pp. 1151–1159, 2017. https://doi.org/10.1109/CVPR.2017.128
  75. C. C. Park, B. Kim, and G. Kim, “Attend to you: Personalized image captioning with Context Sequence Memory Networks,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 6432–6440, 2017. https://doi.org/10.1109/CVPR.2017.681
  76. Y. Wang, Z. Lin, X. Shen, S. Cohen, and G. W. Cottrell, “Skeleton key: Image captioning by skeleton-Attribute decomposition,” Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 7378–7387, 2017. https://doi.org/10.1109/CVPR.2017.780
  77. X. Yan, C, Hao, Y, Li, L, Yin, J, Liu, A, Mao, Z, Chen, Z, Gao, “Task-Adaptive Attention for Image Captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 1, pp. 43–51, 2022. https://doi.org/10.1109/TCSVT.2021.3067449
  78. C. F. Zhang. Z, Wu. Q, Wang. Y, “Exploring Pairwise Relationships Adaptively From Linguistic Context in Image Captioning,” IEEE Trans. Multimed., vol. 24, pp. 3101–3113, 2022. https://doi.org/10.1109/TMM.2021.3093725
  79. P. C. Xiao. X, Wang. L, Ding. K, Xiang. S, “Deep Hierarchical Encoder-Decoder Network for Image Captioning,” IEEE Trans. Multimed., vol. 21, no. 11, pp. 2942–2956, 2019. https://doi.org/10.1109/TMM.2019.2915033


Meet Our Editorial Team

Ir. Amrul Faruq, M.Eng., Ph.D
Editor in Chief
Universitas Muhammadiyah Malang
Google Scholar Scopus
Agus Eko Minarno
Editorial Board
Universitas Muhammadiyah Malang
Google Scholar  Scopus
Hanung Adi Nugroho
Editorial Board
Universitas Gadjah Mada
Google Scholar Scopus
Roman Voliansky
Editorial Board
Dniprovsky State Technical University, Ukraine
Google Scholar Scopus
Read More
 
