Issue

Vol. 7, No. 1, February 2022

Issue Published: Feb 28, 2022

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

A Study on Visual Understanding Image Captioning using Different Word Embeddings and CNN-Based Feature Extractions

https://doi.org/10.22219/kinetik.v7i1.1394
Dhomas Hatta Fudholi
Universitas Islam Indonesia
Annisa Zahra
Universitas Islam Indonesia
Royan Abida N. Nayoan
Universitas Islam Indonesia

Corresponding Author(s): Dhomas Hatta Fudholi

hatta.fudholi@uii.ac.id

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, Vol. 7, No. 1, February 2022
Article Published: Mar 10, 2022


Abstract

Image captioning is the task of producing a natural-language description of an image. It can be used in a variety of applications, such as image indexing and virtual assistants. In this research, we compared the performance of three word embeddings, namely GloVe, Word2Vec, and FastText, and six CNN-based feature extraction architectures, namely Inception V3, InceptionResNet V2, ResNet152 V2, EfficientNet B3 V1, EfficientNet B7 V1, and NASNetLarge, each combined with an LSTM decoder to perform image captioning. To develop the models, we used images of ten different household objects (bed, cell phone, chair, couch, oven, potted plant, refrigerator, sink, table, and tv) obtained from the MSCOCO dataset. We then created five new captions in Bahasa Indonesia for the selected images. The captions may describe the name, location, color, size, and other characteristics of an object and its surrounding area. Our 18 experimental models each used a different combination of word embedding and CNN-based feature extraction architecture, trained with an LSTM decoder. Based on the BLEU-4 metric, the model that combined Word2Vec with NASNetLarge generated Indonesian captions better than the other models.
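As an illustration of the encoder-decoder combination described in the abstract, the sketch below shows a minimal merge-style CNN + LSTM captioning model in Keras. It is not the authors' implementation: the feature dimension assumes NASNetLarge global-average-pooled features (4032-d), the 300-d embedding size matches common GloVe/Word2Vec/FastText vectors, and the vocabulary size, caption length, and hidden sizes are illustrative assumptions.

```python
# Minimal sketch (not the authors' exact code): a merge-style CNN + LSTM
# captioning model in Keras. Image features are assumed to be pre-extracted
# with NASNetLarge; captions are assumed to be tokenized Indonesian text.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000      # assumed vocabulary size
MAX_LEN = 30           # assumed maximum caption length (in tokens)
EMBED_DIM = 300        # GloVe/Word2Vec/FastText vectors are commonly 300-d
FEATURE_DIM = 4032     # NASNetLarge global-average-pooled feature size

# Encoder branch: project the CNN feature vector into the decoder's space.
image_input = layers.Input(shape=(FEATURE_DIM,), name="cnn_features")
img = layers.Dropout(0.5)(image_input)
img = layers.Dense(256, activation="relu")(img)

# Decoder branch: embed the partial caption and run it through an LSTM.
caption_input = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_input)
seq = layers.Dropout(0.5)(emb)
seq = layers.LSTM(256)(seq)

# Merge both branches and predict the next word of the caption.
merged = layers.add([img, seq])
merged = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = Model(inputs=[image_input, caption_input], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

In practice, the pretrained GloVe, Word2Vec, or FastText vectors would be loaded into the Embedding layer (for example, through a constant initializer) rather than learned from scratch, and swapping the CNN backbone only changes the feature dimension.

The BLEU-4 comparison mentioned at the end of the abstract can be reproduced with standard tooling; the snippet below uses NLTK with made-up Indonesian captions purely for illustration.

```python
# Hypothetical example: scoring one generated Indonesian caption against
# reference captions with BLEU-4 (uniform 1- to 4-gram weights) using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "sebuah kursi kayu berwarna coklat di ruang tamu".split(),
    "kursi coklat berada di dekat meja kecil".split(),
]
candidate = "sebuah kursi coklat di ruang tamu".split()

bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```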

Keywords

BLEU; Deep Learning; CNN; Image Captioning; LSTM; Word Embedding
How to Cite

Fudholi, D. H., Zahra, A., & Nayoan, R. A. N. (2022). A Study on Visual Understanding Image Captioning using Different Word Embeddings and CNN-Based Feature Extractions. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 7(1), 91-98. https://doi.org/10.22219/kinetik.v7i1.1394

Author Biography

Dhomas Hatta Fudholi, Universitas Islam Indonesia

Scopus Profile: https://www.scopus.com/authid/detail.uri?authorId=35786990500

Google Scholar Profile: https://scholar.google.co.id/citations?hl=en&user=TwBb_VAAAAAJ

