This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Image Captioning using Hybrid of VGG16 and Bidirectional LSTM Model
Corresponding Author(s): Yufis Azhar
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 7, No. 4, November 2022
Abstract
Image captioning is one of the major challenges at the intersection of computer vision and natural language processing. Many studies have addressed the topic, but their evaluation scores remain relatively low, so this study focuses on improving on those results. We use the Flickr8k dataset and the VGG16 Convolutional Neural Network (CNN) as an encoder that extracts feature vectors from images, while a Recurrent Neural Network (RNN) based on the Bidirectional Long Short-Term Memory (BiLSTM) method serves as the decoder. The feature vectors produced by the encoder are forwarded to the BiLSTM decoder, which generates descriptions that match the visual content of the input image. The captions provide information on an object's name, location, color, size, and distinguishing features, as well as its surroundings. Captions are generated with a greedy search algorithm (argmax decoding) and with a beam search algorithm, and the outputs are evaluated with Bilingual Evaluation Understudy (BLEU) scores. The best result is achieved by the VGG16 model with Bidirectional LSTM using beam search with beam width K = 3, which obtains a BLEU-1 score of 0.60593, outperforming previous studies.
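The paper does not provide source code, but the pipeline described above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Keras implementation of a VGG16 encoder feeding a BiLSTM-based merge decoder with greedy (argmax) decoding; the layer sizes, the vocab_size and max_length values, the tokenizer, and the startseq/endseq tokens are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' code): VGG16 encoder + BiLSTM merge decoder
# with greedy (argmax) caption generation. Layer sizes and helper names
# (vocab_size, max_length, tokenizer, 'startseq'/'endseq') are assumptions.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     LSTM, Bidirectional, add)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Encoder: pretrained VGG16, with the fc2 layer (4096-d) used as the image feature.
vgg = VGG16(weights="imagenet")
encoder = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

vocab_size, max_length = 8000, 34          # assumed values for illustration

# Image branch: project the 4096-d VGG16 feature vector to 256 dimensions.
img_in = Input(shape=(4096,))
img_x = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and read it with a BiLSTM.
seq_in = Input(shape=(max_length,))
seq_x = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_x = Bidirectional(LSTM(128))(Dropout(0.5)(seq_x))  # 2 x 128 = 256-d output

# Merge both branches and predict the next word over the vocabulary.
merged = add([img_x, seq_x])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

caption_model = Model(inputs=[img_in, seq_in], outputs=out)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")

# Greedy (argmax) decoding: pick the most probable next word at every step.
def greedy_caption(photo_feature, tokenizer):
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = caption_model.predict([photo_feature, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()
```

Beam search with K = 3, which yields the best BLEU-1 score in the study, would replace the single argmax step with keeping the three most probable partial captions at each timestep, expanding each of them, and finally returning the highest-scoring complete caption.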