This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Image Captioning using Hybrid of VGG16 and Bidirectional LSTM Model
Corresponding Author(s): Yufis Azhar
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 7, No. 4, November 2022
Abstract
Image captioning is one of the major challenges at the intersection of computer vision and natural language processing. Many studies have addressed the topic, but their evaluation scores remain relatively low, so this study focuses on improving on those results. We use the Flickr8k dataset and the VGG16 Convolutional Neural Network (CNN) as an encoder that extracts feature vectors from images, while a Recurrent Neural Network (RNN) based on the Bidirectional Long Short-Term Memory (BiLSTM) method serves as the decoder. The feature vectors produced by the encoder are forwarded to the BiLSTM decoder, which generates descriptions that match the visual content of the input image. The captions provide information on an object's name, location, color, size, and distinguishing features, as well as its surroundings. Captions are generated with a greedy search algorithm (argmax decoding) and with a beam search algorithm, and the outputs are evaluated with Bilingual Evaluation Understudy (BLEU) scores. The best result is achieved by the VGG16 model with Bidirectional LSTM using beam search with beam width K = 3, which obtains a BLEU-1 score of 0.60593, outperforming previous studies.
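The paper does not provide source code, but the pipeline described above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Keras implementation of a VGG16 encoder feeding a BiLSTM-based merge decoder with greedy (argmax) decoding; the layer sizes, the vocab_size and max_length values, the tokenizer, and the startseq/endseq tokens are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' code): VGG16 encoder + BiLSTM merge decoder
# with greedy (argmax) caption generation. Layer sizes and helper names
# (vocab_size, max_length, tokenizer, 'startseq'/'endseq') are assumptions.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     LSTM, Bidirectional, add)
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Encoder: pretrained VGG16, with the fc2 layer (4096-d) used as the image feature.
vgg = VGG16(weights="imagenet")
encoder = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

vocab_size, max_length = 8000, 34          # assumed values for illustration

# Image branch: project the 4096-d VGG16 feature vector to 256 dimensions.
img_in = Input(shape=(4096,))
img_x = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and read it with a BiLSTM.
seq_in = Input(shape=(max_length,))
seq_x = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_x = Bidirectional(LSTM(128))(Dropout(0.5)(seq_x))  # 2 x 128 = 256-d output

# Merge both branches and predict the next word over the vocabulary.
merged = add([img_x, seq_x])
out = Dense(vocab_size, activation="softmax")(Dense(256, activation="relu")(merged))

caption_model = Model(inputs=[img_in, seq_in], outputs=out)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")

# Greedy (argmax) decoding: pick the most probable next word at every step.
def greedy_caption(photo_feature, tokenizer):
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        probs = caption_model.predict([photo_feature, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()
```

Beam search with K = 3, which yields the best BLEU-1 score in the study, would replace the single argmax step with keeping the three most probable partial captions at each timestep, expanding each of them, and finally returning the highest-scoring complete caption.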