This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
A Study on Visual Understanding Image Captioning using Different Word Embeddings and CNN-Based Feature Extractions
Corresponding Author(s): Dhomas Hatta Fudholi
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 7, No. 1, February 2022
Abstract
Image captioning is the task of describing an image in natural language. It can serve a variety of applications, such as image indexing and virtual assistants. In this research, we compared the performance of three word embeddings (GloVe, Word2Vec, and FastText) and six CNN-based feature extraction architectures (Inception V3, InceptionResNet V2, ResNet152 V2, EfficientNet B3 V1, EfficientNet B7 V1, and NASNetLarge), each combined with an LSTM decoder, for image captioning. We developed the models using images of ten household objects (bed, cell phone, chair, couch, oven, potted plant, refrigerator, sink, table, and tv) taken from the MS COCO dataset. For each selected image, we then created five new captions in Bahasa Indonesia; the captions may describe the name, location, color, size, and characteristics of an object and its surrounding area. Each of our 18 experimental models was trained with a different combination of word embedding and CNN-based feature extraction architecture, together with an LSTM. Based on the BLEU-4 metric, the model combining Word2Vec with NASNetLarge generated better Indonesian captions than the other models.
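The abstract does not include code, so the following is only a minimal sketch, assuming a conventional CNN encoder + LSTM decoder ("merge") setup in TensorFlow/Keras, of how one of the 18 combinations (NASNetLarge features, frozen Word2Vec embeddings, LSTM decoder) could be wired. All layer sizes, the 300-dimensional embedding, vocabulary size, and caption length are illustrative assumptions, not values reported in the paper; image features are assumed to be precomputed with the frozen CNN.

```python
# Minimal sketch (not the authors' code) of a NASNetLarge + Word2Vec + LSTM
# captioning model. Hyperparameters below are assumptions for illustration.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 5000   # assumed size of the Indonesian caption vocabulary
MAX_LEN = 30        # assumed maximum caption length (in tokens)
EMBED_DIM = 300     # common Word2Vec dimensionality (assumption)

# 1) CNN encoder: NASNetLarge pretrained on ImageNet, used as a frozen
#    feature extractor; global average pooling yields one vector per image.
cnn = tf.keras.applications.NASNetLarge(
    weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False  # features would be precomputed with cnn.predict(...)

# 2) Embedding matrix: in practice filled from pretrained Word2Vec vectors
#    (e.g. a gensim KeyedVectors model); random values here as a placeholder.
embedding_matrix = np.random.normal(
    size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")

# Image branch: project the pooled CNN feature vector into the decoder space.
image_feat = layers.Input(shape=(cnn.output_shape[-1],), name="image_features")
img_dense = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(image_feat))

# Text branch: partial caption tokens -> frozen Word2Vec embeddings -> LSTM.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_tokens")
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                       weights=[embedding_matrix], trainable=False,
                       mask_zero=True)(caption_in)
lstm_out = layers.LSTM(256)(layers.Dropout(0.5)(emb))

# Merge both modalities and predict the next word of the caption.
merged = layers.add([img_dense, lstm_out])
merged = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

caption_model = Model(inputs=[image_feat, caption_in], outputs=next_word)
caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```

Swapping GloVe or FastText vectors into the embedding matrix, or another backbone (e.g. tf.keras.applications.InceptionV3) into the encoder, gives the other combinations. Generated captions could then be scored against the five reference captions per image with BLEU-4, e.g. nltk.translate.bleu_score.corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)).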
Keywords