Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis
Corresponding Author(s) : Faisal Rahutomo
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol 3, No 4, November 2018
Abstract
This paper describes the academic basis of an openly available Indonesian dataset published on Mendeley Data with DOI: 10.17632/d7vx5cc92y.1 [1]. The dataset is an Indonesian-language expansion of the Microsoft Research Video Description Corpus, an open dataset that contains about 120 thousand sentences. The corpus is a useful resource because its sentences form a set of roughly parallel descriptions of more than 2,000 video snippets in 35 languages. Both paraphrase and bilingual relations are available, but the corpus has no Indonesian descriptions. This paper therefore describes the research effort to expand the dataset to the Indonesian language. The research collected 43,753 description texts for 1,959 short videos, parallel with Microsoft’s dataset. To add further value to the dataset, similarity metrics were calculated over the texts. The metrics were Cosine, Jaccard, Euclidean, and Manhattan, with average results of 0.22, 0.33, 2.38, and 6.08, respectively.
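The four metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the texts were tokenized or vectorized, so the lowercase whitespace tokenization and raw term-count vectors below are assumptions, and the function names are hypothetical. Note that Cosine and Jaccard are similarities (higher means more alike), while Euclidean and Manhattan are distances (lower means more alike).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's
    # actual preprocessing is not stated in the abstract.
    return text.lower().split()


def similarity_metrics(text_a, text_b):
    """Compare two descriptions with the four metrics from the abstract."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)

    # Assumption: raw term-count vectors over the shared vocabulary.
    vocab = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[w] for w in vocab]
    vec_b = [counts_b[w] for w in vocab]

    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Jaccard similarity: shared distinct tokens over all distinct tokens.
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 0.0

    # Euclidean (L2) and Manhattan (L1) distances between the count vectors.
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    manhattan = sum(abs(a - b) for a, b in zip(vec_a, vec_b))

    return {
        "cosine": cosine,
        "jaccard": jaccard,
        "euclidean": euclidean,
        "manhattan": manhattan,
    }


if __name__ == "__main__":
    # Two made-up Indonesian descriptions of the same hypothetical clip.
    result = similarity_metrics(
        "seorang pria bermain gitar",
        "seorang pria memainkan gitar",
    )
    for name, value in result.items():
        print(f"{name}: {value:.2f}")
```

For the two example sentences, which share three of five distinct words, this sketch gives a Cosine of 0.75, a Jaccard of 0.60, a Euclidean distance of about 1.41, and a Manhattan distance of 2. The averages reported in the abstract would correspond to such values aggregated over many description pairs, under whatever weighting scheme the authors actually used.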
References
[1] F. Rahutomo and A. H. Ayatullah, “Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis,” 2018. [Online]. Available: https://data.mendeley.com/datasets/d7vx5cc92y/1.
[2] M. D. Harris, Introduction to Natural Language Processing. Reston, VA, USA: Reston Publishing Co., 1985.
[3] A. Kao and S. R. Poteet, Natural Language Processing and Text Mining. Springer Publishing Company, Incorporated, 2006.
[4] C. Goutte, N. Cancedda, M. Dymetman, and G. Foster, Learning Machine Translation. The MIT Press, 2009.
[5] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology Behind Search, 2nd ed. USA: Addison-Wesley Publishing Company, 2008.
[6] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2009.
[7] D. L. Chen and W. B. Dolan, “Collecting Highly Parallel Data for Paraphrase Evaluation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, 2011, pp. 190–200.
[8] F. Rahutomo, Y. Manabe, T. Kitasuka, and M. Aritsugi, “Econo-ESA reduction scheme and the impact of its index matrix density,” in Procedia Computer Science, 2014, vol. 35, no. C.
[9] F. Rahutomo and M. Aritsugi, “Econo-ESA in semantic text similarity,” SpringerPlus, vol. 3, no. 1, 2014.
[10] F. Rahutomo and E. Rohadi, “Pengembangan Piranti Penelitian Sistem Temu Kembali Informasi Bahasa Indonesia,” in Seminar Nasional Sistem Informasi Indonesia (SESINDO), 2015, pp. 313–319.
[11] D. L. Chen and W. B. Dolan, “Youtube clips,” 2011. [Online]. Available: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar.
[12] N. Riza Akbar, R. Faisal, and H. Budi, “Pengembangan Data Uji Sistem Komputasi Kemiripan Teks Secara Semantik Berbahasa Indonesia,” in Seminar Informatika Aplikatif Polinema, 2016.