Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Farrikh Alzami; Erika Devi Udayanti; Dwi Puji Prabowo; Rama Aria Megantara

doi:10.22219/kinetik.v5i3.1066

Issue

Vol. 5, No. 3, August 2020

Issue Published : Aug 31, 2020


Farrikh Alzami

Universitas Dian Nuswantoro, Semarang

https://orcid.org/0000-0003-2669-3864

Erika Devi Udayanti

Universitas Dian Nuswantoro, Semarang

Dwi Puji Prabowo

Universitas Dian Nuswantoro, Semarang

Rama Aria Megantara

Universitas Dian Nuswantoro, Semarang

Corresponding Author(s) : Farrikh Alzami

alzami@dsn.dinus.ac.id

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, Vol. 5, No. 3, August 2020
Article Published : Aug 27, 2020

Abstract

Sentiment analysis in the form of polarity classification matters in everyday life: once a document's polarity is known, readers can tell whether it carries positive or negative sentiment, which helps them choose and make decisions. Sentiment analysis is usually done manually, so an automatic classification process is needed. However, few studies discuss which feature extraction methods and learning models suit unstructured sentiment analysis for the Amazon food review case. This research explores several feature extraction methods, namely Bag of Words, TF-IDF, Word2Vec, and a combination of TF-IDF and Word2Vec, together with several machine learning models, namely Random Forest, SVM, KNN and Naïve Bayes, to find a combination of feature extraction and learning model that can add variety to polarity sentiment analysis. With document preprocessing that removes HTML tags, punctuation and special characters and applies Snowball stemming, TF-IDF combined with SVM proves suitable for polarity classification in unstructured sentiment analysis of Amazon food reviews, with a performance of 87.3 percent.
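
The preprocessing and TF-IDF steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it uses only the Python standard library, the function names `preprocess` and `tfidf` and the two sample reviews are hypothetical, and the weighting shown (term frequency times log(N / document frequency)) is just one common TF-IDF variant. Stemming is left as a pluggable `stem` argument that defaults to no stemming; a Snowball stemmer such as NLTK's `SnowballStemmer("english").stem` could be passed in. The SVM classification step is omitted.

```python
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")        # matches HTML tags such as <p> or <br/>
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # matches anything except letters and whitespace


def preprocess(text, stem=lambda word: word):
    """Strip HTML tags, punctuation and special characters, then stem each token.

    `stem` is an assumed hook; by default tokens are returned unstemmed.
    """
    text = TAG_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text.lower())
    return [stem(token) for token in text.split()]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample reviews, for illustration only.
reviews = [
    "<p>Great taste!!! Would buy again.</p>",
    "Terrible taste, <br/>would not buy again :(",
]
tokens = [preprocess(review) for review in reviews]
vectors = tfidf(tokens)
```

In this sketch, terms that occur in every review (for example "taste") receive a weight of zero, while terms unique to one review (for example "great" or "terrible") receive positive weights, which is the property that makes TF-IDF features useful to a downstream polarity classifier.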

Keywords

Unstructured Sentiment Analysis; polarity; TF-IDF; classification

Alzami, F., Udayanti, E. D., Prabowo, D. P., & Megantara, R. A. (2020). Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 5(3), 235-242. https://doi.org/10.22219/kinetik.v5i3.1066



