Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Farrikh Alzami, Erika Devi Udayanti, Dwi Puji Prabowo, Rama Aria Megantara


Polarity classification is an important form of sentiment analysis in everyday life: knowing whether a given document expresses positive or negative sentiment helps people choose and make decisions. Sentiment analysis is usually done manually, so an automatic classification process is needed. However, few studies discuss which feature extraction methods and learning models suit unstructured sentiment analysis for the Amazon food review case. This research explores several feature extraction methods (Bag of Words, TF-IDF, Word2Vec, and a combination of TF-IDF and Word2Vec) with several machine learning models (Random Forest, SVM, KNN, and Naïve Bayes) to find a combination of feature extraction and learning model that can add variety to polarity sentiment analysis. With document preprocessing that removes HTML tags, punctuation, and special characters and applies Snowball stemming, TF-IDF combined with SVM proved suitable for polarity classification in unstructured sentiment analysis for the Amazon food review case, with a performance of 87.3 percent.
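As a rough illustration of the preprocessing and weighting steps named in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names `clean` and `tf_idf`, the two sample reviews, and the particular TF-IDF variant (term frequency normalized by document length, `idf = log(N / df)`) are all assumptions made for illustration. Snowball stemming and the SVM classifier are omitted here because they would normally come from third-party libraries.

```python
import html
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")        # matches HTML tags such as <br />
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # matches anything but lowercase letters and whitespace


def clean(text):
    """Strip HTML tags, punctuation, and special characters; return lowercase tokens."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)          # turn entities like &amp; into characters
    text = NON_ALPHA_RE.sub(" ", text.lower())
    return text.split()


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical sample reviews, not taken from the Amazon food review dataset.
reviews = [
    "<br />Great taste, great price!!!",
    "Stale &amp; bland... not great.",
]
tokens = [clean(r) for r in reviews]
vectors = tf_idf(tokens)
```

In this sketch a term that occurs in every document, such as "great", receives a weight of zero, while terms unique to one review keep a positive weight. The resulting weight vectors are the kind of input a classifier such as an SVM would then be trained on.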


Keywords: unstructured sentiment analysis, polarity, TF-IDF, classification





Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.