The Effect of Stemming and Removal of Stopwords on the Accuracy of Sentiment Analysis on Indonesian-language Texts
Corresponding Author(s): Aditya Wiha Pradana
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol 4, No 4, November 2019
Abstract
Preprocessing is an essential step in sentiment analysis because text data is noisy and unstructured. Stemming and stopword removal are both popular preprocessing techniques for text classification. However, prior research reports conflicting results on how the two methods affect the accuracy of sentiment classification. This paper therefore investigates the effect of stemming and stopword removal on Indonesian-language sentiment analysis. We compare four preprocessing conditions: using both stemming and stopword removal, without stemming, without stopword removal, and without either. A Support Vector Machine was used as the classification algorithm, with TF-IDF as the term-weighting scheme. The results were evaluated using a confusion matrix and k-fold cross-validation. The experimental results show that accuracy did not improve, and tended to decrease, when stemming or stopword removal was applied. This work concludes that applying stemming and stopword removal does not significantly affect the accuracy of sentiment analysis on Indonesian text documents.
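As an illustration only, the four preprocessing conditions and the TF-IDF weighting mentioned in the abstract could be sketched roughly as follows. This is not the authors' implementation. The stopword set, the suffix list, and the helper names (`toy_stem`, `preprocess`, `tfidf`, `vocabulary_sizes`) are hypothetical stand-ins, and the two sample sentences are invented. The sketch reads "without stemming" as stopword removal only and "without stopword removal" as stemming only, which is an assumption. The weight used is tf * log(N / df), one common TF-IDF variant; the abstract does not say which formula the paper uses. The SVM classifier and k-fold cross-validation are left out.

```python
import math
from collections import Counter

# Hypothetical toy resources: a real study would use a full Indonesian
# stopword list and a proper stemmer rather than these stand-ins.
STOPWORDS = {"yang", "dan", "di", "ini", "itu"}
SUFFIXES = ("nya", "kan", "an")  # crude suffix stripping, not a real stemmer


def toy_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stem, remove_stopwords):
    """Lowercase, split on whitespace, then apply the chosen options."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# The four conditions, expressed as (stem, remove_stopwords) flags.
CONDITIONS = {
    "stemming + stopword removal": (True, True),
    "stopword removal only": (False, True),
    "stemming only": (True, False),
    "neither": (False, False),
}

# Invented sample sentences, used only to exercise the functions.
corpus = ["Filmnya bagus dan menarik", "Pelayanan di sini buruk"]


def vocabulary_sizes(texts):
    """Count distinct weighted terms under each preprocessing condition."""
    sizes = {}
    for name, (stem, remove) in CONDITIONS.items():
        docs = [preprocess(t, stem, remove) for t in texts]
        vectors = tfidf(docs)
        sizes[name] = len({term for vec in vectors for term in vec})
    return sizes


if __name__ == "__main__":
    for name, size in vocabulary_sizes(corpus).items():
        print(f"{name}: {size} distinct terms")
```

Each condition changes only the feature space handed to the classifier, so in a full experiment the same weighting, classifier, and evaluation steps would be run once per condition and the accuracies compared.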