This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification
Corresponding Author(s): Christian Sri Kusuma Aditya
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 8, No. 4, November 2023
Abstract
A text retrieval system needs a method that returns highly relevant documents in response to user requests. Term weighting is one of the most important stages of text representation. Term Frequency (TF) counts how often a word occurs in each document, while Inverse Document Frequency (IDF) accounts for how widely a word is distributed across the whole document collection. However, TF-IDF weighting cannot represent how words are distributed across documents that belong to many classes or categories. The more unevenly a word is distributed across categories, the more important that word should be as a feature. This study develops a new term weighting method, ICF.TDCB, in which terms are weighted by their frequency of occurrence in each class, integrated with a centroid-based term distribution that can minimize intra-cluster similarity and maximize inter-cluster variance. Applied to SVM modeling on a dataset of 931 online news documents, the ICF.TDCB term weighting method gave the best results: the SVM model reached an accuracy of 0.723, outperforming other term weightings such as TF.IDF, ICF, and TDCB.
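The limitation of IDF described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas used in the study, so the code below assumes common textbook forms, `idf = log(N / df)` and an inverse class frequency `icf = log(C / cf)`, where `cf` is the number of classes containing the term. The toy corpus and the function names `tf`, `idf`, and `icf` are hypothetical, and this is not the ICF.TDCB method itself.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, class label). Illustration only.
docs = [
    (["goal", "match"], "sport"),
    (["match", "report"], "sport"),
    (["vote", "report"], "politics"),
    (["vote", "party"], "politics"),
]

def tf(term, tokens):
    """Raw count of `term` in one document."""
    return Counter(tokens)[term]

def idf(term, docs):
    """Assumed form: log(N / df), df = number of documents containing the term."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def icf(term, docs):
    """Assumed form: log(C / cf), cf = number of classes containing the term."""
    classes = {label for _, label in docs}
    cf = len({label for tokens, label in docs if term in tokens})
    return math.log(len(classes) / cf) if cf else 0.0

for term in ("match", "report"):
    print(term, round(idf(term, docs), 3), round(icf(term, docs), 3))
```

In this toy corpus, "match" and "report" each occur in two documents, so they receive the same IDF. "match" occurs only in the sport class, while "report" occurs in both classes, so only the class-based weight separates them.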