Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification

Christian Sri Kusuma Aditya; Fauzi Dwi Setiawan Sumadi

doi:10.22219/kinetik.v8i4.1793

Issue

Vol. 8, No. 4, November 2023

Issue Published : Nov 30, 2023

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Christian Sri Kusuma Aditya

Universitas Muhammadiyah Malang

Fauzi Dwi Setiawan Sumadi

Universitas Muhammadiyah Malang

Corresponding Author(s) : Christian Sri Kusuma Aditya

christianskaditya@umm.ac.id

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, Vol. 8, No. 4, November 2023
Article Published : Nov 30, 2023

Abstract

A text retrieval system requires a method that returns highly relevant documents in response to user requests. Term weighting is one of the most important stages of text representation. Term Frequency (TF) counts the occurrences of a word in each document, while Inverse Document Frequency (IDF) accounts for how widely the word is distributed across the document collection. However, TF-IDF weighting cannot represent how words are distributed across documents that belong to many classes or categories. The more unequal the distribution of a word across categories, the more important that word should be as a feature. This study develops a new term weighting method in which weights are based on the frequency of occurrence of terms in each class, integrated with a centroid-based term distribution that can minimize intra-cluster similarity and maximize inter-cluster variance. Applied to SVM modeling on a dataset of 931 online news documents, the ICF.TDCB term weighting method gave the best results: the SVM model reached an accuracy of 0.723, outperforming other term weightings such as TF.IDF, ICF, and TDCB.
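The contrast between document-level and class-level weighting can be illustrated with a minimal sketch. It assumes the common textbook forms IDF = log(N / df) and ICF = log(C / cf), where df is the number of documents and cf the number of classes containing a term. These forms, the function names, and the toy corpus are assumptions made for illustration only; the ICF.TDCB formula itself is not stated in this abstract and is not reproduced here.

```python
import math
from collections import Counter


def tf(term, tokens):
    """Raw term frequency: occurrences of `term` in one tokenized document."""
    return Counter(tokens)[term]


def idf(term, docs):
    """Assumed form log(N / df): N documents, df of them contain `term`."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def icf(term, docs):
    """Assumed form log(C / cf): C classes, cf of them contain `term`."""
    classes = {label for _, label in docs}
    cf = len({label for tokens, label in docs if term in tokens})
    return math.log(len(classes) / cf) if cf else 0.0


# Hypothetical toy corpus of (tokens, class label) pairs.
docs = [
    (["goal", "match", "team"], "sport"),
    (["goal", "team", "coach"], "sport"),
    (["election", "vote", "team"], "politics"),
    (["market", "stock", "vote"], "economy"),
]

for term in ("goal", "vote", "team"):
    print(term, round(idf(term, docs), 3), round(icf(term, docs), 3))
```

In this toy corpus "goal" and "vote" each occur in two documents, so their IDF is identical, yet "goal" is confined to one class while "vote" spans two. Only the class-level weight separates them, which is the gap in TF-IDF that the abstract describes.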

Keywords

Term Weighting; TF-IDF; ICF; Term Distribution; Centroid; Text

Sri Kusuma Aditya, C., & Sumadi, F. D. S. (2023). Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 8(4). https://doi.org/10.22219/kinetik.v8i4.1793

Author Biography

Christian Sri Kusuma Aditya, Universitas Muhammadiyah Malang

Scopus profile: https://www.scopus.com/authid/detail.uri?authorId=57211342456

Google Scholar profile: https://scholar.google.co.id/citations?hl=id&user=vCgGD8sAAAAJ
