Issue

Vol. 8, No. 4, November 2023

Issue Published : Nov 30, 2023
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification

https://doi.org/10.22219/kinetik.v8i4.1793
Christian Sri Kusuma Aditya
Universitas Muhammadiyah Malang
Fauzi Dwi Setiawan Sumadi
Universitas Muhammadiyah Malang

Corresponding Author(s) : Christian Sri Kusuma Aditya

christianskaditya@umm.ac.id

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, Vol. 8, No. 4, November 2023
Article Published : Nov 30, 2023

Abstract

A text retrieval system needs a method that returns highly relevant documents in response to a user request. Term weighting is one of the key stages of text representation. Term Frequency (TF) counts the occurrences of a word in each document, while Inverse Document Frequency (IDF) accounts for how widely the word is spread across the document collection. However, TF-IDF weighting cannot represent how words are distributed across documents that belong to many classes or categories. The more unevenly a word is distributed across categories, the more important it should be as a feature. This study develops a new term weighting method that weights terms by their frequency of occurrence in each class and integrates this with a centroid-based term distribution, which can minimize intra-cluster similarity and maximize inter-cluster variance. Applied to SVM modeling on a dataset of 931 online news documents, the resulting ICF.TDCB term weighting gave the best results: the SVM reached an accuracy of 0.723, outperforming the other term weightings tested (TF.IDF, ICF, and TDCB).
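
To make the contrast between document-level and class-level weighting concrete, the following minimal sketch compares IDF with an inverse class frequency (ICF). It assumes the commonly used forms idf = log(N / df) and icf = log(|C| / cf); the abstract does not state the paper's exact formulas, and the centroid-based TDCB component is not sketched here. The toy corpus and the function names are hypothetical.

```python
import math

# Hypothetical toy corpus of (tokens, class label) pairs; not the paper's dataset.
docs = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "match"], "sport"),
    (["vote", "party", "team"], "politics"),
    (["vote", "election", "party"], "politics"),
]


def idf(term, docs):
    """Assumed form log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df)


def icf(term, docs):
    """Assumed form log(|C| / cf), where cf is the number of classes containing the term."""
    all_classes = {label for _, label in docs}
    classes_with_term = {label for tokens, label in docs if term in tokens}
    return math.log(len(all_classes) / len(classes_with_term))


for term in ("team", "goal"):
    print(term, round(idf(term, docs), 3), round(icf(term, docs), 3))
```

In this toy corpus, "team" occurs in both classes, so its ICF is zero even though its IDF is still positive, while "goal", which occurs in one class only, keeps a positive weight under both measures.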

Keywords

Term Weighting, TF-IDF, ICF, Term Distribution, Centroid, Text
Sri Kusuma Aditya, C., & Sumadi, F. D. S. (2023). Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification. Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control, 8(4). https://doi.org/10.22219/kinetik.v8i4.1793

Author Biography

Christian Sri Kusuma Aditya, Universitas Muhammadiyah Malang

Scopus profile: https://www.scopus.com/authid/detail.uri?authorId=57211342456

Google Scholar profile: https://scholar.google.co.id/citations?hl=id&user=vCgGD8sAAAAJ

KINETIK: Game Technology, Information System, Computer Network, Computing, Electronics, and Control
eISSN : 2503-2267
pISSN : 2503-2259


© 2020 KINETIK, All rights reserved. This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License