
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
A Data-Driven Framework Integrating Clustering and Classification for Fair Tuition Grouping (UKT) Prediction
Corresponding Author(s): Windy Chikita Cornia Putri
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 11, No. 3, August 2026 (Article in Progress)
Abstract
This study identifies the most effective combination of feature selection technique and classification algorithm for predicting student tuition groups (Uang Kuliah Tunggal, UKT) from pre-admission data. Three feature selection methods, Exploratory Factor Analysis (EFA), Recursive Feature Elimination (RFE), and Random Forest Feature Importance (RFFI), were each combined with five supervised learning models: Decision Tree, Random Forest, Support Vector Machine (SVM) with RBF kernel, Naïve Bayes, and K-Nearest Neighbor (KNN). The EFA–SVM (RBF) combination performed best, with an average accuracy exceeding 98%, and outperformed the other models in most faculties. EFA also yielded the highest Silhouette Score (0.2933, versus 0.2564 for RFE and 0.2575 for RFFI), indicating a more stable and distinct cluster structure than the other two methods produced. These findings highlight the role of appropriate feature selection in improving classification accuracy and model generalization, particularly when the selected features emphasize socioeconomic variables such as parental income, land area, housing conditions, and basic family facilities. Integrating factor-based dimensionality reduction with a non-linear classifier proved effective for building a more transparent and equitable UKT prediction model. This research contributes to data-driven decision support systems in higher education and provides a foundation for future automation of tuition group determination.
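For readers unfamiliar with the Silhouette Score reported above, the following minimal sketch shows how the metric is computed. It is an illustration only, not the authors' implementation: the function name, the choice of Euclidean distance, and the two toy datasets are assumptions, and the data are hypothetical rather than drawn from the study.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance from p to the other members of its own cluster
        # (dist(p, p) is 0, so summing over the whole cluster is safe).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster.
        b = min(
            mean(dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: two well-separated clusters versus two adjacent ones.
labels = [0, 0, 1, 1]
tight_score = silhouette_score([(0, 0), (0, 1), (10, 0), (10, 1)], labels)
loose_score = silhouette_score([(0, 0), (0, 1), (1, 0), (1, 1)], labels)
print(f"well separated: {tight_score:.4f}, adjacent: {loose_score:.4f}")
```

The score ranges from -1 to 1, and higher values mean points sit closer to their own cluster than to the nearest other cluster. Values in the 0.25 to 0.30 range, like those reported in the abstract, indicate clusters that are distinguishable but overlap considerably, so the comparison between methods is more informative than any single value.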