Classification of Lexile Level Reading Load Using the K-Means Clustering and Random Forest Method

Harits Ar Rosyid, Utomo Pujianto, Moch Rajendra Yudhistira


Reading is one of many ways to improve the quality of a person's education, because it broadens insight and knowledge across a wide range of subjects. However, reading ability and comprehension differ from person to person, and a reader may struggle when the reading material exceeds their comprehension ability. It is therefore necessary to determine the load of reading material, here using Lexile Levels. A Lexile Level is a value that measures both the complexity of reading material and a person's reading ability. In this study, reading material is first grouped by its Lexile Level into two clusters, easy and difficult, using the k-means method. After clustering, the reading load of the material is classified using the Random Forest method. K-means was chosen because its computation is simple and fast. Random Forest is a method that builds several decision trees and combines their predictions into a final classification. The results indicate that the experiment scenario using 2 clusters with SMOTE and GIFS preprocessing gives good results, with an accuracy of 76.03%, precision of 81.85%, and recall of 76.05%.
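The two-stage pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Lexile values and the two text features (mean sentence length, mean word length) are invented for illustration, the forest is reduced to bagged one-split trees combined by majority vote, and the SMOTE and GIFS preprocessing steps are omitted.

```python
import random
from collections import Counter

def kmeans_1d(values, k=2, max_iter=100):
    """Lloyd's k-means on scalar values; returns (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(max_iter):
        labels = [nearest(v) for v in values]
        updated = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    return [nearest(v) for v in values], centroids

def fit_stump(rows, labels, feature):
    """Best single-threshold split on one feature, judged by training error."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if r[feature] <= t else right) != y
                         for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, feature, t, left, right)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Simplified forest: each tree sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # random feature choice
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Hypothetical Lexile measures, one per text (assumed data).
lexile = [420, 480, 510, 560, 600, 1010, 1080, 1150, 1200, 1290]
# Hypothetical features per text: (mean sentence length, mean word length).
features = [(8, 3.9), (9, 4.0), (10, 4.1), (9, 4.2), (11, 4.0),
            (19, 5.0), (21, 5.2), (22, 5.1), (24, 5.4), (25, 5.3)]

# Stage 1: cluster texts by Lexile value; cluster 0 is "easy", 1 is "difficult".
labels, centroids = kmeans_1d(lexile, k=2)
# Stage 2: train the forest to predict the cluster label from text features.
forest = fit_forest(features, labels)

names = ["easy", "difficult"]
print(centroids)
print(names[predict(forest, (8, 3.9))], names[predict(forest, (25, 5.4))])
```

Because the initial centroids are taken in ascending order, cluster 0 always corresponds to the lower Lexile values, which is what lets the sketch map cluster indices to "easy" and "difficult" without a separate relabeling step.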


Text Classification, Lexile Levels, Clustering, K-Means, Random Forest





Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.