
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Comparative Evaluation of BM25–FAISS and Small-LLM–GPT in Retrieval-Augmented Generation Concept Map Assessment
Corresponding Author(s): Didik Dwi Prasetya
Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control,
Vol. 11, No. 1, February 2026
Abstract
Large Language Models (LLMs) have opened new opportunities for automated concept map-based assessment systems. One promising approach is Retrieval-Augmented Generation (RAG), which pairs a retrieval step that finds relevant information with a generation step that produces context-grounded assessments. This study compares two retrieval methods, BM25 (keyword matching) and FAISS (dense vector similarity), and two generative models, Small-LLM and GPT, on the task of assessing concept map propositions in the relational database domain. The FAISS–GPT combination performed best, with a Macro-F1 of 0.338, a Quadratic Weighted Kappa (QWK) of 0.146, and the lowest errors (MAE of 0.973, RMSE of 1.321), indicating a slight but noticeable improvement in agreement with expert scores over the other configurations. This combination also achieved an Explanation Relevance Score (ERS) of 0.79, reflecting GPT's ability to generate more relevant, consistent, and human-like explanations. In contrast, Small-LLM was less accurate but more efficient in computation time, making it a viable option for resource-constrained systems. Overall, the results indicate that combining FAISS dense retrieval with a large GPT language model can improve the accuracy, consistency, and semantic relevance of automated concept map assessment, and may thereby strengthen concept-based learning systems in digital education.
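For readers unfamiliar with the four agreement metrics reported above, the following is a minimal sketch of how Macro-F1, QWK, MAE, and RMSE can be computed from paired expert and model scores. It is not the authors' evaluation code: the function names, the integer 1 to 5 score scale, and the sample scores are all assumptions made for illustration.

```python
import math

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)

def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Agreement on an ordinal scale; larger disagreements are penalised quadratically."""
    n = max_score - min_score + 1
    total = len(y_true)
    observed = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    # den is zero only when all scores from both raters fall in one category.
    return 1.0 - num / den if den else 1.0

def mae(y_true, y_pred):
    """Mean absolute error between expert and model scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between expert and model scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

if __name__ == "__main__":
    # Illustrative scores only (assumed 1-5 scale); not data from the study.
    expert = [1, 2, 3, 4, 5]
    model = [2, 2, 3, 4, 4]
    print("Macro-F1:", round(macro_f1(expert, model), 3))
    print("QWK:", round(quadratic_weighted_kappa(expert, model, 1, 5), 3))
    print("MAE:", round(mae(expert, model), 3))
    print("RMSE:", round(rmse(expert, model), 3))
```

Macro-F1 treats each score level as a separate class and ignores how far apart two scores are, whereas QWK, MAE, and RMSE all reward near misses, which is why a configuration can rank differently across the four metrics.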