Development and Validation of an Algorithmic Randomization Model for a CEFR-Aligned English Proficiency Test in College Students

Jamjumrat Deeprom

doi:10.66947/pasaa.v73ispc.2068

Authors

Jamjumrat Deeprom, Lecturer, International College for Sustainability Studies, Srinakharinwirot University

Keywords:

Algorithm-based test generation, SWU-SET, CEFR, psychometric properties, computerized test system

Abstract

Amid growing demands for standardized, scalable, and equitable English language assessments in higher education, institutions face challenges related to item exposure, content imbalance, and the resource-intensive nature of manual test construction. This study developed and validated an algorithm-based item randomization and test assembly model for constructing CEFR-aligned assessments from a pre-validated item bank. Grounded in Assessment Engineering and expert input, the model comprises five components—Listening, Vocabulary, Usage and Functional Language, Structure, and Reading—mapped to CEFR levels A2 to C1. Phase 1 involved focus group consultation with English language specialists to design a randomized test blueprint. Phase 2 assessed psychometric properties using confirmatory factor analysis and reliability testing with 300 undergraduates. The test demonstrated excellent model fit, high internal consistency (α = .806–.894), and a clear factorial structure, with Usage and Functional Language emerging as the strongest predictor of overall proficiency. The algorithm ensured thematic balance, avoided item repetition, and upheld difficulty calibration—overcoming common challenges in manual test construction. These results support the model’s feasibility and relevance as a scalable solution for modernizing English language assessment in higher education.
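The abstract does not specify the assembly algorithm itself, so the following is only a minimal sketch of one way a blueprint-driven randomization could work: stratified random sampling without replacement across the five components and CEFR levels A2 to C1. The blueprint counts, the `theme` field, the `max_per_theme` cap, and the function names `assemble_form` and `make_demo_bank` are all hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict

# Components and CEFR range as named in the abstract.
COMPONENTS = [
    "Listening",
    "Vocabulary",
    "Usage and Functional Language",
    "Structure",
    "Reading",
]
LEVELS = ["A2", "B1", "B2", "C1"]

# Hypothetical blueprint: 2 items per (component, level) cell.
# These counts are illustrative only, not the study's actual blueprint.
BLUEPRINT = {(c, lvl): 2 for c in COMPONENTS for lvl in LEVELS}


def assemble_form(item_bank, blueprint, seed=None, max_per_theme=1):
    """Draw one test form by stratified random sampling without replacement.

    item_bank: list of dicts with keys "id", "component", "level", "theme".
    Within each cell, at most `max_per_theme` items sharing a theme are
    taken, as a simple stand-in for thematic balance (an assumption).
    """
    rng = random.Random(seed)

    # Group the bank by blueprint cell.
    cells = defaultdict(list)
    for item in item_bank:
        cells[(item["component"], item["level"])].append(item)

    form = []
    for cell, n_needed in blueprint.items():
        candidates = list(cells.get(cell, []))
        rng.shuffle(candidates)  # random order within the cell
        theme_counts = defaultdict(int)
        picked = []
        for item in candidates:
            if len(picked) == n_needed:
                break
            if theme_counts[item["theme"]] < max_per_theme:
                picked.append(item)
                theme_counts[item["theme"]] += 1
        if len(picked) < n_needed:
            raise ValueError(f"Item bank cannot fill cell {cell}")
        form.extend(picked)
    return form


def make_demo_bank(items_per_cell=6, n_themes=3):
    """Build a synthetic item bank for demonstration purposes only."""
    bank = []
    for c in COMPONENTS:
        for lvl in LEVELS:
            for k in range(items_per_cell):
                bank.append({
                    "id": f"{c[:3]}-{lvl}-{k}",
                    "component": c,
                    "level": lvl,
                    "theme": f"theme{k % n_themes}",
                })
    return bank


form = assemble_form(make_demo_bank(), BLUEPRINT, seed=1)
```

Because each item belongs to exactly one cell and is drawn at most once, no item repeats within a form, and fixing the difficulty level per cell keeps the level distribution identical across forms. Controlling item exposure across many forms would need additional bookkeeping that this sketch omits.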
