DOI: 10.57647/jntell.2026.0501.03

Implementing the Rasch Rating Scale Model for Assessing Writing Performance

  1. Department of Teaching English as a Foreign Language, NT.C., Islamic Azad University, Tehran, Iran
  2. Department of Teaching English as a Foreign Language, Ma.C., Islamic Azad University, Mashhad, Iran

Received: 2025-10-25

Revised: 2025-11-09

Accepted: 2025-12-20

Published in issue: 2026-03-31

How to Cite

Sahebalam, S., Baghaei, P., & Rashtchi, M. (2026). Implementing the Rasch Rating Scale Model for Assessing Writing Performance. Journal of New Trends in English Language Learning (JNTELL), 5(1). https://doi.org/10.57647/jntell.2026.0501.03

Abstract

Rater-mediated assessments require raters to make complex judgments about test-takers' performances, often expressed through ordinal rating scale categories. When the focus is on the quality of these judgments, it becomes essential to evaluate the ratings for their psychometric soundness, including validity, reliability, and fairness. In applied linguistics and second-language assessment research, the Many-Facet Rasch Model (MFRM; Linacre, 1989) has been widely used to investigate rater performance. However, the MFRM's complexity, combined with the limited availability of user-friendly software, presents challenges for researchers and practitioners. In this study, we propose the Rating Scale Model (RSM), a simpler and widely recognized framework, as an alternative for analyzing judged performance. To explore its utility, the RSM was applied to scores assigned by five raters to 156 compositions written by English as a foreign language (EFL) learners. The findings indicate that the RSM can successfully convert ordinal ratings into interval-level measures for both test-takers and tasks, while also allowing examination of item fit, person fit, rating scale functioning, and dimensionality. Moreover, the model proved capable of investigating interaction effects, such as the influence of rater and examinee gender, as well as the interplay between raters and specific items. Overall, this study demonstrates that the RSM, while more limited than the MFRM, offers a practical and accessible tool for evaluating the quality of assessor judgments in performance assessment. We also highlight both the promise and the constraints of relying on the RSM in this context, with implications for future inquiry and implementation.
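
For orientation, the contrast between the two models can be summarized in two standard equations (a sketch in common Rasch notation, not necessarily the symbols used in the article itself). In the Rating Scale Model (Andrich, 1978a), the log-odds of test-taker n receiving category k rather than k − 1 on item i are

\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k

where \theta_n is the test-taker's ability, \delta_i the difficulty of item i, and \tau_k a category threshold shared by all items. The Many-Facet Rasch Model (Linacre, 1989) extends this with an explicit rater severity parameter \alpha_j:

\ln\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k

Omitting the separate rater facet is what makes the RSM the simpler, more accessible alternative described above.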

Keywords

  • Rasch Rating Scale Model
  • Writing Performance
  • Assessment

References

  1. Andrich, D. (1978a). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
  2. Andrich, D. (2013). An expanded derivation of the threshold structure of the polytomous Rasch model that dispels any "threshold disorder controversy." Educational and Psychological Measurement, 73(1), 78–124. https://doi.org/10.1177/0013164412450877
  3. Arias, R. M. (2010). Performance assessment. Papeles del Psicólogo, 31(1), 85–96.
  4. Aryadoust, V., Tan, H. A. H., & Ng, L. Y. (2019). A scientometric review of Rasch measurement: The rise and progress of a specialty. Frontiers in Psychology, 10, Article 2197. https://doi.org/10.3389/fpsyg.2019.02197
  5. Baghaei, P. (2021). Mokken scale analysis in language assessment. Waxmann Verlag.
  6. Baghaei, P., & Effatpanah, F. (2024). Nonparametric kernel smoothing item response theory analysis of Likert items. Psych, 6(1), 236–259. https://doi.org/10.3390/psych6010015
  7. Baharudin, H., Maskor, Z. M., & Matore, M. E. E. M. (2022). The raters' differences in Arabic writing rubrics through the many-facet Rasch measurement model. Frontiers in Psychology, 13, Article 988272. https://doi.org/10.3389/fpsyg.2022.988272
  8. Bond, T. G., Yan, Z., & Heene, M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge. https://doi.org/10.4324/9780429030499
  9. Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110. https://doi.org/10.1191/0265532203lt245oa
  10. Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1–15. https://doi.org/10.1177/026553229501200101
  11. Cai, H., & Yan, X. (2024). Triangulating natural language processing (NLP)-based analysis of rater comments and many-facet Rasch measurement (MFRM): An innovative approach to investigating raters' application of rating scales in writing assessment. Language Testing, 41(2), 290–316. https://doi.org/10.1177/02655322231210231
  12. Cambridge English. (2024). Assessing writing for Cambridge English Qualifications: A guide for teachers. Cambridge University Press & Assessment. https://www.cambridgeenglish.org/images/231794-cambridge-english-assessing-writing-performance-at-level-b1.pdf
  13. Carr, N. T. (2011). Designing and analyzing language tests. Oxford University Press.
  14. Coe, R., McCaffrey, D. F., Casabianca, J. M., Lockwood, J. R., & Guarino, C. (2024). Monitoring rater quality in observational systems: Issues due to unreliable estimates of rater quality. Educational Assessment, 29(2), 124–146. https://doi.org/10.1080/10627197.2024.2354311
  15. Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: An investigation into raters' decision making and development of a preliminary analytic framework (TOEFL Monograph Series MS-22). Educational Testing Service.
  16. Dewberry, C., Davies-Muir, A., & Newell, S. (2013). Impact and causes of rater severity/leniency in appraisals without postevaluation communication between raters and ratees. International Journal of Selection and Assessment, 21(3), 286–293. https://doi.org/10.1111/ijsa.12038
  17. Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221. https://doi.org/10.1207/s15434311laq0203_2
  18. Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd rev. and updated ed.). Peter Lang. https://doi.org/10.3726/978-3-653-04844-5
  19. Eckes, T. (2023). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (3rd ed.). Peter Lang.
  20. Effatpanah, F., & Baghaei, P. (2022). Exploring rater quality in rater-mediated assessment using the nonparametric item characteristic curve estimation. Psychological Test and Assessment Modeling, 64(3), 216–252.
  21. Effatpanah, F., & Baghaei, P. (2024). Examining the dimensionality of linguistic features in L2 writing using the Rasch measurement model. Educational Methods & Psychometrics, 2, 1–22. https://doi.org/10.61186/emp.2024.3
  22. Elder, C. (1993). How do subject specialists construe classroom language proficiency? Language Testing, 10(3), 235–254. https://doi.org/10.1177/026553229301000303
  23. Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing, 24(1), 37–64. https://doi.org/10.1177/0265532207071511
  24. Engelhard, G., Jr. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge/Taylor & Francis Group.
  25. Engelhard, G., Jr., & Wang, J. (2024). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences (2nd ed.). Routledge. https://doi.org/10.4324/9781003458746
  26. Gordon, R. A., Peng, F., Curby, T. W., & Zinsser, K. M. (2021). An introduction to the many-facet Rasch model as a method to improve observational quality measures with an application to measuring the teaching of emotion skills. Early Childhood Research Quarterly, 55, 149–164. https://doi.org/10.1016/j.ecresq.2020.11.005
  27. Hamp-Lyons, L. (2007). Worrying about rating. Assessing Writing, 12(1), 1–9. https://doi.org/10.1016/j.asw.2007.05.002
  28. Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5(1), 64–86. https://doi.org/10.1037/1082-989X.5.1.64
  29. Huang, J., & Chen, G. (2022). Individualized feedback to raters in language assessment: Impacts on rater effects. Assessing Writing, 52, Article 100623. https://doi.org/10.1016/j.asw.2022.100623
  30. In'nami, Y., & Koizumi, R. (2015). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341–366. https://doi.org/10.1177/0265532215587390
  31. Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17. https://doi.org/10.1111/j.1745-3992.1999.tb00010.x
  32. Kissi, P. S., Arthur, Y. D., Baffoe, S., Kontor, S. D., Amuzu, I. K., Ezuma, B. I., & Affum-Osei, E. (2022). Item and rater variabilities in students' evaluation of teaching in a university in Ghana: Application of many-facet Rasch model. Heliyon, 8(12), Article e12548. https://doi.org/10.1016/j.heliyon.2022.e12548
  33. Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19(1), 3–31. https://doi.org/10.1191/0265532202lt218oa
  34. Kuiken, F., & Vedder, I. (2014). Rating written performance: What do raters do and why? Language Testing, 31(3), 329–348. https://doi.org/10.1177/0265532214526174
  35. Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2015). Handbook of test development (2nd ed.). Routledge. https://doi.org/10.4324/9780203102961
  36. Li, X., Jiang, Y., & Yang, L. (2025). Assessing the reliability and relevance of DeepSeek in EFL writing evaluation: A generalizability theory approach. Language Testing in Asia, 15, Article 8. https://doi.org/10.1186/s40468-025-00369-6
  37. Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560. https://doi.org/10.1177/0265532211406422
  38. Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
  39. Linacre, J. M. (1994). Sample size and item calibration [or person measure] stability. Rasch Measurement Transactions, 7(4), 328. https://www.rasch.org/rmt/rmt74m.htm
  40. Linacre, J. M. (2023a). Winsteps® (Version 5.6.0) [Computer software]. Winsteps.com. https://www.winsteps.com/
  41. Linacre, J. M. (2023b). Winsteps® Rasch measurement computer program user's guide (Version 5.6.0). Winsteps.com.
  42. Ling, G., Mollaun, P., & Xi, X. (2014). A study on the impact of fatigue on human raters when scoring speaking responses. Language Testing, 31(4), 479–499. https://doi.org/10.1177/0265532214530699
  43. Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15(2), 158–180. https://doi.org/10.1177/026553229801500202
  44. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
  45. McNamara, T. (1996). Measuring second language performance. Longman.
  46. McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 555–576. https://doi.org/10.1177/0265532211430367
  47. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan Publishing, American Council on Education.
  48. Mohd Noh, M. F., & Mohd Matore, M. E. E. (2022). Rater severity differences in English language as a second language speaking assessment based on rating experience, training experience, and teaching experience through many-faceted Rasch measurement analysis. Frontiers in Psychology, 13, Article 941084. https://doi.org/10.3389/fpsyg.2022.941084
  49. Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422. http://jampress.org/abst.htm
  50. Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227.
  51. O'Grady, S. (2025). Task design and rater effects in task-based language assessment. TESOL Journal, 16(1), Article e904. https://doi.org/10.1002/tesj.904
  52. Palermo, C. (2022). Rater characteristics, response content, and scoring contexts: Decomposing the determinants of scoring accuracy. Frontiers in Psychology, 13, Article 937097. https://doi.org/10.3389/fpsyg.2022.937097
  53. Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.
  54. Song, D., Wilson, C., Sun, G., Bai, H., Chen, T., Yu, S., & Huang, J. (2025). Exploring LLM autoscoring reliability in large-scale writing assessments using generalizability theory. arXiv. https://arxiv.org/abs/2507.19980
  55. Uto, M. (2021a). Accuracy of performance-test linking based on a many-facet Rasch model. Behavior Research Methods, 53(4), 1440–1454. https://doi.org/10.3758/s13428-020-01498-x
  56. Uto, M. (2021b). A multidimensional generalized many-facet Rasch model for rubric-based performance assessment. Behaviormetrika, 48(4), 425–457. https://doi.org/10.1007/s41237-021-00144-w
  57. Uto, M., Tsuruta, J., Araki, K., & Ueno, M. (2024). Item response theory model highlighting rating scale of a rubric and rater-rubric interaction in objective structured clinical examination. PLoS ONE, 19(9), Article e0309887. https://doi.org/10.1371/journal.pone.0309887
  58. Uto, M., & Ueno, M. (2023). A Bayesian many-facet Rasch model with Markov modeling for rater severity drift. Behavior Research Methods, 55(7), 3910–3928. https://doi.org/10.3758/s13428-022-01997-z
  59. Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Palgrave Macmillan. https://doi.org/10.1057/9780230514577
  60. Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–335. https://doi.org/10.1177/026553229301000306
  61. Wind, S. A., & Jones, E. (2019). The effects of incomplete rating designs in combination with rater effects. Journal of Educational Measurement, 56(1), 76–100. https://doi.org/10.1111/jedm.12201
  62. Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. https://doi.org/10.1177/0265532216686999
  63. Wright, B. D., & Stone, M. H. (1979). Best test design. MESA Press.
  64. Wu, T., Kim, S. Y., Westine, C., & Boyer, M. (2025). IRT observed-score equating for rater-mediated assessments using a hierarchical rater model. Journal of Educational Measurement, 62(1), 145–171. https://doi.org/10.1111/jedm.12425
  65. Wu, Y., & Han, C. (2024). Raters' scoring process in assessment of interpreting: An empirical study based on eye tracking and retrospective verbalisation. Language Assessment Quarterly, 21(4–5), 400–422. https://doi.org/10.1080/1750399X.2024.2326400
  66. Yessimov, B., Hussein, R. A., Mohammed, A., Hassan, A. Y., Hashim, A., Najeeb, S. S., Mohammed Ali, Y., Abdullah, A. S., & Afif, N. S. (2023). Detecting measurement disturbance: Graphical illustrations of item characteristic curves. International Journal of Language Testing, 13(Special Issue), 126–133. https://doi.org/10.22034/ijlt.2023.391731.1247