DOI: 10.57647/jntell.2026.0501.03

Implementing the Rasch Rating Scale Model for Assessing Writing Performance

  1. Department of Teaching English as a Foreign Language, NT.C., Islamic Azad University, Tehran, Iran
  2. Department of Teaching English as a Foreign Language, Ma.C., Islamic Azad University, Mashhad, Iran

Received: 2025-10-25

Revised: 2025-11-09

Accepted: 2025-12-20

Published in issue: 2026-03-31

How to Cite

Sahebalam, S., Baghaei, P., & Rashtchi, M. (2026). Implementing the Rasch Rating Scale Model for Assessing Writing Performance. Journal of New Trends in English Language Learning (JNTELL), 5(1). https://doi.org/10.57647/jntell.2026.0501.03

Abstract

Rater-mediated assessments require raters to make complex judgments about test-takers' performances, often expressed through ordinal rating scale categories. When the focus is on the quality of these judgments, it becomes essential to evaluate the ratings for their psychometric soundness, including validity, reliability, and fairness. In applied linguistics and second-language assessment research, the Many-Facet Rasch Model (MFRM; Linacre, 1989) has been widely used to investigate rater performance. However, the MFRM's complexity, combined with the limited availability of user-friendly software, presents challenges for researchers and practitioners. In this study, we propose the Rating Scale Model (RSM), a simpler and widely recognized framework, as an alternative for analyzing judged performance. To explore its utility, the RSM was applied to scores assigned by five raters to 156 compositions written by English as a foreign language (EFL) learners. The findings indicate that the RSM can successfully convert ordinal ratings into interval-level measures for both test-takers and tasks, while also allowing examination of item fit, person fit, rating scale functioning, and dimensionality. Moreover, the model proved capable of investigating interaction effects, such as the influence of rater and examinee gender, as well as the interplay between raters and specific items. Overall, this study demonstrates that the RSM, while more limited than the MFRM, offers a practical and accessible tool for evaluating the quality of assessor judgments in performance assessment. We also highlight both the promise and the constraints of relying on the RSM in this context, with implications for future inquiry and implementation.
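
For orientation, the contrast between the two models can be summarized in two standard equations (a sketch in common Rasch notation, not necessarily the symbols used in the article itself). In the Rating Scale Model (Andrich, 1978a), the log-odds of test-taker n receiving category k rather than k − 1 on item i are

\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k

where \theta_n is the test-taker's ability, \delta_i the difficulty of item i, and \tau_k a category threshold shared by all items. The Many-Facet Rasch Model (Linacre, 1989) extends this with an explicit rater severity parameter \alpha_j:

\ln\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k

Omitting the separate rater facet is what makes the RSM the simpler, more accessible alternative described above.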

Keywords

  • Rasch Rating Scale Model
  • Writing Performance
  • Assessment

References

  1. Andrich, D. (1978a). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814
  2. Andrich, D. (2013). An expanded derivation of the threshold structure of the polytomous Rasch model that dispels any "threshold disorder controversy." Educational and Psychological Measurement, 73(1), 78–124. https://doi.org/10.1177/0013164412450877
  3. Arias, R. M. (2010). Performance assessment. Papeles del Psicólogo, 31(1), 85–96.
  4. Aryadoust, V., Tan, H. A. H., & Ng, L. Y. (2019). A scientometric review of Rasch measurement: The rise and progress of a specialty. Frontiers in Psychology, 10, Article 2197. https://doi.org/10.3389/fpsyg.2019.02197
  5. Baghaei, P. (2021). Mokken scale analysis in language assessment. Waxmann Verlag.
  6. Baghaei, P., & Effatpanah, F. (2024). Nonparametric kernel smoothing item response theory analysis of Likert items. Psych, 6(1), 236–259. https://doi.org/10.3390/psych6010015
  7. Baharudin, H., Maskor, Z. M., & Matore, M. E. E. M. (2022). The raters' differences in Arabic writing rubrics through the many-facet Rasch measurement model. Frontiers in Psychology, 13, Article 988272. https://doi.org/10.3389/fpsyg.2022.988272
  8. Bond, T. G., Yan, Z., & Heene, M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge. https://doi.org/10.4324/9780429030499
  9. Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110. https://doi.org/10.1191/0265532203lt245oa
  10. Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1–15. https://doi.org/10.1177/026553229501200101
  11. Cai, H., & Yan, X. (2024). Triangulating natural language processing (NLP)-based analysis of rater comments and many-facet Rasch measurement (MFRM): An innovative approach to investigating raters' application of rating scales in writing assessment. Language Testing, 41(2), 290–316. https://doi.org/10.1177/02655322231210231
  12. Cambridge English. (2024). Assessing writing for Cambridge English Qualifications: A guide for teachers. Cambridge University Press & Assessment. https://www.cambridgeenglish.org/images/231794-cambridge-english-assessing-writing-performance-at-level-b1.pdf
  13. Carr, N. T. (2011). Designing and analyzing language tests. Oxford University Press.
  14. Coe, R., McCaffrey, D. F., Casabianca, J. M., Lockwood, J. R., & Guarino, C. (2024). Monitoring rater quality in observational systems: Issues due to unreliable estimates of rater quality. Educational Assessment, 29(2), 124–146. https://doi.org/10.1080/10627197.2024.2354311
  15. Cumming, A., Kantor, R., & Powers, D. E. (2001). Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: An investigation into raters' decision making and development of a preliminary analytic framework (TOEFL Monograph Series MS-22). Educational Testing Service.
  16. Dewberry, C., Davies-Muir, A., & Newell, S. (2013). Impact and causes of rater severity/leniency in appraisals without postevaluation communication between raters and ratees. International Journal of Selection and Assessment, 21(3), 286–293. https://doi.org/10.1111/ijsa.12038
  17. Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221. https://doi.org/10.1207/s15434311laq0203_2
  18. Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd rev. and updated ed.). Peter Lang. https://doi.org/10.3726/978-3-653-04844-5
  19. Eckes, T. (2023). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (3rd ed.). Peter Lang.
  20. Effatpanah, F., & Baghaei, P. (2022). Exploring rater quality in rater-mediated assessment using the nonparametric item characteristic curve estimation. Psychological Test and Assessment Modeling, 64(3), 216–252.
  21. Effatpanah, F., & Baghaei, P. (2024). Examining the dimensionality of linguistic features in L2 writing using the Rasch measurement model. Educational Methods & Psychometrics, 2, 1–22. https://doi.org/10.61186/emp.2024.3
  22. Elder, C. (1993). How do subject specialists construe classroom language proficiency? Language Testing, 10(3), 235–254. https://doi.org/10.1177/026553229301000303
  23. Elder, C., Barkhuizen, G., Knoch, U., & von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing, 24(1), 37–64. https://doi.org/10.1177/0265532207071511
  24. Engelhard, G., Jr. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge/Taylor & Francis Group.
  25. Engelhard, G., Jr., & Wang, J. (2024). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences (2nd ed.). Routledge. https://doi.org/10.4324/9781003458746
  26. Gordon, R. A., Peng, F., Curby, T. W., & Zinsser, K. M. (2021). An introduction to the many-facet Rasch model as a method to improve observational quality measures with an application to measuring the teaching of emotion skills. Early Childhood Research Quarterly, 55, 149–164. https://doi.org/10.1016/j.ecresq.2020.11.005
  27. Hamp-Lyons, L. (2007). Worrying about rating. Assessing Writing, 12(1), 1–9. https://doi.org/10.1016/j.asw.2007.05.002
  28. Hoyt, W. T. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5(1), 64–86. https://doi.org/10.1037/1082-989X.5.1.64
  29. Huang, J., & Chen, G. (2022). Individualized feedback to raters in language assessment: Impacts on rater effects. Assessing Writing, 52, Article 100623. https://doi.org/10.1016/j.asw.2022.100623
  30. In'nami, Y., & Koizumi, R. (2015). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341–366. https://doi.org/10.1177/0265532215587390
  31. Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17. https://doi.org/10.1111/j.1745-3992.1999.tb00010.x
  32. Kissi, P. S., Arthur, Y. D., Baffoe, S., Kontor, S. D., Amuzu, I. K., Ezuma, B. I., & Affum-Osei, E. (2022). Item and rater variabilities in students' evaluation of teaching in a university in Ghana: Application of many-facet Rasch model. Heliyon, 8(12), Article e12548. https://doi.org/10.1016/j.heliyon.2022.e12548
  33. Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19(1), 3–31. https://doi.org/10.1191/0265532202lt218oa
  34. Kuiken, F., & Vedder, I. (2014). Rating written performance: What do raters do and why? Language Testing, 31(3), 329–348. https://doi.org/10.1177/0265532214526174
  35. Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2015). Handbook of test development (2nd ed.). Routledge. https://doi.org/10.4324/9780203102961
  36. Li, X., Jiang, Y., & Yang, L. (2025). Assessing the reliability and relevance of DeepSeek in EFL writing evaluation: A generalizability theory approach. Language Testing in Asia, 15, Article 8. https://doi.org/10.1186/s40468-025-00369-6
  37. Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543–560. https://doi.org/10.1177/0265532211406422
  38. Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
  39. Linacre, J. M. (1994). Sample size and item calibration [or person measure] stability. Rasch Measurement Transactions, 7(4), 328. https://www.rasch.org/rmt/rmt74m.htm
  40. Linacre, J. M. (2023a). Winsteps® (Version 5.6.0) [Computer software]. Winsteps.com. https://www.winsteps.com/
  41. Linacre, J. M. (2023b). Winsteps® Rasch measurement computer program user's guide (Version 5.6.0). Winsteps.com.
  42. Ling, G., Mollaun, P., & Xi, X. (2014). A study on the impact of fatigue on human raters when scoring speaking responses. Language Testing, 31(4), 479–499. https://doi.org/10.1177/0265532214530699
  43. Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15(2), 158–180. https://doi.org/10.1177/026553229801500202
  44. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272
  45. McNamara, T. (1996). Measuring second language performance. Longman.
  46. McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 555–576. https://doi.org/10.1177/0265532211430367
  47. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan Publishing, American Council on Education.
  48. Mohd Noh, M. F., & Mohd Matore, M. E. E. (2022). Rater severity differences in English language as a second language speaking assessment based on rating experience, training experience, and teaching experience through many-faceted Rasch measurement analysis. Frontiers in Psychology, 13, Article 941084. https://doi.org/10.3389/fpsyg.2022.941084
  49. Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422. http://jampress.org/abst.htm
  50. Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189–227.
  51. O'Grady, S. (2025). Task design and rater effects in task-based language assessment. TESOL Journal, 16(1), Article e904. https://doi.org/10.1002/tesj.904
  52. Palermo, C. (2022). Rater characteristics, response content, and scoring contexts: Decomposing the determinants of scoring accuracy. Frontiers in Psychology, 13, Article 937097. https://doi.org/10.3389/fpsyg.2022.937097
  53. Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research.
  54. Song, D., Wilson, C., Sun, G., Bai, H., Chen, T., Yu, S., & Huang, J. (2025). Exploring LLM autoscoring reliability in large-scale writing assessments using generalizability theory. arXiv. https://arxiv.org/abs/2507.19980
  55. Uto, M. (2021a). Accuracy of performance-test linking based on a many-facet Rasch model. Behavior Research Methods, 53(4), 1440–1454. https://doi.org/10.3758/s13428-020-01498-x
  56. Uto, M. (2021b). A multidimensional generalized many-facet Rasch model for rubric-based performance assessment. Behaviormetrika, 48(4), 425–457. https://doi.org/10.1007/s41237-021-00144-w
  57. Uto, M., Tsuruta, J., Araki, K., & Ueno, M. (2024). Item response theory model highlighting rating scale of a rubric and rater-rubric interaction in objective structured clinical examination. PLoS ONE, 19(9), Article e0309887. https://doi.org/10.1371/journal.pone.0309887
  58. Uto, M., & Ueno, M. (2023). A Bayesian many-facet Rasch model with Markov modeling for rater severity drift. Behavior Research Methods, 55(7), 3910–3928. https://doi.org/10.3758/s13428-022-01997-z
  59. Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Palgrave Macmillan. https://doi.org/10.1057/9780230514577
  60. Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–335. https://doi.org/10.1177/026553229301000306
  61. Wind, S. A., & Jones, E. (2019). The effects of incomplete rating designs in combination with rater effects. Journal of Educational Measurement, 56(1), 76–100. https://doi.org/10.1111/jedm.12201
  62. Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161–192. https://doi.org/10.1177/0265532216686999
  63. Wright, B. D., & Stone, M. H. (1979). Best test design. MESA Press.
  64. Wu, T., Kim, S. Y., Westine, C., & Boyer, M. (2025). IRT observed-score equating for rater-mediated assessments using a hierarchical rater model. Journal of Educational Measurement, 62(1), 145–171. https://doi.org/10.1111/jedm.12425
  65. Wu, Y., & Han, C. (2024). Raters' scoring process in assessment of interpreting: An empirical study based on eye tracking and retrospective verbalisation. Language Assessment Quarterly, 21(4–5), 400–422. https://doi.org/10.1080/1750399X.2024.2326400
  66. Yessimov, B., Hussein, R. A., Mohammed, A., Hassan, A. Y., Hashim, A., Najeeb, S. S., Mohammed Ali, Y., Abdullah, A. S., & Afif, N. S. (2023). Detecting measurement disturbance: Graphical illustrations of item characteristic curves. International Journal of Language Testing, 13(Special Issue), 126–133. https://doi.org/10.22034/ijlt.2023.391731.1247