Richard J. Patz, Brian W. Junker and Matthew S. Johnson
Single and multiple ratings of test items have become a stock component of standardized educational tests and surveys. For both formative and summative evaluation of raters, a number of multiple-read rating designs are now commonplace (Wilson and Hoskens, 1999), including designs with as many as six raters per item (e.g., Sykes, Heidorn and Lee, 1999). As digital image-based distributed rating becomes widespread, we expect the use of multiple raters as a routine part of test scoring to grow; increasing the number of raters also raises the possibility of improving the precision of examinee proficiency estimates. In this paper we develop Patz's (1996) hierarchical rater model (HRM) for polytomously scored item response data, and show how it can be used, for example, to scale examinees and items, to model aspects of consensus among raters, and to model individual rater severity and consistency effects. The HRM treats examinee responses to open-ended items as unobserved discrete variables, and it explicitly models the ``proficiency'' of raters in assigning accurate scores as well as the proficiency of examinees in providing correct responses. We show how the HRM ``fits in'' to the generalizability theory framework that has been the traditional analysis tool for rated item response data, and give some relationships between the HRM, the design effects correction of Bock, Brennan and Muraki (1999), and the rater bundles model of Wilson and Hoskens (1999). Using simulated data, we compare analyses based on the HRM with those based on the conventional IRT Facets model for rating data (e.g., Linacre, 1989; Engelhard, 1994, 1996), and we illustrate parameter recovery for the HRM. We also analyze data from a study of three different rating modalities intended to support a Grade 5 mathematics exam given in the State of Florida, to show how the HRM can be used to identify individual raters of poor reliability or excessive severity, how standard errors of estimation of examinee scale scores are affected by multiple reads, and how the HRM scales up to rating designs involving large numbers of raters.
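As a rough sketch of the two-stage structure just described (the notation below is ours rather than a quotation of the paper's equations, and the partial credit form for the first stage is one common choice), let $\theta_i$ denote the proficiency of examinee $i$, let $\xi_{ij}$ denote the unobserved ``ideal rating'' that examinee $i$'s response to item $j$ deserves, and let $X_{ijr}$ denote the rating actually assigned by rater $r$. One might then write
\[
  P(\xi_{ij} = \xi \mid \theta_i) \;\propto\; \exp\Bigl\{ \sum_{k=1}^{\xi} \bigl( \theta_i - \beta_{jk} \bigr) \Bigr\},
  \qquad
  P(X_{ijr} = k \mid \xi_{ij} = \xi) \;\propto\; \exp\Bigl\{ -\frac{(k - \xi - \phi_r)^2}{2\psi_r^2} \Bigr\},
\]
with the empty sum taken to be zero when $\xi = 0$. Here the $\beta_{jk}$ are item step parameters, $\phi_r$ captures a systematic tendency of rater $r$ to score above or below the ideal category (leniency or severity), and $\psi_r$ captures rater noise, with small $\psi_r$ indicating a highly consistent rater.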