When analysts evaluate performance assessments, they often use modern measurement theory models to identify raters who frequently give ratings that differ from what would be expected given the quality of the performance. Two rater fit statistics, the infit and outfit mean square error (MSE) statistics, are routinely used to detect problematic scoring patterns. However, interpreting these statistics is not straightforward. In common practice, researchers apply established rule-of-thumb critical values to interpret infit and outfit MSE statistics; unfortunately, prior studies have shown that these rule-of-thumb values may not be appropriate in many empirical situations. Parametric bootstrapped critical values for infit and outfit MSE statistics offer a promising alternative for identifying item and person misfit in item response theory (IRT) analyses, but researchers have not examined the performance of this approach for detecting rater misfit. In this study, we illustrate a parametric bootstrap procedure that researchers can use to identify critical values for infit and outfit MSE statistics, and we use a simulation study to assess the false-positive and true-positive rates of these two statistics. We observed that the false-positive rates were highly inflated and the true-positive rates were relatively low; thus, we propose an iterative parametric bootstrap procedure to overcome these limitations. The results indicated that using the iterative procedure to establish 95% critical values for infit and outfit MSE statistics yielded better-controlled false-positive rates and higher true-positive rates than the traditional parametric bootstrap procedure and rule-of-thumb critical values.
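To make the two procedures concrete, the sketch below illustrates their general logic under deliberately simplified assumptions that are not the authors' implementation: dichotomous ratings generated from a rater-severity Rasch model, person parameters treated as known, a crude fixed-person severity estimator, and illustrative function names and settings. The first routine draws parametric bootstrap samples and takes empirical percentiles of each rater's infit and outfit MSE as critical values; the second wraps it in a flag-drop-re-estimate loop in the spirit of the iterative procedure.

```python
import numpy as np

rng = np.random.default_rng(2021)

def simulate(theta, severity):
    """Simulate dichotomous ratings under a simple rater-severity Rasch model,
    P(x = 1) = logistic(theta_n - severity_r). (Illustrative model choice.)"""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - severity[None, :])))
    return (rng.random(p.shape) < p).astype(float), p

def infit_outfit(x, p):
    """Infit (information-weighted) and outfit (unweighted) mean square
    statistics for each rater, i.e., each column of the rating matrix x."""
    resid2 = (x - p) ** 2
    var = p * (1.0 - p)
    infit = resid2.sum(axis=0) / var.sum(axis=0)
    outfit = (resid2 / var).mean(axis=0)
    return infit, outfit

def estimate_severity(x, theta, newton_steps=25):
    """Crude ML estimate of each rater's severity with person parameters held
    fixed -- a simplification; real analyses estimate all facets jointly."""
    sev = np.zeros(x.shape[1])
    for _ in range(newton_steps):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - sev[None, :])))
        sev += (p - x).sum(axis=0) / (p * (1.0 - p)).sum(axis=0)
    return sev

def bootstrap_critical_values(theta, severity, B=500, alpha=0.05):
    """Parametric bootstrap: simulate B model-conforming data sets from the
    estimated parameters and take empirical percentiles per rater."""
    stats = np.empty((B, 2, severity.size))
    for b in range(B):
        x, p = simulate(theta, severity)
        stats[b, 0], stats[b, 1] = infit_outfit(x, p)
    lo = np.percentile(stats, 100 * alpha / 2, axis=0)        # shape (2, n_raters)
    hi = np.percentile(stats, 100 * (1 - alpha / 2), axis=0)
    return lo, hi

def iterative_flagging(x_obs, theta, B=500, alpha=0.05, max_iter=10):
    """Iterative variant: flag raters whose observed infit/outfit fall outside
    the bootstrap interval, drop them, re-estimate severities and re-derive
    critical values from the remaining raters, and repeat until stable."""
    keep = np.ones(x_obs.shape[1], dtype=bool)
    for _ in range(max_iter):
        idx = np.flatnonzero(keep)
        sev = estimate_severity(x_obs[:, idx], theta)
        lo, hi = bootstrap_critical_values(theta, sev, B, alpha)
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - sev[None, :])))
        infit, outfit = infit_outfit(x_obs[:, idx], p)
        misfit = ((infit < lo[0]) | (infit > hi[0]) |
                  (outfit < lo[1]) | (outfit > hi[1]))
        if not misfit.any():
            break
        keep[idx[misfit]] = False
    return np.flatnonzero(~keep)   # indices of flagged raters

# Example: 300 examinees, 12 raters, with rater 0 scoring at random (misfit).
theta = rng.normal(size=300)
x, _ = simulate(theta, rng.normal(scale=0.5, size=12))
x[:, 0] = rng.integers(0, 2, size=300).astype(float)
print("flagged raters:", iterative_flagging(x, theta))
```

In an actual rater-mediated analysis, the person and severity estimates would come from a many-facet Rasch or polytomous IRT program rather than the toy estimator above, and the same bootstrap-flag-re-estimate loop would wrap that estimation step; the iteration matters because initial parameter estimates contaminated by misfitting raters can distort the critical values themselves.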
Keywords: false-positive rates; parametric bootstrap method; rater-mediated assessment; true-positive rates.