PCRC
Interrater Reliability Conditional on Literacy Behavior
Here you can find additional information and tables to accompany my poster presentation from the Pacific Coast Research Conference in San Diego, CA.
This presentation summarized statistical evidence indicating that interrater reliability depends on the literacy behavior being observed. Observation data were obtained using the Systematic Observation of Language and Reading (SOLR). The tool captures 25 behaviors, grouped into seven behavior categories, specific to literacy engagement for students with intellectual disability (ID). A trained observer codes each behavior as either observed (1) or not observed (0) at each 30-second interval during the observed literacy instruction. The seven behavior categories are (a) language development, (b) abstract thinking, (c) elaboration, (d) print, (e) engagement, (f) fluency and prosody, and (g) off-task/refusal behaviors.
Three raters independently coded 8 to 10 minutes of video of three students with ID during literacy instruction in a curriculum designed for students with disabilities, Friends on the Block (FOTB; Allor et al., 2018). Before coding independently, the raters practiced coding segments of literacy instruction together until they reached 100% agreement. Overall, across all behaviors, observed percent agreement among the three raters was 89.7%, which is arguably very high. Observed agreement alone, however, is an inadequate estimate of reliability because it does not account for agreement expected by chance.
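The text does not specify exactly how the three-rater percent agreement was computed; a minimal sketch, assuming it is the average of the three pairwise agreements over the interval-by-interval 0/1 codes (all rater data below are hypothetical, not the study's data):

```python
from itertools import combinations

def percent_agreement(ratings):
    """Average pairwise percent agreement across raters.

    `ratings` is a list of equal-length sequences of 0/1 codes,
    one sequence per rater (one code per 30-second interval).
    """
    pair_scores = [
        sum(a == b for a, b in zip(ra, rb)) / len(ra)
        for ra, rb in combinations(ratings, 2)
    ]
    return sum(pair_scores) / len(pair_scores)

# Hypothetical codes for three raters over ten 30-second intervals
r1 = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
r2 = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
r3 = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]

overall = percent_agreement([r1, r2, r3])  # ≈ 0.867
```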
Multiple interrater reliability statistics were considered (e.g., kappa, Krippendorff's alpha, Gwet's AC1). Gwet's AC1 was the most appropriate statistic in the presence of such high observed agreement (Gwet, 2008); here it was computed for one pair of raters at a time. Provided below, results from each pair of raters (1 & 2, 1 & 3, and 2 & 3) indicate high reliability for each pair, with little variability. The results comparing raters over the entire measure, including Gwet AC1 estimates, standard errors, and upper and lower confidence bounds, have been analyzed and will be shared via the presentation.
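For two raters and binary codes, Gwet's AC1 replaces kappa's chance-agreement term with one based on the average prevalence of each category, which keeps the statistic stable when one code dominates, as it does under high observed agreement. A minimal sketch with hypothetical data (a Cohen's kappa function is included only for contrast; it is not part of the reported analysis):

```python
def gwet_ac1(r1, r2):
    """Gwet's AC1 for two raters and binary (0/1) codes (Gwet, 2008).

    AC1 = (pa - pe) / (1 - pe), where pe is chance agreement based on
    the mean prevalence of each category across the two raters.
    """
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n   # observed agreement
    pi1 = (sum(r1) / n + sum(r2) / n) / 2          # mean prevalence of code 1
    pe = 2 * pi1 * (1 - pi1)                       # chance agreement for Q = 2
    return (pa - pe) / (1 - pe)

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters and binary codes, for comparison."""
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n
    p1a, p1b = sum(r1) / n, sum(r2) / n
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)         # chance agreement per rater margins
    return (pa - pe) / (1 - pe)

# Hypothetical high-prevalence codes: 9 of 10 intervals agree
r1 = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]
r2 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]

# AC1 stays high (≈ 0.87) while kappa drops (≈ 0.62) on the same data,
# illustrating why AC1 suits settings with high observed agreement.
ac1 = gwet_ac1(r1, r2)
kappa = cohens_kappa(r1, r2)
```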
Based on examination of both the Gwet AC1 estimates and their overlapping (or non-overlapping) confidence bounds, rater reliability for the engagement and fluency/prosody categories appears lower than for the other behavior categories. The tables compare each pair of raters across individual behavior categories, reporting the Gwet AC1 estimate, standard error, and upper and lower confidence bounds. We interpret these findings to indicate that obtaining reliable literacy behavior observations for students with ID and autism spectrum disorder (ASD) depends on the specific behaviors observed, and that additional rater training may be needed for some literacy behaviors.