Research Article 4

Holistic bias as a source of intrarater unreliability in analytic writing assessment

Paul Kavanagh

University of Leicester


In assessing the productive skills of speaking and writing, the ability of assessors to apply rating scales consistently throughout a series of individual ratings is fundamental to judgments about reliability. This issue of intrarater reliability has been studied extensively. However, much of this research has been done under explicitly experimental conditions, where participants have been aware of the aspect of their performance being studied. To elicit more natural rater behaviour with minimal external controls, this study employed an internet survey using matched pairs of short extracts of students’ writing. Ten writers of varying levels each produced two samples of writing of similar length and quality. To identify close matches, a control group of twelve accredited examiners was asked to rate each sample individually, using a provided rating scale for grammar and vocabulary. From the combined ratings given by the group, five pairs were found to be closely matched on both criteria. These ten samples formed the basis of an internet survey taken by 25 teachers. In the first half of the survey, teachers were asked to rate the first sample of each pair according to the same rating scale used by the expert group. In the second half of the survey, teachers gave ratings for only one criterion on each sample, either grammar or vocabulary. However, to assess the extent to which they might be influenced by external cues, teachers were also given additional information. On some samples, teachers were given the consensus opinion reached previously by the expert group as to the other criterion. On others, teachers were given the overall ranking of the writer amongst the initial group of ten participants. A comparison was then made between ratings made independently and ratings made after the consensus opinion or ranking became known.
The effect of introducing external cues was not significant for three of the five matched pairs, namely those whose samples scored in the mid-range of the scale. However, evidence was found of a substantial change in the overall ratings given for the samples which placed closer to the extremes of the scale.

Keywords: Language learning; intrarater reliability; examination; assessment.