TQB: Teacher Quality Bulletin

A fault in our measures? Evidence of bias in classroom observations may raise some familiar concerns

A lot of us can rattle off the possible shortcomings of using value-added test scores to evaluate teachers: The scores vary from year to year. They lack transparency. They cannot control for confounding factors, like broken air conditioning in the classroom or teachers consistently being assigned exceptionally motivated students.

Sad to say, these problems aren't confined to the statistical wizardry behind VAM: it turns out classroom observation scores may suffer from many of the same ills.

New research from Matthew Steinberg of the University of Pennsylvania and Rachel Garrett of the American Institutes for Research uses data from the Measures of Effective Teaching (MET) study to look at how classroom composition relates to teachers' observation scores.

First, they found that teachers assigned to high-performing students were more likely to earn higher observation scores. They also found that some domains of the evaluation instrument used in the study (the Danielson framework) appeared to give teachers undue credit for traits, achievement levels, and other characteristics students arrived with at the start of the school year; domains like "engaging students in learning" and "establishing a culture for learning," which rely heavily on student-teacher interaction, were the primary culprits.

The authors offer two competing hypotheses for these higher scores. They could be a sign of observer bias: a teacher might get a score boost for having an eager, well-behaved class, even if she basically inherited her students that way. Or they could indicate that teachers either perform better, or genuinely become better, when working with a class of higher-achieving students.

Finally, teachers who teach multiple subjects, like most elementary teachers, had observation scores that were less related to their students' incoming achievement than the scores of teachers who teach a single subject (and therefore teach older grades). This difference could arise because single-subject teachers' observations are spread across multiple classrooms, or because teachers who spend more time with one group of students are better able to adjust to their needs.

So, would Steinberg and Garrett's findings hold true elsewhere? After all, the MET study's observation data relied on an unusually robust approach to teacher evaluation, using highly trained off-site evaluators who rated videotaped lessons. In contrast, districts more often rely on in-person observations, typically conducted by principals and assistant principals.

Unfortunately, the findings from more typical on-site evaluations by school administrators may look even worse, according to Whitehurst, Chingos, and Lindquist. These Brookings researchers tackled the observation issue a few years ago and found that outside observers produce more valid observations than school administrators do. Since Steinberg and Garrett's results rest on those more rigorous outside observations, their work could actually understate the problem.

Read more about observations in last month's Teacher Trendline.