The three-year, $50 million Measures of Effective Teaching study, funded by the Bill & Melinda Gates Foundation, found it was difficult to predict how much students would achieve in a school year based on their teacher's years of experience or knowledge of pedagogical technique.
But researchers found they could pick out the best teachers in a school and even predict roughly how much their students would learn if they rated the educators through a formula that put equal weight on student input, test scores and detailed classroom observations by principals and peers.
Taken alone, each of those measures was fairly volatile. Judging teachers primarily by student performance on state tests, for instance, turned out to be highly unreliable, with little consistency from year to year. Judging them chiefly by a principal's observations failed to identify those teachers who could be counted on to boost student proficiency on state math and reading tests.
Combining all three measures into a properly weighted index, however, produced a result "teachers can trust," said Vicki Phillips, a director in the education program at the Gates Foundation.
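The arithmetic behind such a composite is straightforward. The sketch below illustrates the general idea of an equal-weight index; the component names, the 0-to-1 scale, and the exact weights are assumptions for illustration, not the study's actual formula.

```python
# Illustrative equal-weight composite of three teacher-quality measures.
# Assumes each component has already been normalized to a common 0-1 scale;
# names and weights are hypothetical, not the MET study's exact formula.

def composite_rating(student_survey, test_score_gain, observation_score,
                     weights=(1/3, 1/3, 1/3)):
    """Combine three normalized measures into one weighted index."""
    components = (student_survey, test_score_gain, observation_score)
    return sum(w * c for w, c in zip(weights, components))

# Example: 0.8 on student surveys, 0.6 on test-score gains,
# 0.7 on classroom observations, weighted equally.
print(round(composite_rating(0.8, 0.6, 0.7), 3))  # → 0.7
```

Because each measure is noisy on its own, averaging them tends to cancel out some of that noise, which is why the combined index proved more stable than any single component.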
The study comes at a time of bitter political wrangling over teacher evaluations in cities including New York, Los Angeles and Chicago, and it provides ammunition for all sides.
Education reformers who have been pressing to dismantle tenure systems, which protect veteran teachers from layoffs, could take heart in the finding that seniority doesn't predict success in the classroom.
Yet the report also bolstered union leaders who have argued that teacher evaluations should not be tied so heavily to trendy "value-added measures," or VAM: complex algorithms that aim to gauge whether students do better or worse than expected on state tests after several months in a given teacher's classroom.
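In essence, a value-added measure compares each student's actual score with the score predicted from prior performance, then credits (or debits) the teacher with the average gap. The sketch below shows that core idea using a single prior-year test score and a district-wide least-squares fit; real VAM models are far more elaborate (multiple years of data, student demographics, statistical shrinkage), so this is a simplified illustration, not any state's actual algorithm.

```python
# Simplified value-added sketch: fit a district-wide expectation of
# current scores from prior-year scores, then average each teacher's
# residuals. Real VAM models are much more complex; this is illustrative.
import statistics

def fit_line(xs, ys):
    """Least-squares fit of ys ~ xs; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.
    Returns {teacher: mean residual vs. the district-wide expectation}."""
    priors = [p for _, p, _ in records]
    currents = [c for _, _, c in records]
    a, b = fit_line(priors, currents)  # expectation fit on ALL students
    by_teacher = {}
    for teacher, p, c in records:
        by_teacher.setdefault(teacher, []).append(c - (a + b * p))
    return {t: statistics.fmean(r) for t, r in by_teacher.items()}

# Teacher A's students beat the district expectation; B's fall short.
vam = value_added([("A", 50, 55), ("A", 60, 65),
                   ("B", 50, 45), ("B", 60, 55)])
print(vam)  # A's mean residual is positive, B's is negative
```

The key point the unions raised is visible even here: the residuals depend on a statistical fit, so with small classes a teacher's score can swing substantially from year to year.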
The Obama administration has pushed states to give heavy weight to quantitative measures such as test scores in designing teacher evaluations. More than a dozen states have moved in that direction, in some cases making it impossible for a teacher to earn a good review if her VAM score is low, no matter how well she performs on other measures. States including Florida, Louisiana, Colorado, Michigan and Ohio have been particularly aggressive in tying teacher ratings to test scores.
The Gates study concluded that student performance should ideally make up one-third to one-half of a teacher's evaluation.
STUDENT SURVEYS HELP PREDICT LEARNING
"This should be a very big red flag to all those policy makers who think they can have test-based accountability be half or more of a teacher's evaluation," said Randi Weingarten, the president of the American Federation of Teachers.
In fact, some states that relied heavily on those value-added measures are already rethinking that reliance.
Louisiana is poised to announce a dramatic overhaul of its evaluation system, which took effect just last July and was hailed by education reformers as pioneering for its reliance on student test scores to rate teachers.
The revised system, which Superintendent of Education John White will unveil this week, still uses student test scores but in a far more nuanced way.
Teachers who score in the bottom 10 percent on the value-added metrics will be automatically deemed ineffective and can be fired, White said. Those in the top 20 percent will be deemed highly effective. Those who score in the middle, however, won't be pitted against one another in a ranking of best to worst. Instead, their principals will be urged to use other measures of quality, including watching the teacher at work and evaluating student progress toward classroom goals, White said.
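The tier structure White describes can be sketched as a simple percentile cut. The 10 and 20 percent thresholds come from the article; the ranking logic and label names below are assumptions for illustration.

```python
# Illustrative sketch of the revised Louisiana tiers: bottom 10 percent
# of value-added scores labeled ineffective, top 20 percent highly
# effective, and the middle deferred to other measures by the principal.
# Thresholds are from the article; the mechanics are an assumption.

def label_teachers(vam_scores):
    """vam_scores: {teacher: score}. Returns {teacher: tier label}."""
    ranked = sorted(vam_scores, key=vam_scores.get)  # lowest score first
    n = len(ranked)
    cutoff_low = max(1, round(n * 0.10))    # bottom 10 percent
    cutoff_high = n - max(1, round(n * 0.20))  # top 20 percent
    labels = {}
    for i, teacher in enumerate(ranked):
        if i < cutoff_low:
            labels[teacher] = "ineffective"
        elif i >= cutoff_high:
            labels[teacher] = "highly effective"
        else:
            labels[teacher] = "principal review"  # other measures apply
    return labels
```

Note that only the tails are ranked against one another; the broad middle is deliberately left to the principal's judgment rather than a best-to-worst ordering.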
"The system that had been put in place in Louisiana assumed ... greater statistical precision" for the value-added measures than they could realistically deliver, White said, adding that he developed the new plan after consulting with researchers from the Gates Foundation and elsewhere.
The Washington, D.C., public school system also recently revamped its teacher evaluation formula so that value-added measures based on student test scores now account for 35 percent of a teacher's rating rather than 50 percent.
Weighting the test scores so heavily caused undue anxiety for teachers, said Jason Kamras, chief of human capital for the D.C. public schools. The district also feared it led to an unfortunate narrowing of the curriculum, as some elementary school teachers focused intently on math and reading, which are tested frequently, while spending less time on social studies and science, which are not, Kamras said.
The Gates report was notable as well for its emphasis on student evaluations of teachers. Researchers found that ratings were most reliable when student surveys made up as much as a third of a teacher's overall score.
Researchers used a survey known as Tripod, developed by a Harvard researcher and a British consulting company, Cambridge Education. Children as young as five are asked to respond to statements such as "This class is a happy place for me to be," or "In this class, we learn to fix our mistakes." Older children answer questions such as whether the teacher has firm control over the class and whether she explains new concepts clearly.
Few districts use student surveys as part of their formal teacher evaluations. Any effort to change that could stir up opposition from teachers, who fear putting their jobs in the hands of sometimes immature students.
The Gates study examined 3,000 teachers in several cities, including Dallas, Denver, New York and Charlotte, N.C. The first year, teachers were evaluated by multiple measures, including student test scores and classroom observations. The researchers then randomly assigned students to each participating teacher. The following year they checked to see whether teachers rated as highly effective did indeed produce better results for students - not only on the state standardized tests but also on other measures, including open-ended math and reading assessments requiring sophisticated critical thinking.
Sure enough, they said their predictions about which teachers would produce the best results proved correct.
The research relied on teachers who volunteered to have their work scrutinized, so they may not have been a representative sample. And since researchers could not randomly assign students to a classroom across town, they were only able to study the relative strengths of teachers within a given school.
Despite those limitations, Thomas Kane, a professor of education and economics at Harvard University and a lead researcher on the Gates team, said the study achieved a landmark goal: "We identified groups of teachers who caused students to learn more."
(Reporting by Stephanie Simon; editing by Lee Aitken and Prudence Crowther)