Want to evaluate teachers? Look at teaching.

October, 2012

Dr. Eugene W. Muller
National Measurement and Testing, Inc.

Imagine for a moment that you are a recently hired police officer, assigned to a local neighborhood or “beat”. Further imagine that you are told by your superiors that your job performance will be assessed by the number of reported crimes on your beat over a certain length of time. Would you feel that your performance was being evaluated fairly?

Or suppose you were a physician working for a clinic or hospital, and you were told that your job performance was to be appraised by how many of your patients survived and for how long, regardless of age or severity of illness. Once again, would you regard this as a fair and valid assessment of your performance?

Your answer would almost certainly be ‘no’, for in both situations you would be judged according to factors well outside of your control. A police officer has no control over the socioeconomic conditions and social problems that exist within the households and neighborhood he/she is trying to protect. It is well known, for instance, that crime rates in struggling inner-city neighborhoods tend to be substantially higher than crime rates in wealthier suburban towns. To ignore these social conditions and hold the police officer accountable for all violations of the law would be erroneous. In a similar vein, a physician cannot control the ages and medical conditions of the patients who come to a clinic or hospital to seek treatment, nor a life-threatening epidemic that suddenly emerges within a community. The physician can only deal in the best way possible with the cases that present themselves for intervention, and prescribe the appropriate tests and medications needed to treat those cases. Beyond this, the physician has no control over how well patients actually follow medical advice, whether or not they take prescribed medications, refrain from prohibited activities, and so forth.

These examples might seem rather obvious to most people, yet oddly enough, this very point is lost on those who would evaluate school teachers by measuring student performance. In recent years, there has been considerable support, from both sides of the political spectrum, for constructing teacher evaluation systems based on student achievement test performance. These systems, sometimes referred to as value-added models (VAMs), measure score changes of a teacher’s students over time (say, one year), and use these changes as an indicator of the teacher’s effectiveness (see Darling-Hammond et al., 2012).
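To make the basic idea concrete, the sketch below computes an average one-year score change for a teacher's students. It is a deliberately simplified illustration and not an actual value-added model, which would be a far more elaborate statistical model. The function name, student IDs, and scores are all hypothetical.

```python
from statistics import mean


def mean_gain(prior_scores, current_scores):
    """Average score change for students tested in both years.

    Both arguments map a student ID to a test score. Students missing
    from either year are left out of the average.
    """
    shared = prior_scores.keys() & current_scores.keys()
    if not shared:
        raise ValueError("no students with scores in both years")
    return mean(current_scores[s] - prior_scores[s] for s in shared)


# Hypothetical scores for one teacher's class (invented for illustration).
prior = {"s1": 200, "s2": 215, "s3": 190}
current = {"s1": 210, "s2": 220, "s3": 205}

gain = mean_gain(prior, current)
print(gain)
```

In this simplified form, the only inputs are the students' scores; the single number that comes out is then attributed to the teacher.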

In New Jersey, for example, the Department of Education has been considering a system that includes student growth percentiles (SGP) on state assessments in the evaluation of teacher performance. This plan uses grades to account for up to 45% of the teacher’s rating. In Pennsylvania, legislators have sponsored a bill to include student performance on a wide range of measures for 50% of teacher and principal ratings. These measures would include graduation rates and attendance, as well as test scores. In New York, state lawmakers approved a bill to have student test scores account for up to 40% of teachers’ annual evaluations. In Los Angeles, the Unified School District was ordered by the County Superior Court to use student performance on standardized achievement tests in the evaluation of teachers. In stating his objections to the policy, the president of the local teachers union noted that such an evaluation system will create incentives to narrow the curriculum and “teach to the test”, rather than focus on good instruction. As of this writing, up to 19 states and the District of Columbia are using, or are planning to use, student achievement as a factor in teacher appraisal. In 2009 alone, over $4 billion in federal grants were given to states in support of teacher evaluation systems based on student test performance.

The approach of using student test scores to evaluate a teacher’s performance is based on the belief that student gains are a reflection of teacher effectiveness. That belief rests on a number of faulty assumptions, including the notion that it is the teacher alone who influences a student’s test performance. Educational and psychological research simply does not bear out this idea. There are many factors that influence student achievement test performance, including class size, the choice of curricula and study materials, instructional time, peer culture and achievement, prior teachers and education, and home environment and community support. For instance, numerous studies have shown that students from high socioeconomic status (SES) backgrounds tend to demonstrate higher levels of academic achievement than do students from lower SES backgrounds (Alwin and Thornton, 1984). In a study of 14-year-olds sampled from 17 different countries, it was found that student socioeconomic status had a very strong influence on science test performance, to the point where the researchers concluded that the home, and not the classroom, had the most powerful influence on academic achievement (Postlethwaite and Wiley, 1992). They noted that there is far more variation between homes than between schools, which would explain much of the difference between students. It was also found that students who viewed the subject matter of science as being important and useful performed better on science achievement tests than those who did not. Even merely liking school was shown to have a real and positive effect on student performance.

Studies of this nature demonstrate the spuriousness, if not outright absurdity, of evaluating teachers according to student test performance. Evaluating a teacher on the basis of student achievement tests is akin to rating a chief executive of a Fortune 100 company on the basis of the state of the national economy: Both have SOME influence over the outcomes, but most definitely not complete control. A good teacher is one who demonstrates good pedagogical skills, and a poor teacher is one who does not. We cannot validly evaluate a teacher on the skill levels and characteristics of the students he/she is assigned to teach, as there are too many other factors that can influence student test performance, irrespective of the efforts of the teacher. Rather, we should focus our efforts on how well the teacher demonstrates effective teaching skills.

So if we are to measure what the teacher does - and only what the teacher does - when evaluating teachers, the question becomes that of finding the most effective way of conducting such a measure. Current instruments used to evaluate teacher classroom performance typically employ “rating scales” that present the observer with a list of traits or characteristics on which the teacher is to be evaluated. These characteristics are usually rated along a range of numbers, adjectives, or descriptions representing different levels or degrees of performance (e.g., “superior”, “very good”, “fair”, “poor”, and “satisfactory”). Rating scales of this type are prone to numerous sources of distortion and error, often leading to invalid and inaccurate appraisals. Observers using such scales may not have sufficient time to evaluate the observed teacher on all traits listed, leading to superficial judgments. Some of the traits or characteristics may not be clearly defined, and could mean different things to different raters. Scales of this nature are often vague and ambiguous as to the distinctions between various levels of performance. Some are unclear in terms of how total scores are to be determined, and are subject to the manipulations of an observer who has decided beforehand to give a teacher a certain overall rating without considering all of the elements of performance. And many such scales are prone to the rater’s tendency to disregard specific aspects of performance in favor of general impressions, a problem referred to as “halo error”.

The Teacher Observation Scale (TOS)™ is designed to address these problems. It is an instrument for the objective analysis and appraisal of teacher classroom performance, and its format represents a major departure from common methods of teacher evaluation. The TOS is a behavioral checklist consisting of a variety of statements describing typical classroom activities, such as:

  • The teacher's questions required the students to apply the concepts of the lesson.
  • The teacher assisted the students in formulating the general concept by analyzing specific data.
  • The teacher failed to use a motivation to start the lesson.

The observer completing the TOS indicates which of these activities are exhibited by the teacher giving the lesson and which are not, without any superficial ratings of degree. This information is then analyzed against predetermined ratings made by a panel of experts to produce an objective and fair appraisal of teacher performance. In this manner, the Teacher Observation Scale (TOS)™ minimizes measurement error by avoiding vague trait descriptions and ambiguous rating values.

A critical aspect of the TOS is that the observer is not informed of the scale values for the items used in the determination of the teacher’s score. The observer only reports on what he/she sees the teacher under observation do or fail to do; the observer does not calculate any scores. Instead, the completed scale forms are scored according to the pre-determined ratings for each item. In this manner, the person conducting the observation cannot manipulate or distort an observed teacher’s score on the TOS. Any potential rater bias in TOS scoring is greatly reduced.
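The separation between observing and scoring described above can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the actual TOS scoring procedure: the item identifiers, the numeric scale values, and the rule of summing the values of observed items are all hypothetical, since the article does not publish the expert panel's ratings or how they are combined.

```python
# Checklist statements shown to the observer (item IDs are hypothetical).
CHECKLIST = (
    "questions_require_application",
    "guides_concept_from_data",
    "no_motivation_at_start",
)

# Scale values held by the scorer only. These numbers are invented for
# illustration; the actual TOS values come from an expert panel.
SCALE_VALUES = {
    "questions_require_application": 2.0,
    "guides_concept_from_data": 1.5,
    "no_motivation_at_start": -1.0,
}


def record_observation(exhibited):
    """Observer's step: mark each checklist item as seen or not seen."""
    unknown = set(exhibited) - set(CHECKLIST)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    return {item: item in exhibited for item in CHECKLIST}


def score_form(form, scale_values):
    """Scorer's step: add up the scale values of the items marked as seen.

    Summation is an assumed scoring rule, used here only as an example.
    """
    return sum(scale_values[item] for item, seen in form.items() if seen)


form = record_observation(
    {"questions_require_application", "no_motivation_at_start"}
)
total = score_form(form, SCALE_VALUES)
print(form)
print(total)
```

Note that `record_observation` never touches `SCALE_VALUES`: the observer supplies only yes/no marks, and the numeric values enter in a separate step.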

Most importantly, the Teacher Observation Scale (TOS)™ looks exclusively at what the TEACHER does, and not the students. The quality and performance of the students have no bearing on the scores generated by the TOS. Only the actions of the teacher are considered in the determination of the TOS scores.

If the goal is to evaluate teacher performance, then the focus should be on teaching, not on students. The Teacher Observation Scale (TOS)™ offers a promising and exciting new approach to the objective appraisal of teacher classroom performance by measuring teaching, and only teaching. It represents a significant advancement in the assessment and appraisal of class instruction, minimizing the chance of the unfair, ambiguous, subjective, and distorted appraisals that can often occur when using traditional teacher evaluation methods. The Teacher Observation Scale (TOS)™ provides a concrete assessment of teacher classroom performance independent of the personal biases or idiosyncrasies of the observer, and independent of the academic performance of students. Further information on the TOS can be obtained by contacting National Measurement and Testing at www.nmetest.com.


Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E., and Rothstein, J. (2012) Evaluating teacher evaluation. Phi Delta Kappan, 93 (6), 8-15.

Postlethwaite, T.N. and Wiley, D.E. (1992) The IEA Study of Science II: science achievement in twenty-three countries. Oxford: Pergamon Press.

White, K.R. (1982) The relation between socioeconomic status and academic achievement. Psychological Bulletin, 91 (3), 461-481.