Minimizing Bias Due to Observer Effects

This study used three techniques for minimizing observer bias. First, instructions were standardized across teachers. The goal of the research was clearly communicated to the teacher in carefully written, standard instructions. Teachers were told that the goal was to videotape a typical lesson, with "typical" defined as whatever they would have been doing had the videographer not shown up. Teachers were also explicitly asked to prepare for the target lesson just as they would for a typical lesson. (A copy of information given to teachers prior to the study is included as appendix A.)

Second, this study attempted to assess the degree to which bias occurred. After the videotaping, teachers were asked to fill out a questionnaire in which they rated, for example, the typicality of what we would see on the videotape, and described in writing any aspect of the lesson they felt was not typical.

We also asked teachers whether the videotaped lesson was a stand-alone lesson or part of a sequence of lessons, and to describe what they had done in the previous day's lesson and what they planned to do in the next day's lesson.

Lessons described as stand-alone and as having little relation to the lessons on adjoining days would be suspected of being special lessons constructed for the purpose of the videotaping. In this study, however, lessons were rarely described in this way.

Finally, one must use common sense in deciding which kinds of indicators may be susceptible to bias, and in taking this into account when interpreting the results of a study. It seems likely, for example, that students will try to be on their best behavior with a videographer present, and so we may not get a valid measure from video of the frequency with which teachers must discipline students. On the other hand, it is probably less likely that teachers use a different style of questioning while being videotaped than they would when the camera is not present. Some behaviors, such as the routines of classroom discourse, are so highly socialized as to be automatic and thus difficult to change.

Sampling and Validity

Observer effects are not the only threat to validity of video survey data. Sampling (of schools, teachers, class periods, lesson topics, and parts of the school year) is a major concern.

One key issue is the number of times any given teacher in the sample should be videotaped. This obviously will depend on the level of analysis to be used. If we need a valid and reliable picture of individual teachers, then we must tape the teacher multiple times, as teachers vary from day to day in the kind of lesson they teach, as well as in the success with which they implement the lesson. If we want a school-level picture, or a national-level picture, then we obviously can tape each teacher fewer times, provided we resist the temptation to view the resulting data as indicating anything reliable about the individual teacher.

On the other hand, taping each teacher once limits the kinds of generalizations we can make about instruction. Teaching involves more than constructing and implementing lessons. It also involves weaving together multiple lessons into units that stretch out over days and weeks. If each teacher is taped once, it is not possible to study the dynamics of teaching over the course of a unit. Inferences about these dynamics cannot necessarily be made, even at the aggregate level, based on one-time observations.

Another sampling issue concerns representativeness of the sample across the school year. This is especially important in cross-national surveys where centralized curricula can lead to high correlations of particular topics with particular months of the year. In Japan, for example, the eighth-grade mathematics curriculum devotes the first half of the school year to algebra, the second half to geometry. Clearly, the curriculum would not be fairly represented by taping in only one of these two parts of the year.

Finally, although at first blush it may seem desirable to sample particular topics in the curriculum in order to make comparisons more valid, in practice this is virtually impossible. Especially across cultures, teachers may define topics so differently that the resulting samples become less rather than more comparable. Randomization appears to be the most practical approach to ensuring the comparability of samples.

Confidentiality

The fact that images of teachers and students appear on the tapes makes it more difficult than usual to protect the confidentiality of study participants when the data set is used for secondary analyses. An important issue, therefore, concerns how procedures can be established to allow continued access to video data by researchers interested in secondary analysis.

One option is to disguise the participants by blurring their faces on the video. This can be accomplished with modern-day digital video editing tools, but it is expensive at present to do this for an entire data set. A more practical approach is to define special access procedures that will enable us to protect the confidentiality of participants while still making the videos available as part of a restricted-use data set.

Logistics

In contrast to traditional surveys, which require intensive and thorough preparation up front, the most daunting part of a video survey lies in the data management and analysis phase. Information entered on questionnaires is more easily transformed into computer-readable format than are video images.

Thus, it is necessary to find a means to index the contents of the hundreds of hours of tape that can be collected in a video survey. Otherwise, the labor involved in analyzing the tapes grows enormously.

Once data are indexed, there is still the problem of coding. Coding of videotapes is well known to be highly labor intensive, but there are strategies available for bringing the task under control. The present study has developed specialized computer software to help in this task. Emerging multimedia computing technologies will, over the next several years, revolutionize the conduct of video surveys, making them far more feasible than they have ever been in the past.


Anecdotes and images are vivid and powerful tools for representing and communicating information.

One picture, it is said, is worth a thousand words. On the other hand, anecdotes can be misleading and even completely unrepresentative of reality. Furthermore, research in cognitive psychology has shown that the human information processing system is easily misled by anecdotes, even in the face of contradictory and far more valid information (e.g., Nisbett and Ross, 1980). Methods of research design and inferential statistics were developed, in fact, specifically to protect us from being misled by anecdotes and experiences (Fisher, 1951).

A video survey, like the one being described here, provides one possible way to resolve this tension between anecdotes and statistics. Recognizing the power of video images, one can harness this power in two ways. First, discoveries made through qualitative analysis of the videos can be validated by statistical analysis of the whole set of videos. For example, while watching a video we might notice some interesting technique used by a Japanese teacher. If we only had one video, it would be hard to know what to make of this observation: Do Japanese teachers really use the technique more than U.S. teachers, or did we just happen to notice one powerful example in the Japanese data? Because we have a large sample of videos, we can turn our observation into a hypothesis that can be validated against the database.

In a complementary process, we might, after coding and quantitative analysis of the video data, discover a statistical relationship in the data. By returning to the actual video, we can find concrete images to attach to our discovery, giving us a means of further analysis and exploration, as well as a set of powerful images that can be used to communicate the statistical discovery we have made. Through this process we can uncover what the statistic means in practice.

Chapter 2. Methods


Our goal was to collect national probability samples of eighth-grade mathematics students in Germany, Japan, and the United States. The final sample consisted of 100 lessons in Germany, 81 in the United States, and 50 in Japan. In addition, five "public use" tapes were collected in each country to serve as examples to help us communicate the results of the study. Finally, a subsample of 30 lessons in each country was selected from the final sample for in-depth analysis by a group of mathematicians and mathematics educators. We review each of these samples in more detail here.

All analyses reported here were done on the full sample of 231 lessons except the following, which used the subsample of 30 lessons in each country selected for analysis by the Math Content Group: (1) Analyses of the Math Content Group; (2) Some analyses of the use of the chalkboard; (3) Analyses of second-pass coding of discourse; and (4) Analyses of explicit linking within and across lessons. We have explicitly noted in the text whenever anything less than the full sample is used in an analysis.

The Main Video Sample

The main video samples were designed to be random subsamples of the TIMSS main study sample, which was selected according to the TIMSS sampling plan in each country. Our plan was to videotape 100 eighth-grade classrooms in Germany and the United States, and 50 in Japan. In the end, these targets were attained in Germany and Japan but not in the United States, where only 81 classrooms agreed to participate. The sample size in Japan was reduced to 50 primarily because collaborators at the Japanese National Institute for Educational Research (NIER) determined that 100 classrooms would create too great a burden for their country. This reduction was further justified by the fact that certain characteristics of Japanese education (e.g., lack of tracking within or across schools, adherence to a national curriculum, and culturally more homogeneous population) led us to expect lower variability between classrooms in Japan.

The main TIMSS study focused on three separate age groups; the video sample was drawn from only one of these age groups, referred to as Population 2. Population 2 was defined as the pair of adjacent grades in each country which contained the largest percentage of 13-year-olds. In all three countries included in the video study, Population 2 was defined as grades seven and eight. NCES specifications for the study required that only eighth-grade classrooms be sampled for videotaping. According to the TIMSS international specifications, sampling in each country was accomplished by selecting schools, then classrooms within schools. Each country was required to sample a minimum of 150 schools and a minimum of one seventh- and one eighth-grade classroom within each school.

The selection of schools for the main TIMSS study followed a somewhat different procedure in each country. In the United States, schools were sampled from within primary sampling units (PSUs), geographically-defined units designed to reduce the costs of data collection. PSUs were stratified according to geographic region, metropolitan versus nonmetropolitan area, and various secondary strata defined by socioeconomic and demographic characteristics, then sampled with the probability of selection proportionate to the population of each PSU. Within each sampled PSU, schools were sampled with the probability of selection proportionate to the estimated number of students in the target grades. In Japan, schools were randomly selected from strata defined by size of community and size of school, with the probability of selection proportionate to the size of the population within each stratum. Germany followed a similar procedure but defined its strata by state and by type of school.
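As a rough illustration of what "probability of selection proportionate to size" means in practice, the following minimal sketch draws units by systematic selection along a cumulative size scale. This is only one common way of implementing such sampling; the text above does not say which mechanism TIMSS used, and the function name, unit names, and sizes are hypothetical.

```python
import random


def pps_systematic_sample(units, n, rng):
    """Draw n units with probability proportional to size.

    `units` is a list of (name, size) pairs. This uses systematic
    selection, which is an assumption for illustration and not
    necessarily the mechanism used in TIMSS.
    """
    total = sum(size for _, size in units)
    step = total / n                      # sampling interval
    start = rng.random() * step           # random start within the first interval
    targets = [start + k * step for k in range(n)]

    selected = []
    cumulative = 0.0
    idx = 0
    for name, size in units:
        cumulative += size
        # A unit is hit once for every target that falls inside its size range,
        # so a unit larger than the interval can be selected more than once.
        while idx < n and targets[idx] < cumulative:
            selected.append(name)
            idx += 1
    return selected


# Hypothetical PSUs with made-up population sizes.
example_units = [("PSU_A", 100), ("PSU_B", 300), ("PSU_C", 50), ("PSU_D", 550)]
example_sample = pps_systematic_sample(example_units, 2, random.Random(0))
```

In this toy example the largest unit holds more than half of the total size, so it is certain to be selected, which is the intended behavior of size-proportionate selection.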

Further details regarding selection of the main TIMSS samples in each country can be obtained elsewhere (Foy, Rust, & Schleicher, 1996). Here, we describe how the subsamples were selected for the video study. Because specific details of sample selection and recruitment varied across the three countries, we describe each country's sample separately. (A discussion of weighted and unweighted response rates for each country can be found in appendix B.)

The U.S. Sample

The U.S. TIMSS sample for Population 2 consisted of a stratified random sample of 220 schools.

Within each school, one seventh- and two eighth-grade classrooms were studied. One-half of these schools were randomly sampled to be part of the video study. Within each school, one eighth-grade classroom was randomly sampled to be videotaped.

Schools were selected for the video study as follows: First, Population 2 TIMSS schools were listed in the order in which they were originally sampled. Using this ordering, pairs of schools were generated.

Within each pair, one of the two schools was randomly sampled (with each school having an equal probability of being sampled). The unsampled school in the pair was reserved as a potential replacement for the sampled school. A total of 109 pairs were assigned, with one school left unpaired because one school of the original Population 2 sample of 220 schools had no eighth grade. The unpaired school was given a one-half chance of being selected and, in the event, was not sampled. The final videotape sample size was therefore 109.
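The pairing procedure described above can be sketched in a few lines. This is an illustrative reconstruction only: the function name and school identifiers are hypothetical, and it assumes that 219 eligible schools (220 minus the one without an eighth grade) were paired in their original sampling order.

```python
import random


def select_video_schools(ordered_schools, rng):
    """Pair schools in sampling order and sample one school per pair.

    Returns the sampled schools, a mapping from each sampled school to
    its reserved replacement, the unpaired school (if any), and whether
    the unpaired school was selected by its one-half chance.
    """
    sampled = []
    replacements = {}
    # Step through the ordered list two schools at a time.
    for i in range(0, len(ordered_schools) - 1, 2):
        first, second = ordered_schools[i], ordered_schools[i + 1]
        chosen = rng.choice([first, second])      # equal probability within the pair
        reserve = second if chosen == first else first
        sampled.append(chosen)
        replacements[chosen] = reserve            # held back in case of refusal

    # With an odd number of schools, the last one has no partner.
    unpaired = ordered_schools[-1] if len(ordered_schools) % 2 else None
    unpaired_selected = unpaired is not None and rng.random() < 0.5
    return sampled, replacements, unpaired, unpaired_selected


# Hypothetical identifiers standing in for the 219 eligible schools.
schools = [f"school_{i:03d}" for i in range(219)]
sampled, replacements, unpaired, unpaired_selected = select_video_schools(
    schools, random.Random(1)
)
```

Keeping the replacement mapping alongside the sample mirrors the design choice in the text: a refusal is handled by substituting the paired school rather than by drawing a fresh school.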

Within each sampled school, one eighth-grade classroom was selected with equal probability from the two TIMSS eighth-grade classrooms in the school. There was no sorting or stratification of classrooms by level of mathematics taught. In the event that the sampled teacher refused to be videotaped, the classroom was never replaced by the other eighth-grade classroom in the same school. Instead, the entire school was replaced by its paired school.

Of the original 109 schools sampled, 100 were public and 9 were private. Forty schools, including one private school, refused to participate. The paired schools for 13 of these refusals were contacted, and 12 agreed to participate in the video study. Thus, the final video sample in the United States consisted of 73 public and 8 private schools. The high refusal rate among originally sampled U.S. classrooms should be kept in mind as a potential source of sampling bias.
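The counts in this paragraph can be checked with simple arithmetic. The sketch below uses only the figures reported above; the one inference, labeled in the comments, is that all 12 replacement schools were public, which follows from the reported final counts of 73 public and 8 private schools.

```python
# Figures reported in the text.
original = {"public": 100, "private": 9}      # 109 schools originally sampled
refusals = {"public": 39, "private": 1}       # 40 refusals, one of them private

# Inference (not stated directly in the text): all 12 replacement schools
# that agreed to participate were public, since the private count is 9 - 1 = 8.
replacements = {"public": 12, "private": 0}

final = {
    kind: original[kind] - refusals[kind] + replacements[kind]
    for kind in original
}
final_total = sum(final.values())

assert final == {"public": 73, "private": 8}
assert final_total == 81
```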

Each teacher who participated in the study was awarded a $300 grant, its use to be determined "jointly by the teacher and the principal."

