# The TIMSS Videotape Classroom Study

Stigler, James W.; Gonzales, Patrick; Kawanaka, Takako; Knoll, Steffen; Serrano, Ana (ERIC document ED 431 621; SE 062 649)

### Public and Private Talk

Our first step in coding discourse was to make a distinction between public and private talk. Public talk was defined as talk intended for everyone to hear; private talk was talk intended only for the teacher or an individual student. When the teacher stopped at an individual student's desk to comment on that student's work, this was generally coded as private talk, regardless of whether others could hear what the teacher was saying. The important point is that the talk was intended primarily for this individual student alone. On the other hand, if a teacher stopped in the middle of a classwork period to criticize the behavior of a student sitting in the back of the room, this was coded as public talk. In this case, everyone had to stop and listen, even those who were not being disciplined.

All further coding of discourse was done on public talk only. Because public talk was accessible to everyone, we assumed that it would provide the most valid representation of the discourse environment experienced by students in the classroom.

### First-Pass Coding and the Sampling Study

Next, we divided all transcripts into utterances, the smallest unit of analysis used for describing discourse. An utterance was defined as a sentence or phrase that serves a single goal or function.

Generally, utterances are small and most often correspond to a single turn in a classroom conversation.

Utterances were then coded into 12 mutually exclusive categories. Six of the categories were used to code teacher utterances: Elicitation, Direction, Information, Uptake, Teacher Response, and Provide Answer.

Five categories were applied to student utterances: Response, Student Elicitation, Student Information, Student Direction, and Student Uptake. One category, Other, could be applied to both teacher and student utterances. Elicitations were further subdivided into five mutually exclusive categories: Content, Metacognitive, Interactional, Evaluation, and Other. And Content Elicitations were subcategorized as well.
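One way to make the mutual exclusivity of this coding scheme concrete is to represent the categories as enumerations. The sketch below is purely illustrative (the names and layout are our assumption, not the study's actual coding instrument), but it captures the six teacher categories, five student categories, and the shared Other category:

```python
from enum import Enum, unique

@unique
class TeacherUtterance(Enum):
    # Six mutually exclusive teacher categories from the first-pass scheme
    ELICITATION = "elicitation"
    DIRECTION = "direction"
    INFORMATION = "information"
    UPTAKE = "uptake"
    TEACHER_RESPONSE = "teacher_response"
    PROVIDE_ANSWER = "provide_answer"

@unique
class StudentUtterance(Enum):
    # Five mutually exclusive student categories
    RESPONSE = "response"
    STUDENT_ELICITATION = "student_elicitation"
    STUDENT_INFORMATION = "student_information"
    STUDENT_DIRECTION = "student_direction"
    STUDENT_UPTAKE = "student_uptake"

# "Other" applied to both teacher and student utterances,
# so it sits outside the two speaker-specific enumerations.
OTHER = "other"
```

The `@unique` decorator enforces that no two category labels collide, mirroring the requirement that each utterance receive exactly one code.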

Definitions of each of these categories will be presented later, together with the results.

Although all lessons were coded with the first-pass categories in the lesson transcripts, we decided to enter only a sample of the codes into the computer for preliminary analysis.

Thirty codes were sampled from each lesson according to the following procedure. First, three time points were randomly selected from each lesson. Starting with the last time point sampled, we found the first code in the transcript to occur after the sampled time. From this point, we took the first 10 consecutive codes, excluding Other, that occurred during public talk. If private talk was encountered before 10 codes were found, we continued to sample after the period of private talk. If the end of the lesson was encountered before 10 codes were found, we sampled upward from the time point until 10 codes were found. The same procedure was repeated for the second and first of the three time points.

In those cases, if working down in the lesson led us to overlap with codes sampled from a later time point, we reversed and sampled upward from the selected time point.
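The sampling procedure above can be sketched in code. This is a minimal interpretation under assumed data structures (a time-ordered list of code records with `time`, `category`, and `public` fields); the handling of the end-of-lesson and overlap cases follows our reading of the text and is not the study's actual implementation:

```python
import random

def sample_codes(codes, n_points=3, per_point=10, seed=0):
    """Sketch of the first-pass sampling procedure.

    `codes` is a time-ordered list of dicts with keys 'time' (seconds),
    'category' (str), and 'public' (bool). Three time points are drawn at
    random; from each, the first `per_point` public, non-Other codes are
    taken, working from the last point back to the first.
    """
    rng = random.Random(seed)
    lesson_end = codes[-1]["time"]
    points = sorted(rng.uniform(0, lesson_end) for _ in range(n_points))
    sampled, taken = [], set()
    for t in reversed(points):  # last time point first, per the procedure
        # find the first code occurring after the sampled time
        start = next((i for i, c in enumerate(codes) if c["time"] >= t),
                     len(codes))
        picked = _take(codes, start, per_point, step=+1, taken=taken)
        if len(picked) < per_point:
            # end of lesson (or overlap with a later sample) reached:
            # reverse and sample upward from the time point instead
            picked = _take(codes, start - 1, per_point, step=-1, taken=taken)
        sampled.extend(picked)
        taken.update(picked)
    return [codes[i] for i in sampled]

def _take(codes, start, n, step, taken):
    out, i = [], start
    while 0 <= i < len(codes) and len(out) < n:
        c = codes[i]
        # skip private talk, 'Other' codes, and already-sampled codes
        if c["public"] and c["category"] != "Other" and i not in taken:
            out.append(i)
        i += step
    return out
```

Running this on a synthetic lesson yields at most 30 distinct public codes per lesson, matching the sampling design.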

Two kinds of summary variables were used for the sampling study: (1) Average number of codes (out of 30) in each lesson that were of each category, and (2) Percentage of lessons that contained any codes of each category (within the 30 codes sampled).
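Both summary variables are straightforward to compute from the sampled codes. In this sketch, `lessons` is assumed to be a list of per-lesson code samples (each a list of up to 30 category labels):

```python
def summarize(lessons, category):
    """Compute the two sampling-study summary variables for one category.

    lessons: list of per-lesson code samples (lists of category labels).
    Returns (average count of `category` per lesson,
             percentage of lessons containing any code of `category`).
    """
    counts = [codes.count(category) for codes in lessons]
    avg_count = sum(counts) / len(counts)                          # variable (1)
    pct_lessons = 100 * sum(c > 0 for c in counts) / len(counts)   # variable (2)
    return avg_count, pct_lessons
```

For example, if one lesson's sample contains 5 Elicitation codes and a second contains none, variable (1) is 2.5 and variable (2) is 50 percent.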

### Second-Pass Coding of Discourse

For second-pass coding of discourse we decided to work with a subsample of lessons. We chose to study the 30 lessons in each country that had been selected for analysis by the Math Content Group, in part because they were balanced in their representation of algebra and geometry. Before proceeding, however, we wanted to know how well the subsample of 30 lessons in each country represented the larger sample, specifically with regard to discourse. To answer this question we compared the subsample of 30 lessons in the Math Content Group sample with the rest of the lessons in each country on each of the discourse variables produced in the first-pass sampling study (presented earlier).

Each variable was analyzed using a two-way analysis of variance (ANOVA), with country and sample group as factors. On only one analysis did we find a significant effect of sample group. However, neither for this variable nor for any of the others did we find a significant Country x Sample interaction.

Several new codes were added for second-pass coding. Content Elicitations, Information statements, and Directions were further subdivided. In addition, we started the process of grouping utterances into higher-level categories we call Elicitation-Response sequences (ER sequences). Elicitation-Response sequences appear to be the next-level building block for classroom conversations. A more detailed definition of all of these categories will be presented later with the results, but for now it is useful to define ER sequences as a sequence of turns exchanged between the teacher and student(s) that begins with an initial elicitation and ends with a final uptake.
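The working definition of an ER sequence (opens with an elicitation, closes with a final uptake) lends itself to a simple grouping pass over the coded transcript. The representation below, a list of `(speaker, category)` tuples, is our assumption for illustration, not the study's actual data format:

```python
def er_sequences(utterances):
    """Group coded public-talk utterances into Elicitation-Response sequences.

    A sequence opens at a teacher Elicitation and closes at the next Uptake,
    per the working definition; utterances outside any open sequence are
    skipped. This is a simplified sketch of the grouping logic.
    """
    sequences, current = [], None
    for speaker, category in utterances:
        if speaker == "teacher" and category == "Elicitation" and current is None:
            current = [(speaker, category)]   # open a new ER sequence
        elif current is not None:
            current.append((speaker, category))
            if category == "Uptake":          # final uptake closes the sequence
                sequences.append(current)
                current = None
    return sequences
```

A transcript with two elicitation-response-uptake exchanges, separated by an Information statement, thus yields two ER sequences.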

## STATISTICAL ANALYSES

Most of the analyses presented in this preliminary report are simple comparisons of either means or distributions across the three countries. In all cases, the lesson was the unit of analysis. All analyses were done in two stages. First, means or distributions were compared across the three countries using either one-way ANOVA or Pearson chi-square procedures. Dichotomously coded variables were usually analyzed using ANOVA, relying on asymptotic approximations. Next, if overall analyses were significant, pairwise contrasts were computed and significance was determined with the Bonferroni adjustment. In all cases, the Bonferroni adjustment was made assuming three simultaneous tests (i.e., Germany vs. Japan, Germany vs. United States, and Japan vs. United States). In the case of dichotomous variables (for which the sample estimate is a proportion) and continuous variables, we computed Student's t on each pairwise contrast. Student's t was computed as the difference between the two sample means divided by the standard error of the difference. Determination that a pairwise contrast was statistically significant with p < .05 was made by consulting the Bonferroni t tables published by Bailey (1977).
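The pairwise contrast described above can be sketched as follows. The report consulted published Bonferroni t tables and used survey-adjusted standard errors; this simplified version instead pools independent-sample standard errors and shows the per-test alpha that a Bonferroni adjustment over three contrasts implies:

```python
from math import sqrt
from statistics import mean, stdev

def pairwise_t(x, y):
    """Student's t for one pairwise contrast: the difference between the two
    sample means divided by the standard error of the difference. A sketch
    assuming independent samples; the report used survey-based SEs."""
    se_diff = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se_diff

# Bonferroni adjustment assuming three simultaneous tests
# (Germany vs. Japan, Germany vs. U.S., Japan vs. U.S.):
alpha_per_test = 0.05 / 3
```

Each of the three country contrasts would then be judged against the adjusted criterion rather than the nominal .05 level.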

For categorical variables, we followed the procedure suggested by Wickens (1989) and used the Bonferroni chi-square tables printed in that book. Throughout, a significance level criterion of .05 was used. All differences discussed met at least this level of significance, unless otherwise stated. Anytime we use terms such as "less," "more," "greater," "higher," or "lower," for example, the reader can be assured that the comparisons are statistically significant.

All tests were two-tailed. Statistical tests were conducted using unrounded estimates and standard errors, which were also computed for each estimate. Standard errors for estimates shown in figures in the report are provided in the table in appendix E. Standard errors for estimates indicated in the text but not shown in figures are reported in footnotes to the relevant text.

### Weighting

All of the analyses reported here were done using data weighted with survey weights, which were calculated for the classrooms in the videotape study itself, separate from any weights calculated for the main TIMSS assessment. The weights were developed for each country, so that estimates are unbiased estimates of national means and distributions. The weight for each classroom reflects the overall probability of selection for the classroom, with appropriate adjustments for nonresponse.
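A survey-weighted estimate of a national mean is simply a weighted average in which each classroom contributes in proportion to its weight (the inverse of its selection probability, adjusted for nonresponse):

```python
def weighted_mean(values, weights):
    """Survey-weighted mean: each classroom's value contributes in
    proportion to its survey weight (inverse selection probability,
    adjusted for nonresponse)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

For instance, two classrooms with values 10 and 20 and weights 1 and 3 yield a weighted mean of 17.5 rather than the unweighted 15.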

The analyses also used procedures that accounted for the complex nature of the sample design within each country (with the samples being independent across countries). The jackknife procedure was used, via the WesVarPC software, to account for the fact that a stratified random sample of schools was selected, with one classroom chosen from each selected school. F-tests for the comparison of means across three countries were achieved through the use of linear regression, with dummy variables indicating country as the independent variables. Pairwise t-tests were computed using the FUNCTION capability of the TABLE statement. Chi-square tests were also computed using the TABLE statement, with first-order Rao-Scott corrections to account for the complex sample design.
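The core idea of the jackknife can be illustrated with a simplified delete-one sketch. WesVarPC implements stratified replicate weights matched to the actual sample design, which this version omits; it shows only the basic mechanism of recomputing the estimate with each unit left out and measuring the spread of the replicates:

```python
def weighted_mean(values, weights):
    # survey-weighted mean (weights reflect selection probabilities)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def jackknife_se(values, weights):
    """Delete-one jackknife standard error of a weighted mean.

    Simplified sketch: the actual study used stratified jackknife
    replication via WesVarPC, not this plain delete-one form.
    """
    n = len(values)
    theta = weighted_mean(values, weights)  # full-sample estimate
    replicates = [
        weighted_mean(values[:i] + values[i + 1:], weights[:i] + weights[i + 1:])
        for i in range(n)
    ]
    # delete-one jackknife variance formula
    var = (n - 1) / n * sum((r - theta) ** 2 for r in replicates)
    return var ** 0.5
```

With equal weights, this reproduces the familiar standard error of the mean, s/sqrt(n), which is a useful sanity check on the implementation.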

## COMPARISON OF VIDEO SUBSAMPLES WITH MAIN TIMSS SAMPLES

Despite the exhaustive attempts to select the video subsample randomly from the TIMSS main study sample, it may still be asked: Are the classrooms selected for the video study representative of the larger TIMSS sample? Some information relevant to this question can be found by comparing the mathematics achievement scores (i.e., performance on the TIMSS student assessments) of classrooms in the main TIMSS samples with the subsample of classrooms selected for the video study. We did not have test data for all of the classrooms included in the video study, and the data from Japan were somewhat problematic in one respect: test data were collected not on the classrooms included in the video study but on other eighth-grade classrooms in the same schools as the video classrooms. Nevertheless, we did have enough data to warrant a meaningful comparison of the two samples, and the lack of any tracking in Japan gives us some confidence that the school-level estimate of performance in Japan would be a reasonable indicator of classroom-level performance.

Distributions of mean achievement scores for classrooms in the main TIMSS samples and in the video subsamples for each country are presented in figure 5. It is apparent in the figure that the distributions of mathematics achievement scores among the video subsamples are representative of the distributions in the main TIMSS samples. These distributions are based on unweighted average achievement scores for each classroom. Our purpose is simply to compare distributions across pairs of samples within countries, not to make any inferences about true population distributions.

NOTE: SD = standard deviation.

SOURCE: U.S. Department of Education, National Center for Education Statistics, Third International Mathematics and Science Study, Videotape Classroom Study, 1994-95.


## VALIDITY OF THE VIDEO OBSERVATIONS

As mentioned earlier, one of our concerns was that the presence of the video camera might in some way alter the nature of classroom instruction and thus threaten the validity of the study. One step we took to lessen the chances of this happening was to give all teachers a standard set of instructions in which we informed them of the goals of the study. We wanted to be certain that teachers understood that we wanted to film a typical day in their classroom, not one that was prepared especially for us. We also attempted to assess how successful we were in sampling what typically happens in these classrooms by asking teachers, after the videotaping, to evaluate the typicality of what we would find on the videotape. We report those results here.

Our first concern was that we might get a special lesson, one that the teacher holds in reserve for demonstration purposes. To ascertain whether or not this happened, we asked several questions on our questionnaire about how this particular lesson was chosen, and about how it related to the previous and next lessons that the teacher had taught or would teach to this same class of students. We asked, for example, whether the lesson we videotaped was a stand-alone lesson or part of a sequence of lessons. If the lesson was part of a sequence, we asked them to describe the goals and activities of the adjoining lessons so that we could judge the relationship they had to the lesson on videotape. Our reasoning was that special lessons would show up as stand-alone lessons that were unrelated to the adjoining lessons.

As it turned out, almost all of the teachers in our sample (97.8 percent in Germany, 95.7 percent in Japan, and 93.4 percent in the United States) reported that the videotaped lesson was part of a sequence of lessons designed to teach a particular topic in the mathematics curriculum. Further, they were able to give clear and reasonable descriptions of how this lesson related to the previous and next lessons in the sequence. This outcome confirmed our sense that teachers did not make drastic accommodations to prepare for our videographer.

We also asked teachers how many lessons were in the whole sequence of lessons, and where the lesson we videotaped fell in the sequence. The average sequence of lessons reported by U.S. teachers was 9.1 lessons, significantly shorter than the 13.6 and 14.4 lessons reported by German and Japanese teachers, respectively. However, the position of the videotaped lesson in the sequence did not differ across countries.

A more subtle picture of how the presence of the camera might have affected instruction emerges when we look at some of the other judgments teachers made about the lesson. First, it is interesting to see how nervous or tense the teachers felt about being videotaped. Teachers were asked to check one of four choices: very nervous, somewhat nervous, not very nervous, and not at all nervous. Japanese teachers reported being more nervous than both German and U.S. teachers about the presence of our videographer. Figure 6 shows that about three-fifths of U.S. teachers (62.1 percent), almost one-half of German teachers (48.9 percent), and about one-fifth of Japanese teachers (21.6 percent) reported being "not at all nervous" or "not very nervous."

SOURCE: U.S. Department of Education, National Center for Education Statistics, Third International Mathematics and Science Study, Videotape Classroom Study, 1994-95.

A number of questions were designed to get teachers' evaluations of how typical the lesson on the videotape was of the lessons they normally teach to this class of students. In figure 7 we present teachers' judgments of the quality of the video lesson compared with their usual lesson. Again, we see that Japanese teachers appear to differ from their German colleagues. Twenty-seven percent of Japanese teachers and 18.6 percent of German teachers reported feeling that the lesson on tape was worse than usual.