DR. FOX ROCKS: STUDENT

PERCEPTIONS OF EXCELLENT AND

POOR COLLEGE TEACHING

Working Draft

Charles D. Dziuban, Ph.D.[1]

Morgan C. Wang, Ph. D.[2]

Ida J. Cook, Ph.D.[3]

Contact Dr. Charles Dziuban: Research Initiative for Teaching Effectiveness

Library, Room 107

P.O. Box 162810

Orlando, FL 32816-2810

Office: 407-823-5478

Fax: 402-823-6580

Not to be reproduced, cited, or distributed without the authors’ permission

Abstract

The authors explore relationships among the items of a student rating instrument that purports to measure teaching effectiveness at a large metropolitan university. They use the overall rating of the instructor, the item that shared the most variance with the other items, as the dependent variable in a classification and regression tree (CART) analysis. The developmental model (424,498 observations) used course level, college, year, and the remaining items on the instrument as predictors. The results produce six if-then decision rules for predicting “Excellent” and “Poor” overall student ratings of instructors. The rules center on the instructor’s ability to facilitate learning, communicate ideas and information, organize his or her course, assess student progress, demonstrate interest in student learning, and show respect and concern for students. The performance assessment phase of the analysis (164,077 observations) demonstrates that the rules equalize instructor ratings across colleges, showing consistency in students’ perceptions of “Excellent” and “Poor” college teaching.

Key Words: Faculty Rating, Data Mining, Decision Trees

Few phenomena on university campuses evoke more ambivalence than student evaluation of teaching effectiveness. The data generated by these instruments appear to serve two main functions: supplying professors with formative information to improve their teaching and providing a summative index of an instructor’s effectiveness. However, disagreement exists about the validity of the process for improving or indexing instruction, a problem that casts several constituencies in an ongoing debate over the components of good teaching.

Taking Issue with Student Ratings

Critics of instructor evaluation (Adams, 1997) have argued that little evidence exists to demonstrate that student ratings improve instruction. Others (Abrami, d’Apollonia, & Cohen, 1990) suggested that such data are good indicators of student satisfaction but offer little as a measure of effective instruction. Altschuler (2001) claimed that student responses reflect an instructor’s personality and entertainment value rather than the adequacy of course content. This line of reasoning produced the “Dr. Fox” characterization, suggesting that some professors feign warmth and enthusiasm to entice high student ratings irrespective of course quality. Hoyt (1977) pointed out that many extraneous factors play a confounding role in the student rating process (e.g., course level, discipline, and workload, among others).

Greenwald and Gillmore (1997) contended that grading leniency is positively related to high student ratings. Following this line of reasoning, Eiszler (2002) examined trends in a large data set comparing students’ expectations of high grades with their tendency to assign high instructor ratings and found a positive relationship. Crumbley, Henry, and Kratchman (2001) surveyed 530 students on 18 factors related to teaching effectiveness. Of the 18 areas, students rated instructor equity in grading as most important. Further, more than one-third of the respondents reported that they researched instructors’ prior grade distributions before making a course selection, and 36% of the responding students indicated that if the rating instruments were not anonymous, they would respond differently. The authors concluded that student rating of instruction is a major contributing factor in grade inflation.

Kolitch and Dean (1999) conducted a critical analysis of the items on a “typical” student evaluation of instruction instrument and found that the majority of items emphasized an information-transmission model of teaching, leading them to conclude that most instruments define an “effective” course narrowly. The authors provided examples of alternative items reflecting an engaged-critical model of teaching (e.g., “As a result of the course, have you done anything to improve your community?”).

Howell and Symbaluk (2001) surveyed students and faculty regarding the potential value of publishing student ratings of instruction and found that although students cited many potential advantages of published ratings, faculty cited many more disadvantages. For instance, students felt that published ratings enhanced instructor accountability; faculty disagreed. Faculty felt that published student ratings lowered academic standards; students disagreed.

Support for Student Ratings

Not all investigators concur with commonly held criticisms of the course evaluation process. Felder (1992), for instance, felt that student evaluations provide a reliable and valid index of a professor’s effectiveness, especially when they are drawn from varying perspectives.

In a similar fashion, Marsh and Roche (1997) concluded that student ratings are reliable, stable, and multi-dimensional, arguing that those data are valid against many effectiveness indicators. They suggested that ratings are unaffected by a number of potentially biasing factors and are useful for improving instruction if used in conjunction with consultation.

The Dimensionality of Student Ratings

An impressive body of research addresses dimensional issues in student ratings of teaching effectiveness. Feldman (1976) offered 20 components of effective instruction organized into three higher-order categories: presentation, facilitation, and regulation. Marsh and Roche (1997) proposed a nine-factor model for teaching assessment: learning and value, instructor enthusiasm, organization and clarity, group interaction, individual rapport, breadth of coverage, examinations and grading, assignments and readings, and workload and difficulty. Kim, Damewood, and Hodge (2000) investigated the affective components of teaching assessment, proposing components similar to those found in previous research along with additional aspects: demonstrates enthusiasm, encourages student motivation, encourages student discussion, is open to constructive criticism, provides assistance outside of class, encourages students to ask for help, is considerate of students, generates equality among students, respects students, and demonstrates a positive attitude toward the course and the students. Young and Shaw (1999), using cluster analysis, identified five instructor profiles, including overall highly effective, highly effective except for communication, moderately effective on organization, and moderately effective on organization and classroom climate.

Other researchers, however, do not agree with multidimensional theories of student ratings. Greenwald and Gillmore (1997), Abrami and d’Apollonia (1991), and McKeachie (1997) contended that student ratings are dominated by a global “G” factor. They acknowledged the existence of secondary dimensions but minimized their importance in the evaluation process.

Other Approaches to Investigating Student Ratings

Chang and Hocevar (2000) applied generalizability theory to student evaluations of faculty and conceptualized the data at five levels: teachers, courses, students, items, and actions. They pointed out that students are nested within teachers and courses. With teachers as the object of measurement, increasing the number of courses sampled had the greatest impact on the generalizability estimate. Chang (2000) used multiple regression to determine the best predictors of student evaluation of instruction. Five variables defined the final model: student enthusiasm, participation, expected grade, grading standard, and course difficulty.

Using structural equation models, Shevlin, Banyard, Davies, and Griffiths (2000) found two major components, lecture and stability, in the evaluation process. They discovered, however, that a third factor, instructor charisma, strongly moderated the relationship between lecture and stability. When Clayson (1999) examined the stability of ratings, he found that students perceived instructor-related characteristics such as organization, knowledge, and fairness as improving over time, while affective characteristics failed to do so. He inferred that if the arguments of Feldman (1993) and Marsh (1991) that student ratings of professors do not change over time held, then the instruments were most likely measuring instructor personality.

Read, Rama, and Raghunandan (2001) surveyed a large number of accounting departments to clarify the importance placed on teaching effectiveness and the reliance on student ratings when making promotion and tenure decisions. They discovered an inverse relationship. Institutions that de-emphasized teaching importance gave more credence to student ratings than institutions that emphasized the importance of teaching.

Hativa and Birenbaum (2000) examined student-favored teaching approaches in two different instructional environments, engineering and education. They found that students in both areas preferred instructors who were organized, communicated effectively, and stimulated student interest.

Sheehan and DuPrey (1999) reviewed over 3,600 student evaluations in an attempt to build a predictive model for responses to the item “I learned a lot in this class.” Five items accounted for the majority of variance in student self-perceived learning: informative lectures, approaches to assessment, instructor preparation, interesting lectures, and student perception of challenge in the course. Radmacher and Martin (2001) used hierarchical regression techniques with several classes of variables as predictors of student evaluation of instructor effectiveness. Their modeling procedures showed that extraversion was the only significant predictor of perceived instructor effectiveness.

Stapleton and Murkison (2001) addressed yet another issue in student ratings: fairness. They correlated measures of instructor excellence with student study production and found both positive and negative results, but they found a positive relationship between instructor excellence and a construct they defined as learning production. They also found that perceived instructor excellence related positively to expected grades. Based on these findings, the authors recommended that evaluation variables be weighted and ranked for each administration in order to enhance fairness in instructor evaluation.

Where Does the Literature on Student Ratings Lead?

In addition to measurement issues about teaching effectiveness, the literature on student ratings features disagreement about the validity of this approach for indexing instructional effectiveness, with faculty and students sometimes cast as antagonists in their ability to recognize and respond to effective teaching.

Although the literature contained elegant philosophical, theoretical, and methodological discussions about the student evaluation process in the contexts of validity and stability, relatively few discussions addressed how these ratings related to the improvement of instruction or to the resolution of the apparent disagreement between students and faculty about the evaluation process. Do students evaluate instructors through predetermined assumptions about effective teaching, or is their evaluation formed by course experiences? If the former, what similarities and differences exist in those assumptions according to student and instructor demographics, student learning styles, course level, academic pursuit, and type of institution?

Context for the Present Study

The University of Central Florida (UCF) administers an end-of-course student evaluation instrument. The form, entitled “Student Perception of Instruction,” is a 16-item, Likert-type device that students use to rate their instructors (i.e., Excellent, Very Good, Good, Fair, Poor). Respondents have the opportunity to provide written comments about the instructor, and considerable demographic information (course level, college, department, and instructor) can be obtained from the instrument because the class and date are recorded on the form. The results are available in hard copy at the university library but are not published or made available electronically. After classes end, instructors receive the original forms with student comments and a summary of course rating responses.

The instrument used at UCF comprises two separately designed item sets. A university-wide committee developed the first group of eight questions, and the Florida Board of Regents provided the second set of items, which are common to institutions in the State University System of Florida. However, this distinction between item sets is not evident on the instrument. Instructors may customize the form by adding, but not deleting, items. No other student demographic information (e.g., anticipated grade) is collected. Table 1 provides the items of UCF’s student rating instrument.

Insert Table 1 about here

In 1997 the UCF Faculty Senate passed a resolution authorizing a study of the results from the “Student Perception of Instruction” instrument to explore its validity compared to alternative teaching evaluation methods. As a result, the Office of Academic Affairs funded a two-year study of those data by the authors.

Data Collection and Ethics Protocol

The investigators assembled a data set containing all student ratings of instructors for the 1996-97, 1997-98, 1998-99, 1999-2000 and the 2000-01 academic years. The file contained 588,575 student records with responses to the 16 Likert items and corresponding demographic information. The investigators reformatted the file so that it comprised only the responses to the 16 items and indicators of course level (lower undergraduate, upper undergraduate, or graduate), college (Arts and Sciences, Business Administration, Education, Engineering and Computer Science, and Health and Public Affairs), and the academic year. No further identifying information was available in the analysis file. Throughout the study, the investigators preserved department and instructor anonymity.
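
As a point of reference, the reformatting step can be outlined in a few lines of code. The sketch below is illustrative only, not the workflow actually used in the study, and the file and column names (student_ratings_raw.csv, item_1 through item_16, course_level, college, academic_year) are hypothetical.

import pandas as pd

# Hypothetical file and column names; a sketch, not the study's actual workflow.
raw = pd.read_csv("student_ratings_raw.csv")

item_cols = [f"item_{i}" for i in range(1, 17)]                      # the 16 Likert items
keep_cols = item_cols + ["course_level", "college", "academic_year"]

# Keep only the rating items and the three indicators, discarding department,
# instructor, and any other identifying fields.
analysis = raw[keep_cols].copy()
analysis.to_csv("ratings_analysis_file.csv", index=False)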

A Decision Tree Analysis

In order to explore these data, the authors incorporated decision tree data mining techniques that identify classification rules for an instructor receiving “Excellent” and “Poor” overall ratings (Breiman, Friedman, Olshen, & Stone, 1984). The authors adopted this approach for the following reasons.

First, decision trees are readily applicable to large data sets such as this one. The user does not have to impute missing values because these models have built-in mechanisms for handling them, such as the floating-category approach implemented in Enterprise Miner and the surrogate method in Classification and Regression Trees (CART); in a data set such as this one, with many missing values, imputation would be a difficult and time-consuming task. Second, decision trees are among the most efficient methods for studying problems of this nature. Other methods, such as logistic regression, cannot efficiently handle all of the variables under consideration. There are 18 independent variables involved here; two of them have three levels and the other sixteen have five levels. A logistic regression model would therefore require 68 dummy variables and 2,278 two-way interactions, which is very difficult even with today’s computers. The decision tree approach, on the other hand, can perform the analysis efficiently, even when the investigator considers higher-order interactions. Third, decision trees constitute an appropriate method for this problem because fifteen of the variables are ordinal in their scaling. Although numerical values can be assigned to these categories, no such assignment is unique; decision trees, however, use the ordinal structure of the variables directly in deriving a solution. Fourth, the rules produced by decision trees have an “if-then” structure that is readily comprehensible. Fifth, the quality of the rules can be assessed with percentages of accurate classification or with odds ratios that are easily understood. The procedure produces tree-like rule structures that predict outcomes, and researchers customarily test the quality of those rules on a data set independent of the one on which they were developed.
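
The dimensionality argument above can be verified with a short calculation. The sketch below simply reproduces the arithmetic stated in the text (68 reference-coded dummy variables and 2,278 pairwise interaction terms) and assumes nothing beyond the reported numbers of predictors and levels.

from math import comb

# Predictor structure as stated in the text: 18 categorical variables,
# two with three levels and sixteen with five levels.
levels = [3] * 2 + [5] * 16

dummies = sum(k - 1 for k in levels)   # reference-coded dummy variables
interactions = comb(dummies, 2)        # all pairwise products of those dummies

print(dummies)        # 68
print(interactions)   # 2278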

The Model-Building Procedure

For this study, the investigators used classification and regression tree methods (CART; Breiman et al., 1984) executed with SAS Enterprise Miner (SAS, 2000). Because of its strong variance-sharing tendencies with the other variables, the Overall Rating of the Instructor served as the dependent measure, with the previously mentioned indicator variables (college, course level, and academic year) and the remaining fifteen questions on the instrument serving as predictors.
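
For readers who wish to experiment with a comparable model outside of SAS Enterprise Miner, the sketch below sets up a rough open-source analogue of this design using scikit-learn. It is not the procedure used in the study: the column names are hypothetical, the Likert items are assumed to be integer-coded from 1 (Poor) to 5 (Excellent), and rows with missing values are dropped because scikit-learn's trees lack CART's surrogate-split mechanism.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("ratings_analysis_file.csv")        # hypothetical analysis file

target = "item_16"                                     # assumed column for the overall rating
item_predictors = [f"item_{i}" for i in range(1, 16)]  # the remaining fifteen items
demo_predictors = ["course_level", "college", "academic_year"]

# Incomplete rows are dropped here for simplicity; the study's CART software
# handled missing responses directly.
data = data.dropna(subset=[target] + item_predictors)

# One-hot encode the nominal indicators; the Likert items keep their ordered
# integer coding so the ordinal structure is preserved.
X = pd.get_dummies(data[item_predictors + demo_predictors], columns=demo_predictors)
y = data[target]

tree = DecisionTreeClassifier(min_samples_leaf=100, random_state=0)
tree.fit(X, y)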

Tree-based methods are recursive, bisecting the data into disjoint subgroups called terminal nodes or leaves. CART analysis incorporates three stages: data splitting, pruning, and homogeneity assessment.

Splitting the data into two (binary) subsets at each stage is the first feature of the model. With each split, the data in the resulting subsets become more and more homogeneous. The tree continues to split the data until the observations in each subset are either very few (say, fewer than 100) or all belong to one category (e.g., all observations in a subset have the same rating). Typically, this “growing the tree” stage results in far too many terminal nodes for the model to be useful. The extreme case occurs when the number of terminal nodes equals the number of observations. Such models are uninformative because they produce very few rules with explanatory power.
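
The effect of these stopping rules can be illustrated with a small scikit-learn sketch on synthetic, Likert-style data (no actual ratings are used). Requiring at least 100 observations per terminal node keeps the tree compact, whereas an unrestricted tree grows toward the uninformative extreme described above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: fifteen Likert-style predictors and a noisy binary target.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(5000, 15))
y = (X.mean(axis=1) + rng.normal(0, 0.5, size=5000) > 3.5).astype(int)

# Splitting stops when a leaf would hold fewer than 100 observations.
restrained = DecisionTreeClassifier(min_samples_leaf=100, random_state=0).fit(X, y)

# With no restriction, the tree keeps splitting toward one leaf per observation.
overgrown = DecisionTreeClassifier(min_samples_leaf=1, random_state=0).fit(X, y)

print(restrained.get_n_leaves(), overgrown.get_n_leaves())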

The CART procedure solves this problem with pruning methods that reduce the dimensionality of the system. In practice, CART splits the data into two pieces: the first data set grows the tree and the second prunes it, thereby validating the model. Through the pruning process, CART reduces the original tree to a nested set of subtrees. Although homogeneity based on the training data set can always be improved, the same is not necessarily true in the validation set. Because the validation data are not used in the growing process, they typically give an honest estimate of the best tree size.
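
The grow-then-prune logic can likewise be sketched with scikit-learn's minimal cost-complexity pruning, which implements the pruning criterion of Breiman et al. (1984). One portion of the (here synthetic) data grows the tree; the held-out portion is used only to choose among the nested subtrees, mirroring the validation role described above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data, as before.
rng = np.random.default_rng(1)
X = rng.integers(1, 6, size=(10000, 15))
y = (X[:, 0] + rng.normal(0, 1, size=10000) >= 4).astype(int)

X_grow, X_val, y_grow, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

# The pruning path yields a sequence of complexity values, each defining a nested subtree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_grow, y_grow)

best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    subtree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_grow, y_grow)
    score = subtree.score(X_val, y_val)   # honest estimate from data not used to grow the tree
    if score > best_score:
        best_alpha, best_score = alpha, score

final_tree = DecisionTreeClassifier(random_state=1, ccp_alpha=best_alpha).fit(X_grow, y_grow)
print(final_tree.get_n_leaves(), best_score)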