Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

31.

Bray and Howard 1980 "Interaction of Teacher and Student Sex and Sex Role Orientations and Student Evaluations of College Instruction" selected 6 instructional faculty at the College of Social Sciences at the University of Houston from each of the following categories: feminine man, androgynous man, masculine man, feminine woman, androgynous woman, and masculine woman. For each of these 36 faculty members, students in one of the faculty member's undergraduate courses completed, during the final week of classes, the IDEA questionnaire for measuring student evaluations of teaching and the Bem Sex Role Inventory (BSRI).

Bray and Howard 1980 indicate that the IDEA had 40 items across four parts: 20 instructor items, 10 student-rated progress items, 3 course evaluation items, and 7 items for the student self-rating of the course (p. 243). The article indicates that "During the last full week of classes (prior to final examination week) the appropriate students in each target class completed BSRI and IDEA questionnaires" (p. 244) and that "Two sections of IDEA were employed as outcome criteria: (1) the average of the responses on Part 2: Student Rated Progress (progress), and (2) student satisfaction with the instructor (satisfaction; IDEA question 37)" (p. 243). From what I can tell, the article thus uses only 11 of the 40 measured IDEA items and does not report results for any of the 20 instructor items from the first part of the IDEA questionnaire.

Results indicate that "Androgynous teachers received somewhat higher student evaluations than did masculine and feminine teachers" (p. 246), but there is no statistical control to assess whether any differences are due to differences in teaching styles or other factors.

---

32.

Bennett 1982 "Student Perceptions of and Expectations for Male and Female Instructors: Evidence Relating to the Question of Gender Bias in Teaching Evaluation" reported on data from 253 liberal arts college students in nonscience introductory courses. Students were instructed to respond to a self-returned questionnaire about a specific course.

Results in Table 2 indicate that, of the four factors in questionnaire responses, female instructors had higher ratings than did male instructors for non-authoritarian interpersonal style and charisma (potency) and did not differ at p<0.05 from male instructors on the self-assurance factor or the instructional approach factor.

Students reported more scheduled office visits for female instructors than for male instructors and were more likely to report feeling free to contact a female instructor at home than to contact a male instructor at home.

Results also suggested that the explanatory power of certain predictors of performance ratings differed for female instructors and for male instructors. From page 176:

Additionally, a highly structured instructional approach—described by students as communicating greater professionalism—was consistently more important for women's performance ratings than for men's. This was especially true for students' ratings of instructor's organization, clarity, and coherence in classroom presentation (rs = .76 and .53, for instructional approach for women and men, respectively), command of material for classroom presentation (rs = .49 and .27), and overall evaluation (rs = .57 and .30). Although in the collective student mind men and women do not differ in instructional approach, students are clearly more tolerant of what they perceive as a lack of formal professionalism in the conduct of teaching from their male professors, demanding of women a higher standard of formal preparation and organization.

Moreover, from pages 177 and 178:

...male instructors are judged independently of students' personal experiences of contact and access, whereas female instructors are judged far more closely in this regard. In this sense women are negatively evaluated when they fail to meet this gender appropriate expectation (and rewarded when they do so), although this study cannot provide evidence of the obverse—whether women are devalued for achievement in stereotypically masculine domains.

The "(and rewarded when they do so)" reflects how evidence that that explanatory power of the predictors of performance ratings differs for female instructors and for male instructors is mixed evidence of unfair bias in these performance ratings. If it is true that, for example, organization matters more for the performance ratings of female instructors than for male instructors, that might merely mean that female instructors have more ability to influence their performance ratings through their level of organization. That's not obviously worse to me than, say, male instructors not being rewarded for higher levels of organization. For a recent discussion of a related finding, here is a passage from Hesli et al. 2012 (p. 486, emphasis in the original):

Another critical finding that we note with some consternation is that among women, the probability of being an associate professor over an assistant professor is unrelated (given other controls) to the total number of publications (model 4D). This confirms that the promotion process at this level involves different dynamics for men as compared with women.

I'm not sure that the Hesli et al. 2012 results indicate that there is a p<0.05 difference in the explanatory power of publications. The coefficient for publications is statistically significant for men and not for women, but that does not necessarily indicate that a difference between the coefficients can be inferred. For what it's worth, the "publications" logit coefficient for men is 0.713 with a 0.230 standard error and for women is 0.303 with a 0.453 standard error.
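As a rough check, one can test the difference between the two coefficients directly, under the assumption that the men's and women's models were estimated on independent samples; here is a minimal sketch of that calculation using the reported estimates:

```python
from scipy.stats import norm

# Reported "publications" logit coefficients and standard errors
# (Hesli et al. 2012, men's and women's models)
b_men, se_men = 0.713, 0.230
b_women, se_women = 0.303, 0.453

# z-test for the difference between two independently estimated coefficients
z = (b_men - b_women) / (se_men**2 + se_women**2) ** 0.5
p = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, two-sided p = {p:.2f}")  # z = 0.81, two-sided p = 0.42
```

On that back-of-the-envelope test, the male/female difference in the publications coefficient is itself nowhere near statistical significance.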

---

33.

Brooks 1982 "Sex Differences in Student Dominance Behavior in Female and Male Professors' Classrooms" isn't a study about student evaluations of teaching. Rather, the study involved analyzing behaviors of first-year students in a master's of social work program, such as student frequency of speaking in class, student frequency interrupting a fellow student, and student frequency interrupting the professor.

Results indicated that "...male students interrupted both male and female professors significantly more often than female students (p < .01 and p < .001, respectively), and interrupted female professors significantly more often than male professors (p < .001)" (p. 687). Interruptions by female students were more evenly distributed than interruptions by male students: of all female interruptions of professors, 45 percent were of female professors; of all male interruptions of professors, 74 percent were of female professors.

---

Comments are open if you disagree, but I don't think that data from the early 1980s is relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond.

From what I can tell, the main evidence of bias in student evaluations of teaching to be concerned with in these three studies is from Bennett 1982. But, from what I can tell, performance evaluations being better predicted by certain criteria for female instructors than for male instructors doesn't necessarily produce a gender-wide bias. For example, imagine these performance evaluation scores:

2 for a disorganized male instructor

2 for an organized male instructor

1 for a disorganized female instructor

3 for an organized female instructor

Organization matters more for the female instructors than for the male instructors, but that particular difference in evaluation criteria does not cause a gender-wide bias. Following Hesli et al. 2012, it could be noted with some consternation that performance evaluation scores for men are unrelated to the instructor's organization.
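To make the toy example concrete, here is a minimal sketch that computes, for each gender, the mean score and the organized-minus-disorganized gap from the hypothetical scores above:

```python
# Hypothetical scores from the example above: (gender, organized, score)
scores = [
    ("male", False, 2),
    ("male", True, 2),
    ("female", False, 1),
    ("female", True, 3),
]

for gender in ("male", "female"):
    ratings = [s for g, o, s in scores if g == gender]
    org = [s for g, o, s in scores if g == gender and o]
    dis = [s for g, o, s in scores if g == gender and not o]
    mean = sum(ratings) / len(ratings)
    gap = sum(org) / len(org) - sum(dis) / len(dis)
    print(f"{gender}: mean = {mean:.1f}, organized-minus-disorganized gap = {gap:.1f}")

# male: mean = 2.0, organized-minus-disorganized gap = 0.0
# female: mean = 2.0, organized-minus-disorganized gap = 2.0
```

Organization perfectly predicts the female instructors' scores and does not predict the male instructors' scores, but the mean score is 2.0 for each gender, so the differential predictive power does not by itself produce a gender-wide gap.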


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

28.

I did not locate the text of Hogan 1978 "Review of the Literature: The Evaluation of Teaching in Higher Education", but I'm guessing from the "Review of the Literature" title that Hogan 1978 doesn't report novel data.

---

29.

Kaschak 1978 "Sex Bias in Student Evaluations of College Professors" had 100 seniors or first-year graduate students at San Jose State University (50 male and 50 female) rate fictional professors in business administration, chemistry, home economics, elementary education, psychology, and history, based on descriptions of the professors' teaching methods and practices; each professor had a male name or a female name, depending on the form that a student received. Ratings were reported for 1-to-10 scales for: effective/ineffective, concerned/unconcerned, likeable/not at all likeable, poor/excellent, powerless/powerful, and definitely would/would not take the course.

Male students had mean ratings on each item that were more positive for the male professor than for the female professor. Female students had statistically different mean ratings by professor sex only for indicating that the male professor is more powerful and for indicating a preference for taking the female professor's course.

---

30.

Lombardo and Tocci 1979 "Attribution of Positive and Negative Characteristics of Instructors as a Function of Attractiveness and Sex of Instructor and Sex of Subject" had 120 introductory psychology students (60 male and 60 female) rate a person in a photograph, with the experimental manipulation that the person in the photograph was male or female and was attractive or unattractive. Students were told that the photograph was of Mary Dickson or Andrew Dickson and were told that the person had earned a Ph.D. and had just finished a second year of teaching. Ratings included nine scales (such as from intelligent to not intelligent) and the items "Compared with the faculty members at this college, how would you rate the over-all teaching performance of this instructor?" and "How much would you like to take a course from this faculty member?".

Results indicated that "Each of the dependent measures was analyzed by a 2 (attractive vs unattractive) X 2 (sex of pictured person) X 2 (sex of subject) analysis of variance...A significant main effect was found...for attractive-unattractiveness. The absence of other main effects or interactions indicated that the attractive pictures were rated significantly more attractive" (p. 493) and that "An interaction between the attractiveness of the picture and the sex of the instructor...on the question of how much they would like to take a course from this instructor indicated that all subjects preferred to take a course from a male" (p. 494).

---

Comments are open if you disagree, but I don't think that data from the 1970s is relevant for discussions of whether student evaluations of teaching should be used in employment decisions made in 2019 or beyond. For example, there has been a substantial increase since the 1970s in female representation among students and faculty, which can be plausibly expected to have reduced biases against female college faculty present during the Nixon administration.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

25.

Elmore and LaPointe 1974 "Effects of Teacher Sex and Student Sex on the Evaluation of College Instructors" analyzed student evaluation data from courses from various departments of the Southern Illinois University at Carbondale in 1971. Complete data were available from 1,259 students in 38 pairs of courses matched on course number and instructor sex. For the 20 instructor evaluation items analyzed, only two items had a mean difference between female instructors and male instructors at a p=0.01 threshold: male instructors were rated higher for "spoke understandably", and female instructors were rated higher for "promptly returned homework and tests".

I'm not sure why Elmore and LaPointe 1974 is included in a list of studies finding bias in standard evaluations of teaching. No statistically significant difference was reported for 18 of the 20 instructor evaluation items, and, for the two items for which there was a reported difference, one difference favored male instructors and the other difference favored female instructors. But, more importantly, the Elmore and LaPointe 1974 research design does not permit the inference that student ratings deviated from reality; for example, no evidence is reported that indicates that the female instructors didn't return homework and tests more promptly on average than the male instructors did.

---

26.

Elmore and LaPointe 1975 "Effect of Teacher Sex, Student Sex, and Teacher Warmth on the Evaluation of College Instructors" analyzed student evaluation data from courses from various departments of the Southern Illinois University at Carbondale in 1974. Data were available from 838 students in 22 pairs of courses matched on course and instructor sex. Twenty standard instructor evaluation items were used, plus instructor responses and student responses to an item about whether the instructor's primary interest lay in the course content or in the students, and a five-point measure of how warm a person the instructor was. The p-value threshold was 0.0025.

Results indicated that "When students rate their instructor's interest and warmth, teachers perceived as warmer or primarily interested in students receive higher ratings in effectiveness regardless of their sex", that "In general, female faculty receive significantly higher effectiveness ratings than do male faculty when they rate themselves low in warmth or interested in course content", and that "Male teachers who rate themselves high in warmth or primarily interested in students receive significantly higher ratings than male teachers who rate themselves low in warmth or primarily interested in course content, respectively" (p. 374).

I'm not sure how these data establish an unfair bias in student evaluations of teaching.

---

27.

Ferber and Huber 1975 "Sex of Student and Instructor: A Study of Student Bias" reported on responses to three items from students in the first class meeting of four large introductory economics or sociology courses at the University of Illinois Urbana in 1972.

The first item asked students to rate men college teachers that they had had in seven academic areas and women college teachers that they had had in seven academic areas. Results in Table 1 indicate that, across the seven academic areas, the mean rating for men college teachers was identical to the mean rating for women college teachers (2.24).

The second question asked about student preferences for men instructors or women instructors in various types of classroom situations. Results in Table 2 indicate that most students did not express a preference, but, of the students who did express a preference, the majority preferred a man instructor. For example, of 1,241 students, 39 percent expressed a preference for a man instructor in a large lecture and 2 percent expressed a preference for a woman instructor in a large lecture.

The third item asked students to rate their level of agreement with a statement, attributed to a man or to a woman. For one statement, the prompt was: "A well-known American economist [Mary Killingsworth/Charles Knight] proposes that compulsory military service be replaced by the requirement that all young people give one year of service for their country". Results in Table 6 indicate that the mean level of agreement did not differ between Mary and Charles at p<0.05 among male students, among female students, or among the full sample.

For the other statement, the prompt was: "According to the contemporary social theorist [Frank Merton/Alice Parsons], in order to achieve equal educational opportunity in the United States, no parents should be allowed to pay for their children's education; every college student should borrow from the federal government to pay for tuition and living expenses". Results in Table 6 indicate that, on a rating scale from 1 for strongly agree to 5 for strongly disagree, the mean level of agreement differed at p<0.05 among male students, among female students, and among the full sample, with Alice favored over Frank (respective overall means of 3.38 and 3.66).

I'm not sure why Ferber and Huber 1975 is included in a list of studies finding bias in standard evaluations of teaching. The first item is the only item directly on point for assessing bias in student evaluations of teaching, and there was no overall difference on that item between male instructors and female instructors and no evidence that the lack of a difference was unfair.

---

Comments are open if you disagree, but I don't think that any of these three studies provide sufficient evidence to undercut the use of student evaluations in employment decisions.

And it's worth considering whether these data from the Nixon administration should be included in the main Holman et al. 2019 list, given that the sum of "76" studies "finding bias" in the Holman et al. 2019 list is being used to suggest inferences about the handling of student evaluations of teaching in contemporary times.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

22.

Heilman and Okimoto 2007 "Why Are Women Penalized for Success at Male Tasks?: The Implied Communality Deficit" reports on three experiments regarding evaluations of fictional vice presidents of financial affairs. The experiments do not concern student evaluations of teaching, so it's not clear to me that Holman et al. 2019 should classify this article under "Evidence of Bias in Standard Evaluations of Teaching".

---

23.

Punyanunt-Carter and Carter 2015 "Students' Gender Bias in Teaching Evaluations" indicated that 58 students in an introductory communication course were asked to complete a survey about a male professor or about a female professor. The article did not report inferential statistics, and, given the reported percentages and sample sizes, it's not clear to me that this study should be classified as finding bias.

For example, here are results from the first question, about instructor effectiveness, for which the article reported results only for the percentage of each student gender that agreed or strongly agreed that the instructor was effective:

For the female professor:
65% of 17 males, so 11 of 17
67% of 15 females, so 10 of 15

For the male professor:
69% of 13 males, so 9 of 13
69% of 13 females, so 9 of 13

Overall, that's 21 of 32 (66%) for the female professor and 18 of 26 (69%) for the male professor, producing a p-value of 0.77 in a test for the equality of proportions.
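For anyone who wants to check that arithmetic, here is a sketch of the corresponding pooled two-sample z-test for the equality of proportions (no continuity correction):

```python
from statsmodels.stats.proportion import proportions_ztest

# "Effective" agreement: 21 of 32 students for the female professor,
# 18 of 26 students for the male professor
z, p = proportions_ztest(count=[21, 18], nobs=[32, 26])
print(f"z = {z:.2f}, p = {p:.2f}")  # z = -0.29, p = 0.77
```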

---

24.

Young et al. 2009 "Evaluating Gender Bias in Ratings of University Instructors' Teaching Effectiveness" had graduate students and undergraduate students evaluate on 25 items "a memorable college or university teacher of their choice" (p. 4). Results indicated that "Female students rated their female instructors significantly higher on pedagogical characteristics and course content characteristics than they rated their male instructors. Also, male students rated male instructors significantly higher on the same two factors. Interpersonal characteristics of male and female instructors were not rated differently by the male and female students" (p. 9).

I'm not sure how much to make of the finding quoted above based on this study, given results in Table 4 of the article. The p-value section of Table 4 has a column for each of the three factors (interpersonal characteristics, pedagogical characteristics, and course content characteristics) and has seven rows, for student gender (A), student level (B), instructor gender (C), AxB, AxC, BxC, and AxBxC. So the table has 21 p-values, only 2 of which are under 0.05; the average of the 21 p-values is 0.52.
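For a rough sense of how unremarkable 2 significant results out of 21 tests is, here is the chance of at least 2 p-values under 0.05 across 21 truly null tests, under the simplifying (and not exactly correct) assumption that the tests are independent:

```python
from scipy.stats import binom

# P(at least 2 of 21 independent null tests produce p < 0.05)
print(f"{1 - binom.cdf(1, 21, 0.05):.2f}")  # 0.28
```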

---

Comments are open if you disagree, but I don't think that any of these three studies provide sufficient evidence to undercut the use of student evaluations in employment decisions.


Let's pause our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias", to discuss three studies of student evaluations of teaching that are not in the Holman et al. 2019 list. I'll use the prefix "B" to refer to these bonus studies.

---

B1.

Meltzer and McNulty 2011 "Contrast Effects of Stereotypes: 'Nurturing' Male Professors Are Evaluated More Positively than 'Nurturing' Female Professors" reported on an experiment in which undergraduates rated a psychology job candidate, with variation in candidate gender (Dr. Michael Smith or Dr. Michelle Smith), variation in whether the candidate was described as "particularly nurturing", and variation in whether the candidate was described as "organized" or "disorganized". Participants responded to items such as "Do you think Dr. Smith's responses to students' questions in class would be helpful?" and "How do you think you would rate Dr. Smith's overall performance in this course?". Results indicated no main effect for gender, but the nurturing male candidate was rated higher than the control male candidate and the nurturing female candidate and marginally higher than the control female candidate.

For some reason, results for the "organized"/"disorganized" variation were not reported.

---

B2.

Basow et al. 2013 "The Effects of Professors' Race and Gender on Student Evaluations and Performance" reported on an experiment in which undergraduates from psychology, economics, and mathematics courses evaluated a three-minute engineering lecture from an animated instructor whose race was Black or White and whose sex was male or female; participants also took a quiz on lecture content. Results indicated that "student evaluations did not vary by teacher gender", that "students rated the African American professor higher than the White professor on several teaching dimensions", and that students in the male instructor condition and in the White instructor condition did better on the quiz (p. 359).

---

B3.

I don't have access to Chisadza et al. 2019 "Race and Gender Biases in Student Evaluations of Teachers", but the highlights indicate that "We use an RCT to investigate race and gender bias in student evaluations of teachers" and that "We note biases in favor of female lecturers and against black lecturers". The abstract at Semantic Scholar indicates that the experiment was conducted in South Africa and that "Students are randomly assigned to follow video lectures with identical narrated slides and script but given by lecturers of different race and gender".

---

Comments are open if you disagree, but I don't think that there is much in B1 or B2 that would undercut the use of student evaluations in employment decisions. The experiments have high internal validity, but B1 had no main effect for gender and the B2 results aren't strong or consistent. Moreover, B1 and B2 use brief stimuli, so I don't know that the results are sufficiently informative about student evaluations at the end of a 15-week course.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

19.

Miller and Chamberlin 2000 "Women Are Teachers, Men Are Professors: A Study of Student Perceptions" reported on a study in which students in sociology courses were asked to indicate their familiarity with faculty members on a list and, for faculty members that the student was familiar with, to indicate the highest educational degree that the student thought the faculty member had attained. The listed faculty members were the faculty members in the sociology department, plus a fictitious man and a fictitious woman; footnote 6 indicates that no student reported familiarity with either fictitious faculty member. Results indicated that "controlling for faculty salary, seniority, rank, and award nomination rate, the level of educational attainment attributed to male classroom instructors is substantially and significantly higher than it is for women" (p. 294).

This study isn't about student evaluations of teaching and, from what I can tell, any implications of the study for student evaluations of teaching should be detectable in student evaluations of teaching.

---

20.

From what I can tell, the key finding mentioned above from Miller and Chamberlin 2000 did not replicate in Chamberlin and Hickey 2001 "Student Evaluations of Faculty Performance: The Role of Gender Expectations in Differential Evaluations", which indicated that: "Male versus female faculty credentials and expertise were also nonsignificant on items assessing student perceptions of the highest degree received by the faculty member, the rank of the faculty member, and whether the faculty member was tenured" (p. 10). Chamberlin and Hickey 2001 reported evidence of male faculty being rated differently than female faculty on certain items, but no analysis was reported that assessed whether these differences in ratings could be accounted for by plausible alternate explanations such as faculty performance.

---

21.

Sprague and Massoni 2005 "Student Evaluations and Gendered Expectations: What We Can't Count Can Hurt Us" analyzed data from 66 students at a public university on the East Coast and 223 students at a public university in the Midwest in 1999. Key data were student responses to a prompt to print up to four adjectives to describe the worst teacher that the student ever had and then to print up to four adjectives to describe the best teacher that the student ever had. Results were interpreted to indicate that "Men teachers are more likely to be held to an entertainer standard...[and]...Women teachers are held to a nurturer standard" (p. 791). Table V indicates that Caring is the most common factor for the best male teachers and that Uncaring is the second most common factor for the worst male teachers, so it's not obvious to me that the data permit a strong inference that men aren't also held to a nurturer standard.

---

Comments are open if you disagree, but I don't think that studies 19 and 20 report data indicating unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations. Study 21 (Sprague and Massoni 2005) reported results suggesting a difference in student expectations for male faculty and female faculty, but I don't know that there's enough in that study to undercut the use of student evaluations in employment decisions.


Let's continue our discussion of studies in Holman et al. 2019 "Evidence of Bias in Standard Evaluations of Teaching" listed as "finding bias". See here for the first entry in the series and here for other entries.

---

16.

Huston 2006 "Race and Gender Bias in Higher Education: Could Faculty Course Evaluations Impede Further Progress toward Parity" is a review that, as far as I can tell, does not report novel data on unfair sex or race bias in student evaluations of teaching.

Sandler 1991 "Women Faculty at Work in the Classroom: Or, Why It Still Hurts To Be a Woman in Labor" is a review/essay-type of publication.

---

17.

Miles and House 2015 "The Tail Wagging the Dog; An Overdue Examination of Student Teaching Evaluations" [sic for the semicolon] reported on an analysis of student evaluations from a southwestern university College of Business, with 30,571 cases from 2011 through 2013 for 255 professors across 1,057 courses with class sizes from 10 to 190. The mean rating for the 774 male-instructed courses did not statistically differ from the mean rating for the 279 female-instructed courses (p=0.33), but Table 7 indicates that the 136 male-instructed large required courses had a higher mean rating than the 30 female-instructed large required courses (p=0.01). I don't see results reported for a gender difference in small courses.

For what it's worth, page 121 incorrectly notes that scores from male-instructed courses range from 4.96 to 4.26; the 4.96 should be 4.20 based on the lower bound of 4.196 in Table 4. Moreover, Hypothesis 6 is described as regarding a gender difference for "medium and large sections of required classes" (p. 119) but the results are for "large sections of required classes" (p. 122, 123) and the discussion of Hypothesis 6 included elective courses (p. 119), so it's not clear why medium classes and elective courses weren't included in the Table 7 analysis.

---

18.

Martin 2016 "Gender, Teaching Evaluations, and Professional Success in Political Science" reports on publicly available student evaluations for undergraduate political science courses from a southern R1 university from 2011 through 2014 and a western R1 university from 2007 through 2013. Results for the items, on a five-point scale, indicated little gender difference in small classes of 10 students, a mean male instructor rating 0.1 and 0.2 points higher than the mean female instructor rating for classes of 100, and a mean male instructor rating 0.5 points higher than the mean female instructor rating for classes of 200 or 400.

The statistical models had predictors only for instructor gender, class size, and an interaction term of instructor gender and class size. No analysis was reported that assessed whether ratings could be accounted for by plausible alternate explanations such as course or faculty performance.
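For reference, the described specification would look something like the minimal sketch below, in which the data frame and variable names are hypothetical; the point is that nothing in the model adjusts for course type, level, discipline, or instructor performance:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in data: one row per course offering
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),          # 1 if the instructor is a woman
    "class_size": rng.integers(10, 400, n),   # course enrollment
})
df["rating"] = 4.0 + rng.normal(0, 0.4, n)    # mean course rating, 1-to-5 scale

# Instructor gender, class size, and their interaction: the interaction lets
# the gender gap vary with class size, but any omitted variable that
# correlates with both gender and ratings is absorbed into the gender terms.
model = smf.ols("rating ~ female * class_size", data=df).fit()
print(model.params)
```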

---

Comments are open if you disagree, but I don't think that any of these three studies report a novel test for unfair sex or race bias in student evaluations of teaching using a research design with internal validity, with internal validity referring to an analysis that adequately addresses plausible alternate explanations. The interaction of instructor gender and class size that appeared in Miles and House 2015 and Martin 2016 appears to be worth further consideration in a research design that adequately addresses plausible alternate explanations.
