Comments on "Mitigating gender bias in student evaluations of teaching"
The Peterson et al. 2019 PLOS ONE article "Mitigating gender bias in student evaluations of teaching" reported on an experiment conducted with students across four Spring 2018 courses: an introduction to biology course taught by a female instructor, an introduction to biology course taught by a male instructor, an introduction to American politics course taught by a female instructor, and an introduction to American politics course taught by a male instructor. Students completing evaluations of these teachers were randomly assigned to receive or to not receive a statement about how student evaluations of teachers are often biased against women and instructors of color.
The results clearly indicated that "this intervention improved the SET scores for the female faculty" (p. 8). But higher scores for the female faculty do not by themselves establish the mitigation of bias claimed in the title of the article because, as the article indicates, "It is also possible that the students with female instructors who received the anti-bias language overcompensated their evaluations for the cues they are given" (p. 8).
---
For the sake of illustration, let's assume that the two American politics teachers were equally good at teaching and that the two biology teachers were equally good at teaching; if so, data from the Peterson et al. 2019 experiment for the v19 overall evaluation of teaching item illustrate how the treatment can both mitigate and exacerbate gender bias in student evaluations.
Here are the mean student ratings on v19 for the American politics instructors:
4.65 Male American politics teacher CONTROL
4.17 Female American politics teacher CONTROL
4.58 Male American politics teacher TREATMENT
4.53 Female American politics teacher TREATMENT
So, for the American politics teachers, the control had a 0.49 disadvantage for the female teacher (p=0.02), but the treatment had only a 0.05 disadvantage for the female teacher (p=0.79). But here are the means for the biology teachers:
3.72 Male biology teacher CONTROL
4.02 Female biology teacher CONTROL
3.73 Male biology teacher TREATMENT
4.44 Female biology teacher TREATMENT
So, for the biology teachers, the control had a 0.29 disadvantage for the male teacher (p=0.25), and the treatment had a 0.71 disadvantage for the male teacher (p<0.01).
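The gaps above can be recomputed from the reported means. The sketch below is only an illustration: the dictionary layout, the `female_minus_male` helper, and the sign convention (negative means a disadvantage for the female teacher) are assumptions made here, not part of the original analysis. Because it starts from means rounded to two decimals, it gives 0.48 and 0.30 for the control gaps rather than the 0.49 and 0.29 reported above, presumably because those figures came from unrounded means.

```python
# Mean v19 ratings as reported above, rounded to two decimals.
means = {
    ("politics", "control"): {"male": 4.65, "female": 4.17},
    ("politics", "treatment"): {"male": 4.58, "female": 4.53},
    ("biology", "control"): {"male": 3.72, "female": 4.02},
    ("biology", "treatment"): {"male": 3.73, "female": 4.44},
}


def female_minus_male(cell):
    """Gap in mean rating; negative values disadvantage the female teacher."""
    return round(cell["female"] - cell["male"], 2)


gaps = {key: female_minus_male(cell) for key, cell in means.items()}

for course in ("politics", "biology"):
    control = gaps[(course, "control")]
    treatment = gaps[(course, "treatment")]
    # A positive shift means the treatment moved ratings toward the female teacher.
    shift = round(treatment - control, 2)
    print(f"{course}: control gap {control:+.2f}, "
          f"treatment gap {treatment:+.2f}, shift {shift:+.2f}")
```

In both courses the gap shifts toward the female teacher by roughly 0.4 points. Under the equal-teachers assumption, that shift shrinks the gap in American politics but widens it in biology.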
---
I did not see any data reported in the PLOS ONE article that can resolve whether the treatment mitigated, exacerbated, or did not affect gender bias in the student evaluations of the biology teachers or the American politics teachers. The article's claim about mitigating bias is, by my read of the article, rooted in the "decidedly mixed" (p. 2) literature and, in particular, in the article's reference 5, MacNell et al. 2015. For example, from Peterson et al. 2019:
"These effects [from the PLOS ONE experiment] were substantial in magnitude; as much as half a point on a five-point scale. This effect is comparable with the effect size due to gender bias found in the literature [5]."
The MacNell et al. 2015 sample was students evaluating assistant instructors for an online course, with sample sizes for the four cells (actual instructor gender X perceived instructor gender) of 8, 12, 12, and 11. That's the basis for "the effect size due to gender bias found in the literature": a non-trivially underpowered experiment with 43 students across four cells evaluating *assistant* instructors in an *online* course.
It seems reasonable that, before college or university departments use the Peterson et al. 2019 treatment, more research should assess whether the treatment mitigates, exacerbates, or does not change gender bias in student evaluations in the situations in which the treatment is used. For what it's worth, the gender difference has been reported to be about 0.13 on a five-point scale based on a million or so Rate My Professors evaluations, a difference that was illustrated as 168 additional steps in a 5,117-step day. If the true gender bias in student evaluations were 0.13 units against women, the roughly 0.4-unit or 0.5-unit Peterson et al. 2019 treatment effect would have overshot that bias, leaving a bias of roughly 0.3 or 0.4 units in favor of women, and would thus have exacerbated gender bias in student evaluations of teaching.
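The arithmetic behind that last sentence can be laid out explicitly. This is a minimal sketch under the hypothetical stated above (a true bias of 0.13 units against women); the function names and the sign convention (negative means bias against women) are assumptions introduced here for illustration.

```python
# Hypothetical true bias from the paragraph above: 0.13 units against women.
TRUE_BIAS = -0.13

# Rough treatment effects on female instructors' ratings, in rating units.
TREATMENT_EFFECTS = (0.4, 0.5)


def residual_bias(true_bias, treatment_effect):
    """Bias left over once the treatment boost is added to the true bias."""
    return true_bias + treatment_effect


def verdict(true_bias, treatment_effect):
    """'exacerbated' if the leftover bias is larger in size than the original."""
    leftover = residual_bias(true_bias, treatment_effect)
    return "exacerbated" if abs(leftover) > abs(true_bias) else "mitigated"


for effect in TREATMENT_EFFECTS:
    leftover = residual_bias(TRUE_BIAS, effect)
    print(f"treatment effect {effect:.1f}: residual bias {leftover:+.2f} "
          f"({verdict(TRUE_BIAS, effect)})")
```

Under this hypothetical, either treatment effect flips the sign of the bias and leaves it larger in absolute size than the 0.13 it started from.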
---
NOTES:
1. Thanks to Dave Peterson for comments.
2. From what I can tell, if the treatment truly mitigated gender bias among students evaluating the biology teachers, that would mean that the male biology teacher truly did a worse job teaching than the female biology teacher did.
3. I created an index combining the v19, v20, and v23 items, which respectively are the overall evaluation of teaching, a rating of teaching effectiveness, and the overall evaluation of the course. Here are the mean student ratings on the index for the American politics instructors:
4.56 Male American politics teacher CONTROL
4.21 Female American politics teacher CONTROL
4.36 Male American politics teacher TREATMENT
4.46 Female American politics teacher TREATMENT
So, for the American politics teachers, the control had a 0.35 disadvantage for the female teacher (p=0.07), but the treatment had a 0.10 advantage for the female teacher (p=0.59). But here are the means for the biology teachers:
3.67 Male biology teacher CONTROL
3.90 Female biology teacher CONTROL
3.64 Male biology teacher TREATMENT
4.39 Female biology teacher TREATMENT
So, for the biology teachers, the control had a 0.23 disadvantage for the male teacher (p=0.35), and the treatment had a 0.75 disadvantage for the male teacher (p<0.01).
4. Regarding MacNell et al. 2015 being underpowered, if we use the bottom right cell of MacNell et al. 2015 Table 2 to produce a gender bias estimate of 0.50 standard deviations, the statistical power was 36% for an experiment with 20 student evaluations of instructors who were a woman or a man pretending to be a woman and 23 student evaluations of instructors who were a man or a woman pretending to be a man. If the true effect of gender bias in student evaluations is, say, 0.25 standard deviations, then the MacNell et al. study had a 13% chance of detecting that effect.
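R code:
# Power of a two-sample t-test with unequal group sizes (pwr package)
library(pwr)
# Power to detect d = 0.50 with groups of 20 and 23
pwr.t2n.test(n1=20, n2=23, d=0.50, sig.level=0.05)
# Power to detect d = 0.25 with the same groups
pwr.t2n.test(n1=20, n2=23, d=0.25, sig.level=0.05)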
5. Stata code:
* Overall evaluation of teaching
ttest v19 if bio==0 & treatment==0, by(female)
ttest v19 if bio==0 & treatment==1, by(female)
ttest v19 if bio==1 & treatment==0, by(female)
ttest v19 if bio==1 & treatment==1, by(female)
* Teaching effectiveness
ttest v20 if bio==0 & treatment==0, by(female)
ttest v20 if bio==0 & treatment==1, by(female)
ttest v20 if bio==1 & treatment==0, by(female)
ttest v20 if bio==1 & treatment==1, by(female)
* Overall evaluation of the course
ttest v23 if bio==0 & treatment==0, by(female)
ttest v23 if bio==0 & treatment==1, by(female)
ttest v23 if bio==1 & treatment==0, by(female)
ttest v23 if bio==1 & treatment==1, by(female)
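* Descriptive statistics and pairwise correlations for the three items
sum v19 v20 v23
pwcorr v19 v20 v23
* Principal-component factor analysis of the three items
factor v19 v20 v23, pcf
* Index: the mean of the three items
gen index = (v19 + v20 + v23)/3
sum index v19 v20 v23
* Index by instructor gender, within each course and condition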
ttest index if bio==0 & treatment==0, by(female)
ttest index if bio==0 & treatment==1, by(female)
ttest index if bio==1 & treatment==0, by(female)
ttest index if bio==1 & treatment==1, by(female)
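As a rough cross-check on the power figures in note 4, the same quantities can be approximated by simulation. This is a sketch under stated assumptions rather than a replacement for the pwr calculation: it assumes normally distributed ratings with equal variances, uses a pooled-variance t-test, and hard-codes an approximate two-sided 5% critical value for 41 degrees of freedom. The function names, the seed, and the number of replications are choices made here for illustration.

```python
import random


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def pooled_t(x, y):
    """Student's two-sample t statistic with pooled variance."""
    n1, n2 = len(x), len(y)
    mean_x, var_x = mean_and_variance(x)
    mean_y, var_y = mean_and_variance(y)
    pooled_var = ((n1 - 1) * var_x + (n2 - 1) * var_y) / (n1 + n2 - 2)
    standard_error = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (mean_x - mean_y) / standard_error


def simulated_power(n1, n2, d, t_crit, reps=20000, seed=1):
    """Share of simulated experiments in which |t| exceeds the critical value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(d, 1) for _ in range(n1)]  # group shifted by d SDs
        y = [rng.gauss(0, 1) for _ in range(n2)]  # comparison group
        if abs(pooled_t(x, y)) > t_crit:
            hits += 1
    return hits / reps


# Approximate two-sided 5% critical value of t with 20 + 23 - 2 = 41 df.
T_CRIT_DF41 = 2.0195

power_half = simulated_power(20, 23, 0.50, T_CRIT_DF41)
power_quarter = simulated_power(20, 23, 0.25, T_CRIT_DF41)
print(f"simulated power at d=0.50: {power_half:.2f}")
print(f"simulated power at d=0.25: {power_quarter:.2f}")
```

The simulated values should land near the 36% and 13% reported in note 4, with some Monte Carlo noise.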