In this post, I discussed the possibility that "persons at the lower levels of hostile sexism are nontrivially persons who are sexist against men". Brian Schaffner provides more information on this possibility in the paper "How Political Scientists Should Measure Sexist Attitudes". I'll place Figure 2 from the paper below:

From the paper's discussion of Figure 2 (p. 14):

The plot on the right shows the modest influence of hostile sexism on moderating the gender treatment in the politician conjoint. Subjects in the bottom third of the hostile sexism distribution were about 10 points more likely to select the female profile, a difference that is statistically significant (p=.005). However, the treatment effect was small and not statistically significant among those in the middle and top terciles.

From what I can tell, this evidence suggests that the proper interpretation of the hostile sexism scale is not as a measure of sexism against women but as a measure of male/female preference, with participants who prefer men sorted to high levels of the measure and participants who prefer women sorted to low levels of the measure. If hostile sexism were a proper linear measure of sexism against women, low values of hostile sexism would predict equal treatment of men and women and higher levels would predict favoritism of men over women.


"Evidence of Bias in Standard Evaluations of Teaching" (Mirya Holman, Ellen Key, and Rebecca Kreitzer, 2019) has been cited as evidence of bias in student evaluations of teaching.

I am familiar with Mitchell and Martin 2018, so let's check how that study is summarized in the list, as archived on 20 November 2019. I count three substantive errors and one spelling error in the summary, highlighted below, not counting the "fgender" typo in the header or the singular "RateMyProfessor":

The summary referred to the online courses as being from different universities, but all of the online courses in the Mitchell and Martin 2018 analysis were at the same university. The summary referred to "female instructors" and "male professors", but the Mitchell and Martin 2018 analysis compared comments and evaluations for only one female instructor to comments and evaluations for only one male instructor. The summary indicated that female instructors were evaluated differently in intelligence, but no Mitchell and Martin 2018 table reported a statistical significance asterisk for the Intelligence/Competency category.

---

The aforementioned errors in the summary of Mitchell and Martin 2018 can be easily fixed, but that would not address a flaw in a particular use of the list: from what I can tell, Mitchell and Martin 2018 has errors that undercut the inference that students use different language when evaluating female instructors than when evaluating male instructors. A list of studies built on an uncritical reading of their results shouldn't be convincing evidence of bias in student evaluations of teaching, especially if the categorizing of studies does not indicate whether "bias" is operationalized as an unfair difference or as a mere difference.

I think there would be value in a version of "Evidence of Bias in Standard Evaluations of Teaching" that accurately summarizes each study that has tested for unfair bias in student evaluations of teaching using a research design with internal validity and plausibly sufficient statistical power, especially if each summary were coupled with a justification of why the study provides credible evidence about unfair bias in student evaluations of teaching. But I don't see why anyone should be convinced by "Evidence of Bias in Standard Evaluations of Teaching" in its current form.


This post discusses whether the lowest levels of hostile sexism are properly understood as indicating the lowest measured levels of sexism.

---

Barnes et al. 2018 "Sex and Corruption: How Sexism Shapes Voters' Responses to Scandal" in Politics, Groups, and Identities (ungated) reported results from an experiment that had an outcome variable with four levels, from "very unlikely" to "very likely", in which participants could indicate how likely the participant would be to vote for a hypothetical representative in the next election. Treatments were the representative's sex (man or woman) and the type of scandal that the representative had been involved in (corruption or sex).

Hostile sexism (Glick and Fiske 1996) was measured with three items:

  1. Women are too easily offended
  2. Most women fail to appreciate all that men do for them
  3. Women exaggerate their problems

Below is the hostile sexism panel for the sex scandal condition, from Barnes et al. 2018 Figure 2. The right side of the panel suggests that participants at the highest levels of hostile sexism were biased against women. But the left side of the panel suggests that participants at the lowest levels of hostile sexism were biased against men.

That low levels of hostile sexism do not indicate the absence of sexism seems plausible given that, in the article, the lowest level of hostile sexism for participants responding to all hostile sexism items required participants to disagree as much as possible on a 7-point scale with mildly negative statements about women, such as the statement that "Most women fail to appreciate all that men do for them". Strong disagreement with this statement is equivalent to expressing the view that most women appreciate all that men do for them, and it seems at least possible that persons with such a positive view of women might be unfairly biased in favor of women. Another way to think of it is that persons unfairly biased in favor of women must fall somewhere on the hostile sexism measure, and it seems plausible that these persons would place themselves at or toward the lower end of the measure.
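
For concreteness, below is a minimal sketch of one plausible coding of a 0-to-1 hostile sexism index, assuming hypothetical item variables hs1 through hs3 on 1-to-7 agreement scales; the article's exact coding might differ.

* Hypothetical item names and coding; the authors' actual coding might differ:
gen hs_index = ((hs1 - 1)/6 + (hs2 - 1)/6 + (hs3 - 1)/6)/3
* A participant who disagrees as much as possible (1) with all three items
* scores 0; a participant who agrees as much as possible (7) scores 1.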

"Sex and Corruption" co-author Emily Bacchus sent me data and code for the article, and these data indicate that the patterns for the dichotomous "very unlikely" outcome variable in the above plot hold when the outcome variable is coded with all four measured levels of vote likelihood, as in the plot below, in which light blue dots are for the male candidate and pink dots are for the female candidate:

Further analysis suggested that, in the sex scandal plot, much or all of the modeled discrimination against men at the lower levels of hostile sexism is due to the linear model and a relatively large discrimination against women at higher levels of hostile sexism. For example, for levels of hostile sexism from 0.75 through 1, there was a 0.75-unit difference disfavoring the female candidate (Ns of 20 and 32, p<0.01), although only 4 participants scored a 1 for hostile sexism. For levels of hostile sexism from 0 through 0.25, there was a 0.20-unit difference disfavoring the male candidate (Ns of 95 and 94, p=0.07); for levels of hostile sexism at exactly 0, there was a 0.09-unit difference disfavoring the male candidate (Ns of 35 and 28, p=0.70); and for levels of hostile sexism from 0.25 through 0.75, there was a 0.05-unit difference disfavoring the male candidate (Ns of 169 and 155, p=0.57).
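
These subgroup comparisons can be sketched with t-tests along the lines below, using hypothetical variable names that are my own and not from the shared code: vote (the four-level vote likelihood), femcand (an indicator for the female candidate), and hs (the 0-to-1 hostile sexism index), with the sample restricted to the sex scandal condition.

* Subgroup comparisons by hostile sexism level (hypothetical variable names):
ttest vote if hs >= 0.75 & hs <= 1, by(femcand)
ttest vote if hs >= 0 & hs <= 0.25, by(femcand)
ttest vote if hs == 0, by(femcand)
ttest vote if hs >= 0.25 & hs <= 0.75, by(femcand)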

---

Recent political science research that I am familiar with that has used a hostile sexism measure has, I think, at least implied that lower levels of hostile sexism are normatively good. For example, the Barnes et al. 2018 article discussed "individuals who hold sexist attitudes" (p. 14, implying that some participants did not hold sexist attitudes), and a plot in Luks and Schaffner 2019 labeled the low end of a hostile sexism measure as "least sexist". However, it is possible that persons at the lower levels of hostile sexism are nontrivially persons who are sexist against men. I don't think that this possibility can be conclusively accepted or rejected based on the Barnes et al. 2018 data, but I do think that it matters whether the proper labeling of the low end of hostile sexism is "least sexist" or is "most sexist against men", to the extent that such unambiguous labels can be properly used for the lower end of the hostile sexism measure.

---

NOTES

Thanks to Emily Beaulieu and her co-authors for comments and for sharing data and code, and thanks to Peter Glick and Susan Fiske for comments.


On October 27, 2019, U.S. Representative Katie Hill announced her resignation from Congress after her involvement in a sex scandal, claiming that she was leaving "because of a double standard".

There is a recently published article that reports on an experiment that can be used to assess such a double standard among the public, at least among an MTurk sample of more than 1,000 participants, about 45% of whom were women: Barnes et al. 2018 "Sex and corruption: How sexism shapes voters' responses to scandal" in Politics, Groups, and Identities (ungated). Participants in the Barnes et al. 2018 experiment indicated on a four-point scale how likely they would be to vote for a representative in the next election; the experiment manipulated the hypothetical U.S. Representative's sex (man or woman) and the type of scandal that the representative had been involved in (corruption or sex).

Results in Barnes et al. 2018 Figure 1 indicated that participants assigned to the female representative involved in the sex scandal were not less likely to vote for that representative than participants assigned to the male representative involved in the sex scandal were to vote for him.

---

The Monkey Cage published a post by Michael Tesler, entitled "Was Rep. Katie Hill held to a higher standard than men in Congress? This research suggests she was". The post did not mention the Barnes et al. 2018 experiment.

---

Mischiefs of Faction published a post by Gregory Koger and Jeffrey Lazarus that did mention the Barnes et al. 2018 experiment, but the post did not mention the null finding across the full sample. The post instead mentioned the finding of a correlate of relative disfavoring of the female candidate (links omitted in the quoted passage below):

One answer is that there is sexist double standard for female politicians. One recently published article (ungated) by Tiffany Barnes, Emily Beaulieu, and Gregory Saxton finds that citizens are more likely to disapprove of a sex scandal by a female politician if they a) generally disapprove of women "usurping men's power," or b) see themselves as protectors of women, with protection contingent upon conformity to traditional gender roles. Both dynamics help explain why alleged House-rule-breaker Hill is resigning, while alleged federal-lawbreaker Hunter was reelected in 2018 and shows no interest in resigning.

The Koger/Lazarus post doesn't explain why these correlates are more important than the result among all participants or, for that matter, more important than the dynamic in Barnes et al. 2018 Figure 2 among participants with low hostile sexism scores.

The Koger/Lazarus post suggests that the Barnes et al. 2018 experiment detected a correlation between relative disfavoring of the female politician involved in a sex scandal and participant responses to a benevolent sexism scale (the "b" part of the passage quoted above). I don't think that is a correct description of the results: see Barnes et al. 2018 Table 1, Barnes et al. 2018 Figure 2, and/or the Barnes et al. 2018 statement that "Participants are thus unlikely to differentiate between the sex of the representative when responding to allegations about the representative's involvement in a sex scandal, regardless of the participant's level of benevolent sexism" (p. 13).

For what it's worth, the Barnes et al. 2018 abstract can be read as suggesting that the experiment did detect a bias among persons with high scores on a benevolent sexism scale.

---

Barnes et al. 2018 is a recently published large-sample experiment that found that, in terms of vote likelihood, participants assigned to a hypothetical female U.S. Representative involved in a sex scandal treated that female representative remarkably similarly to the way in which participants assigned to the hypothetical male representative involved in a sex scandal treated that male representative. This result is not mentioned in two political science blog posts discussing the claim of a gender double standard made by a female U.S. Representative involved in a sex scandal.


The Peterson et al. 2019 PLOS ONE article "Mitigating gender bias in student evaluations of teaching" reported on an experiment conducted with students across four Spring 2018 courses: an introduction to biology course taught by a female instructor, an introduction to biology course taught by a male instructor, an introduction to American politics course taught by a female instructor, and an introduction to American politics course taught by a male instructor. Students completing evaluations of these teachers were randomly assigned to receive or to not receive a statement about how student evaluations of teachers are often biased against women and instructors of color.

The results clearly indicated that "this intervention improved the SET scores for the female faculty" (p. 8). But that result alone doesn't establish the mitigation of bias referenced in the article's title because, as the article indicates, "It is also possible that the students with female instructors who received the anti-bias language overcompensated their evaluations for the cues they are given" (p. 8).

---

For the sake of illustration, let's assume that the two American politics teachers were equal to each other and that the two biology teachers were equal to each other; if so, data from the Peterson et al. 2019 experiment for the v19 overall evaluation of teaching item illustrate how the treatment can both mitigate and exacerbate gender bias in student evaluations.

Here are the mean student ratings on v19 for the American politics instructors:

4.65     Male American politics teacher CONTROL

4.17     Female American politics teacher CONTROL

4.58     Male American politics teacher TREATMENT

4.53     Female American politics teacher TREATMENT

So, for the American politics teachers, the female teacher had a 0.49-point disadvantage in the control condition (p=0.02) but only a 0.05-point disadvantage in the treatment condition (p=0.79). But here are the means for the biology teachers:

3.72     Male biology teacher CONTROL

4.02     Female biology teacher CONTROL

3.73     Male biology teacher TREATMENT

4.44     Female biology teacher TREATMENT

So, for the biology teachers, the male teacher had a 0.29-point disadvantage in the control condition (p=0.25) and a 0.71-point disadvantage in the treatment condition (p<0.01).
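
One way to formalize these comparisons, as my addition rather than an analysis from the article, is a difference-in-differences regression in which the interaction term estimates how much the treatment changed the female/male rating gap within each pair of courses:

* Did the treatment change the female-male rating gap? (my addition, using
* the variable names from the t-tests in the notes below)
reg v19 i.female##i.treatment if bio==0 // American politics courses
reg v19 i.female##i.treatment if bio==1 // biology courses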

---

I did not see any data reported in the PLOS ONE article that can resolve whether the treatment mitigated or exacerbated or did not affect gender bias in the student evaluations of the biology teachers or the American politics teachers. The article's claim about addressing the mitigation of bias is, by my read of the article, rooted in the "decidedly mixed" (p. 2) literature and, in particular, in their reference 5, MacNell et al. 2015. For example, from Peterson et al. 2019:

These effects [from the PLOS ONE experiment] were substantial in magnitude; as much as half a point on a five-point scale. This effect is comparable with the effect size due to gender bias found in the literature [5].

The MacNell et al. 2015 sample was students evaluating assistant instructors for an online course, with sample sizes for the four cells (actual instructor gender X perceived instructor gender) of 8, 12, 12, and 11. That's the basis for "the effect size due to gender bias found in the literature": a non-trivially underpowered experiment with 43 students across four cells evaluating *assistant* instructors in an *online* course.

It seems reasonable that, before college or university departments use the Peterson et al. 2019 treatment, there should be more research assessing whether the treatment mitigates, exacerbates, or does not change gender bias in student evaluations in the situations in which the treatment would be used. For what it's worth, the gender difference has been reported to be about 0.13 on a five-point scale based on a million or so Rate My Professors evaluations, a difference that the reporting illustrated as 168 additional steps in a 5,117-step day. If the true gender bias in student evaluations were 0.13 units against women, the roughly 0.4-unit or 0.5-unit Peterson et al. 2019 treatment effect would have exacerbated gender bias in student evaluations of teaching.
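
To make the arithmetic explicit: if women's ratings start 0.13 units lower because of bias and the treatment raises female instructors' ratings by roughly 0.4 to 0.5 units, the resulting imbalance disfavoring men would be larger in magnitude than the original imbalance disfavoring women.

* Post-treatment imbalance if the true anti-woman bias is 0.13 units:
di 0.40 - 0.13 // 0.27 units, now disfavoring men
di 0.50 - 0.13 // 0.37 units, now disfavoring men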

---

NOTES:

1. Thanks to Dave Peterson for comments.

2. From what I can tell, if the treatment truly mitigated gender bias among students evaluating the biology teachers, that would mean that the male biology teacher truly did a worse job teaching than the female biology teacher did.

3. I created an index combining the v19, v20, and v23 items, which respectively are the overall evaluation of teaching, a rating of teaching effectiveness, and the overall evaluation of the course. Here are the mean student ratings on the index for the American politics instructors:

4.56     Male American politics teacher CONTROL

4.21     Female American politics teacher CONTROL

4.36     Male American politics teacher TREATMENT

4.46     Female American politics teacher TREATMENT

So, for the American politics teachers, the female teacher had a 0.35-point disadvantage in the control condition (p=0.07) but a 0.10-point advantage in the treatment condition (p=0.59). But here are the means for the biology teachers:

3.67     Male biology teacher CONTROL

3.90     Female biology teacher CONTROL

3.64     Male biology teacher TREATMENT

4.39     Female biology teacher TREATMENT

So, for the biology teachers, the male teacher had a 0.23-point disadvantage in the control condition (p=0.35) and a 0.75-point disadvantage in the treatment condition (p<0.01).

4. Regarding MacNell et al. 2015 being underpowered, if we use the bottom right cell of MacNell et al. 2015 Table 2 to produce a gender bias estimate of 0.50 standard deviations, the statistical power was 36% for an experiment with 20 student evaluations of instructors who were a woman or a man pretending to be a woman and 23 student evaluations of instructors who were a man or a woman pretending to be a man. If the true effect of gender bias in student evaluations is, say, 0.25 standard deviations, then the MacNell et al. study had a 13% chance of detecting that effect.

R code:

library(pwr)

# Power for a two-sample t-test with unequal group sizes:
pwr.t2n.test(n1=20, n2=23, d=0.50, sig.level=0.05) # power of about 36%

pwr.t2n.test(n1=20, n2=23, d=0.25, sig.level=0.05) # power of about 13%

5. Stata code:
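
* Variable coding, as I infer it from the comparisons reported above:
* bio: 1 = biology course, 0 = American politics course
* treatment: 1 = received the anti-bias statement, 0 = control
* female: 1 = female instructor, 0 = male instructor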

* Overall evaluation of teaching

ttest v19 if bio==0 & treatment==0, by(female)

ttest v19 if bio==0 & treatment==1, by(female)

ttest v19 if bio==1 & treatment==0, by(female)

ttest v19 if bio==1 & treatment==1, by(female)

* Teaching effectiveness

ttest v20 if bio==0 & treatment==0, by(female)

ttest v20 if bio==0 & treatment==1, by(female)

ttest v20 if bio==1 & treatment==0, by(female)

ttest v20 if bio==1 & treatment==1, by(female)

* Overall evaluation of the course

ttest v23 if bio==0 & treatment==0, by(female)

ttest v23 if bio==0 & treatment==1, by(female)

ttest v23 if bio==1 & treatment==0, by(female)

ttest v23 if bio==1 & treatment==1, by(female)

 

sum v19 v20 v23

pwcorr v19 v20 v23 // pairwise correlations among the three items

factor v19 v20 v23, pcf // principal-component factoring as a unidimensionality check

gen index = (v19 + v20 + v23)/3 // simple mean of the three items

sum index v19 v20 v23

 

ttest index if bio==0 & treatment==0, by(female)

ttest index if bio==0 & treatment==1, by(female)

ttest index if bio==1 & treatment==0, by(female)

ttest index if bio==1 & treatment==1, by(female)


In the 2019 PS: Political Science & Politics article "How Many Citations to Women Is 'Enough'? Estimates of Gender Representation in Political Science", Michelle L. Dion and Sara McLaughlin Mitchell address a question about "the normative standard for the amount women should be cited" (p. 1).

The first proposed Dion and Mitchell 2019 measure is the proportion of female members of the American Political Science Association (APSA) by section and primary field, using data from 2018. According to Dion and Mitchell 2019: "When political scientists compose course syllabi, graduate reading lists, and research bibliographies, these membership data provide guidance about the minimum representation of scholarship by women that should be included to be representative by gender" (p. 3).

But is APSA section membership in 2018 a reasonable benchmark for gender representation in course syllabi that include readings from throughout history?

Hardt et al. 2019 reported on data for readings assigned in the training of political science graduate students. Below are percentages of graduate student readings in these data that had a female first author:

Time Period     Female First Author %
Before 1970     3.5%
1970 to 1979    6.7%
1980 to 1989    11.3%
1990 to 1999    15.7%
2000 to 2009    21.0%
2010 to 2018    24.6%

So the pattern is increasing representation of women over time. If this pattern reflects increasing representation of women over time in APSA section membership, or increasing representation of women among the set of researchers whose research interests include the topic of a particular section, then APSA section membership data from 2018 will overstate the percentage of women needed to ensure fair gender representation on syllabi or research bibliographies. For illustrative purposes, if a section had 20% women across the 1990s, 30% women across the 2000s, and 40% women across the 2010s, a fair "section membership" benchmark for gender representation on syllabi would not be 40%; rather, it would be something like 20% women for syllabi readings from the 1990s, 30% women for syllabi readings from the 2000s, and 40% women for syllabi readings from the 2010s.
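
Here is a minimal sketch of that time-matched arithmetic, using the hypothetical section above and assumed (hypothetical) shares of syllabi readings drawn from each decade:

* Weight each decade's female membership share by the (hypothetical) share
* of syllabi readings drawn from that decade: 30% from the 1990s, 40% from
* the 2000s, and 30% from the 2010s.
di .30*20 + .40*30 + .30*40 // = 30%, not the 40% implied by 2018 membership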

---

Dion and Mitchell 2019 propose another measure that is biased in the same direction and for the same reason: the gender distribution of authors by journal from 2007 to 2016 inclusive, for available years.

About 68% of readings in the Hardt et al. 2019 graduate training readings data were published before 2007: 15% of these pre-2007 readings had a female first author, but 24% of the 2007-2016 readings in the data did.

Older readings appear in the Hardt et al. 2019 readings data with decent frequency: 42% of readings that had the gender of the first author coded were published before 2000. However, the Dion and Mitchell 2019 measure of journal representation from 2007 to 2016 ignores these older readings, which produces a measure biased in favor of women, if fair representation means representation that matches the relevant pool of syllabi-worthy journal articles.
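
As a rough worked example using the percentages above, weighting the pre-2007 and 2007-2016 female first-author shares by their shares of the readings data gives a pooled figure well below the 2007-2016 figure alone:

* Pooled female first-author share implied by the readings data:
di .68*15 + .32*24 // about 17.9%, versus 24% for 2007-2016 readings alone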

---

In a sense, this bias in the Dion and Mitchell 2019 measures might not matter much if the measures are used in the biased manner that Dion and Mitchell 2019 proposed (p. 6):

We remedy this gap by explicitly providing conservative estimates of gender diversity based on organization membership and journal article authorship for evaluating gender representation. Instructors, researchers, and editors who want to ensure that references are representative can reference these as floors (rather than ceilings) for minimally representative citations.

The Dion and Mitchell 2019 suggestion above is that instructors, researchers, and editors who want to ensure that references are representative use a conservative estimate as a floor. Both the conservative nature of the estimate and its use as a floor would produce a bias favoring women, so I'm not sure how that is helpful for instructors, researchers, and editors who want to ensure that references are representative.

---

NOTE:

1. Stata code for the analysis of the Hardt et al. 2019 data:

* Female first-author share by decade of publication:

tab female1 if year<1970

tab female1 if year>=1970 & year<1980

tab female1 if year>=1980 & year<1990

tab female1 if year>=1990 & year<2000

tab female1 if year>=2000 & year<2010

tab female1 if year>=2010 & year<2019

 

* Female first-author share overall and for pre-2000 readings:

tab female1

tab female1 if year<2000

di 36791/87398 // about 42% of gender-coded readings were published before 2000


"The Gender Readings Gap in Political Science Graduate Training" by Heidi Hardt, Amy Erica Smith, Hannah June Kim, and Philippe Meister was recently published in the Journal of Politics and featured in a Monkey Cage blog post. The Kim Yi Dionne header for the Monkey Cage post indicated that:

Throughout academia, including in political science, women haven't achieved parity with men. As this series explores, implicit bias holds women back at every stage, from the readings professors assign to the student evaluations that influence promotions and pay, from journal publications to book awards.

The abstract to the JOP article indicates that "Introducing a unique data set of 88,673 citations from 905 PhD syllabi and reading lists, we find that only 19% of assigned readings have female first authors". This 19% for assigned readings is lower than the 21.5% of publications in the top three political science journals between 2000 and 2015 (bottom of page 2 of the JOP article). However, the 19% is based on assigned readings published at any time in history, including authors such as Plato and Sun Tzu. My analysis of the data for the article indicated that 22% of assigned readings have female first authors when the assigned readings are limited to readings published from 2000 to 2015 inclusive. The top three publications benchmark therefore produces an estimate of the gender readings gap in political science graduate training for 2000-2015 publications that is less than one percentage point and, if anything, trivially advantages women.
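
The arithmetic for the gap under each time frame:

* Gap relative to the 21.5% top-three-journals benchmark:
di 21.5 - 19 // 2.5 percentage points, using readings from all time periods
di 21.5 - 22 // -0.5 points for 2000-2015 readings: readings exceed the benchmark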

Figure 1 in the Hardt et al. JOP article reports percentages by subfield, with benchmarks for published top works, which I think are articles in top 10 journals; the first and third numeric columns in the table below are data reported in Figure 1. Using the benchmark for published top works, my analysis limiting the assigned readings to readings published from 2000 to 2015 inclusive (the middle numeric column) produced a difference greater than 1 percentage point disadvantaging female first authors for only one of the five subfields with benchmark data (comparative politics):

Topic               % Female 1st Author    % Female 1st Author    % Female 1st Author
                    Readings (All Time)    Readings (2000-2015)   Top Pubs (2000-2015)
Methodology         11.57                  13.64                  11.36
Political Economy   16.75                  18.03                  NA
American            15.66                  18.46                  19.07
Comparative         20.55                  23.26                  28.76
IR                  19.96                  23.41                  22.42
Theory              25.05                  31.58                  29.39

For the subfield most relevant to my work, the Hardt et al. Figure 1 gender gap for American politics is 3.41 percentage points (15.66 compared to 19.07), but falls to 0.61 percentage points (18.46 compared to 19.07) when the time frame of the assigned readings is set to the 2000-2015 time frame of the top publications benchmark. Invoking an implicit bias that holds women back might be premature when the data indicate a gap of less than 1 percentage point in an analysis that does not include relevant control variables, such as any gender gap in how "syllabus-worthy" the publications within the set of top publications are. The 5.50 percentage point gender gap for comparative politics might be large enough to consider implicit bias in that subfield, but that is a localized concern.

---

NOTES

1. [*] The post title alludes to this tweet.

2. The only first authors coded female before 1776 are Titus Livy and Sun Tzu (tab surname1 if female1==1 & year<1776).

3. Code below:

* Insert this command into the Hardt et al. do file after Line 11 ("use 'Hardt et al. JOP_Replication data.dta', clear"):
keep if year>=2000 & year<=2015

* Insert these commands into the Hardt et al. do file after new Line 124 ("tab1 gender1 if gender1 < 3 [aweight=wt] // THE TOPLINE RESULTS WE REPORT EXCLUDE THOSE 304 OBSERVATIONS"):
tab1 gender1 if gender1 < 3 [aweight=wt] // This should report 21.86%
tab1 gender1 if gender1 < 3 // This should report 22.20%

* Insert this command into the Hardt et al. do file before new Line 184 ("restore"):
tab topic mn

* Run the Hardt+et+al.+JOP_Replication+code-1.do file through new Line 126 ("tab1 gender1 if gender1 < 3 // This should report 22.20%"). These data indicate that, of first authors coded male or female, about 22% were female.

* Run new Lines 127 through 184 ("tab topic mn"). New Line 184 should output data for the middle column in the table in this post. See the "benchmark_teelethelen" lines for data for the right column in the table.
