In May 2020, PS published a correction to Mitchell and Martin 2018, "Gender Bias in Student Evaluations," reflecting concerns that I had raised in a March 2019 blog post. That correction didn't mention me. The same month, PS published a second correction that was also due to my work and also didn't mention me. Below, I note evidence that the corrections were due to my work, which might be useful in documenting my scholarly contributions for, say, an end-of-year review or promotion application.

---

In August 2018, I alerted the authors of Mitchell and Martin 2018 (hereafter MM) to concerns about potential errors in MM. I'll post one of my messages below. My sense at the time was that the MM authors were not going to correct MM (the lead author of MM was defending MM as late as June 2019), so I published a March 2019 blog post about my concerns, and in April 2019 I emailed PS a link to my blog post along with a suggestion that MM "might have important errors in inferential statistics that warrant a correction".

In May 2019, a PS editor indicated to me that the MM authors had chosen not to issue a correction and that PS invited me to submit a comment on MM that would pass through the normal peer review process. I transformed my blog post into a comment manuscript, which involved, among other things, coding all open-ended student evaluation comments and calculating what I thought the correct results should be in the three main MM tables. Moreover, for completeness, I contacted Texas Tech University and eventually filed a Public Information Act request, because no one I communicated with at Texas Tech about this knew for certain why student evaluation data were not available online for certain sections of the course that MM Table 4 reported student evaluation results for.

I submitted a comment manuscript to PS in August 2019 and submitted a revision based on editor feedback in September 2019. Here is the revised submitted manuscript. In January 2020, I received an email from PS indicating that my manuscript was rejected after peer review and that PS would request a corrigendum from the authors of MM.

In May 2020, PS published a correction to MM, but I don't think that the correction is complete: for example, as I discussed in my blog post and manuscript comment, I think that the inferential statistics in MM Table 4 were incorrectly based on a calculation in which multiple ratings from the same student were treated as independent ratings.
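To illustrate why the non-independence matters, here is a minimal sketch using the Kish design effect, deff = 1 + (m − 1)ρ, where m is the number of ratings per student and ρ is the intra-student correlation. The ρ value below is a hypothetical illustration, not an estimate from the MM data:

```python
def effective_n(n_ratings: int, ratings_per_student: int, rho: float) -> float:
    """Effective sample size under the Kish design effect,
    deff = 1 + (m - 1) * rho, for m ratings per student with
    intra-student correlation rho."""
    deff = 1 + (ratings_per_student - 1) * rho
    return n_ratings / deff

# MM Table 4 counts: 153 ratings = 3 ratings x 51 students (Martin),
# 501 ratings = 3 ratings x 167 students (Mitchell).
# rho = 0.5 is assumed here purely for illustration.
print(effective_n(153, 3, 0.5))  # 76.5: roughly half the nominal N
print(effective_n(501, 3, 0.5))  # 250.5
```

With a smaller effective sample size, standard errors grow by a factor of sqrt(deff), so p-values computed from the nominal N = 153 and N = 501 are too small.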

---

For the Comic-Con correction that PS issued in May 2020, I'll quote from my manuscript documenting the error of inference in the article:

I communicated concerns about the Owens et al. 2020 "Comic-Con" article to the first two authors in November 2019. I did not hear of an attempt to publish a correction, and I did not receive a response to my most recent message, so I submitted this manuscript to PS: Political Science & Politics on Feb 4, 2020. PS published a correction to "Comic-Con" on May 11, 2020. PS then rejected my manuscript on May 18, 2020 "after an internal review".

Here is an archive of a tweet thread, documenting that in September 2019 I alerted the lead "Comic-Con" author to the error of inference, and the lead author did not appear to understand my point.

---

NOTES:

1. My PS symposium entry "Left Unchecked" (published online in June 2019) discussed elements of MM that ended up being addressed in the MM correction.

2. Here is an email that I sent the MM authors in August 2018:

Thanks for the data, Dr. Mitchell. I had a few questions, if you don't mind:

[1] The appendix indicates for the online course analysis that: "For this reason, we examined sections in the mid- to high- numerical order: sections 6, 7, 8, 9, and 10". But I think that Dr. Martin taught a section 11 course (D11) that was included in the data.

[2] I am not certain about how to reproduce the statistical significance levels for Tables 1 and 2. For example, for Table 1, I count 23 comments for Dr. Martin and 45 comments for Dr. Mitchell, for the N=68 in the table. But a proportion test in Stata for the "Referred to as 'Teacher'" proportions (prtesti 23 0.152 45 0.244) produces a z-score of -0.8768, which does not seem to match the table asterisks indicating a p-value of p<0.05.

[3] Dr. Martin's CV indicates that he was a visiting professor at Texas Tech in 2015 and 2016. For the student comments for POLS 3371 and POLS 3373, did Dr. Martin's official title include "professor"? If so, then that might influence inferences about any difference in the frequency of student use of the label "professor" between Dr. Martin and Dr. Mitchell. I didn't see "professor" as a title in Dr. Mitchell's CV, but the inferences could also be influenced if Dr. Mitchell had "professor" in her title for any of the courses in the student comments analysis, or for the Rate My Professors comments analysis.

[4] I was able to reproduce the results for the Technology analysis in Table 4, but, if I am correct, the statistical analysis seems to assume that the N=153 for Dr. Martin and the N=501 for Dr. Mitchell are for 153 and 501 independent observations. I do not think that this is correct, because my understanding of the data is that the 153 observations for Dr. Martin are 3 observations for 51 students and that the 501 observations for Dr. Mitchell are 3 observations for 167 students. I think that the analysis would need to adjust for the non-independence of some of the observations.

Sorry if any of my questions are due to a misunderstanding. Thank you for your time.

Best,

L.J
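The proportion test from point [2] of the email can be reproduced outside Stata. This Python sketch implements the same pooled two-sample z-test that Stata's `prtesti` immediate command runs:

```python
import math

def two_sample_prop_z(n1: int, p1: float, n2: int, p2: float) -> float:
    """Pooled two-sample z-test for proportions, matching the
    calculation behind Stata's `prtesti n1 p1 n2 p2`."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Counts from the email: 23 comments for Dr. Martin (15.2% "teacher"),
# 45 comments for Dr. Mitchell (24.4% "teacher").
z = two_sample_prop_z(23, 0.152, 45, 0.244)
print(round(z, 4))  # -0.8768, well short of the |z| > 1.96 needed for p < 0.05
```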


Here is a passage from Pigliucci 2013.

Steele and Aronson (1995), among others, looked at IQ tests and at ETS tests (e.g. SATs, GREs, etc.) to see whether human intellectual performance can be manipulated with simple psychological tricks priming negative stereotypes about a group that the subjects self-identify with. Notoriously, the trick worked, and as a result we can explain almost all of the gap between whites and blacks on intelligence tests as an artifact of stereotype threat, a previously unknown testing situation bias.

Racial gaps are a common and perennial concern in public education, but this passage suggests that such gaps are an artifact. However, when I looked up Steele and Aronson (1995) to discover the evidence for this result, I discovered that the black participants and the white participants in the study were all Stanford undergraduates and that the students' test performances were adjusted for the students' SAT scores. Given that the analysis contained both sample selection bias and statistical control, it does not seem reasonable to make an inference about populations based on that analysis. This error in reporting results for Steele and Aronson (1995) is apparently common enough to deserve its own article.

---

Here's a related passage from Brian at Dynamic Ecology:

A neat example on the importance of nomination criteria for gender equity is buried in this post about winning Jeopardy (an American television quiz show). For a long time only 1/3 of the winners were women. This might lead Larry Summers to conclude men are just better at recalling facts (or clicking the button to answer faster). But a natural experiment (scroll down to the middle of the post to The Challenger Pool Has Gotten Bigger) shows that nomination criteria were the real problem. In 2006 Jeopardy changed how they selected the contestants. Before 2006 you had to self-fund a trip to Los Angeles to participate in try-outs to get on the show. This required a certain chutzpah/cockiness to lay out several hundred dollars with no guarantee of even being selected. And 2/3 of the winners were male because more males were making the choice to take this risk. Then they switched to an online test. And suddenly more participants were female and suddenly half the winners were female. [emphasis added]

I looked up the 538 post linked to in the passage, which reported: "Almost half of returning champions this season have been women. In the year before Jennings's streak, fewer than 1 in 3 winners were female." That passage provides two data points: this season appears to be 2015 (the year of the 538 post), and the year before Jennings's streak appears to be 2003 (the 538 post noted that Jennings's streak occurred in 2004). The 538 post reported that the rule change for the online test occurred in 2006.

So here's the relevant information from the 538 post:

  • In 2003, fewer than 1 in 3 Jeopardy winners were women.
  • In 2006, the selection process was changed to an online test.
  • Presumably in 2015, through early May, almost half of Jeopardy winners have been women.

It does not seem that comparison of a data point from 2003 to a partial data point from 2015 permits use of the descriptive term "suddenly."

It's entirely possible -- and perhaps probable -- that the switch to an online test for qualification reduced gender inequality in Jeopardy winners. But that inference needs more support than the minimal data reported in the 538 post.


This post at Active Learning in Political Science describes a discussion of inequality that followed an unequal distribution of chocolate to students, with the distribution reflecting unequal GDPs among countries:

The students then led a discussion about how the students felt, whether the wealthy students were obligated to give up some of their chocolate, and how they would convince the wealthy students to do so. Violence entered the conversation (jokingly) at one point. Eventually the discussion turned to the real-world implications, and the chocolate was widely shared.

Use of a prop like chocolate has advantages, such as raising students' interest and making the discussion memorable, which likely fosters learning. But the simulation itself clouded or removed many of the features of inequality necessary for a quality discussion of global inequality and aid:

  1. A discussion of inequality among students in the same room diverts attention from impediments to sharing that real countries face: it is nearly costless to pass chocolate to the person next to you, but there is a substantial cost to packaging and shipping goods across the world.
  2. Presumably none of the students had the negative features of a regime like North Korea that would raise questions about whether direct aid might be more harmful than beneficial.
  3. The method of production of the chocolate in the simulation bears no relationship to the method of production for GDP, chocolate, or any good in the real world: countries do not "receive" goods or wealth independent of mechanisms related to the country's natural resources, education or skill level of the population, political choices, history, etc.
  4. The parameters of the simulation ensured that the total amount of chocolate was static, so that the production of more chocolate was not an option for the students.

The problem with simulations such as this is that the focus is placed on the simulated instead of the real.
