One notable finding in the racial discrimination literature is the boomerang/backlash effect reported in Peffley and Hurwitz 2007:

"...whereas 36% of whites strongly favor the death penalty in the baseline condition, 52% strongly favor it when presented with the argument that the policy is racially unfair" (p. 1001).

The racially-unfair argument shown to participants was: "[Some people say/FBI statistics show] that the death penalty is unfair because most of the people who are executed are African Americans" (p. 1002). Statistics reported in Table 1 of Peffley and Hurwitz 2007 indicate that the responses of Whites in the baseline no-argument condition differed at p<=0.05 from the responses of Whites in the argument condition.

However, the boomerang/backlash effect did not appear at p<=0.05 in the large-N MTurk direct and conceptual replication attempts reported in Butler et al. 2017. The effect also did not appear in my analysis of a nearly-direct replication attempt that used a large-N sample of non-Hispanic Whites in a TESS study by Spencer Piston and Ashley Jardina, with data collection by GfK; a similar racial-bias-argument experiment in that study concerning three strikes laws produced a similar null result.

For the weighted TESS data, support for the death penalty for persons convicted of murder was measured on a scale from 0 for strongly oppose to 1 for strongly favor. Support was 0.015 units lower (p=0.313, n=2,018) in the condition in which participants were told "Some people say that the death penalty is unfair because most of the people who are executed are black" than in the condition in which participants did not receive that statement, controlling for the TESS study's main experimental conditions, which appeared earlier in the survey. This lack of statistical significance remained when the weighted sample was limited to liberals and extreme liberals; to slight liberals, liberals, and extreme liberals; to conservatives and extreme conservatives; and to slight conservatives, conservatives, and extreme conservatives. There was also no statistically significant difference between conditions in my analysis of the unweighted data. Regarding missing data, 7 of 1,034 participants in the control condition and 9 of 1,000 participants in the experimental condition did not provide a response.
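
For readers who want to see the shape of that specification, here is a minimal Stata sketch. It is not the code I actually ran (that code is linked in the note below), and the variable names are hypothetical stand-ins for the names in the TESS data file.

```stata
* Minimal sketch of the specification described above, not the code actually used.
* All variable names are hypothetical: dp_support is death penalty support
* rescaled to 0-1, dp_arg is 1 for the racial-unfairness-argument condition and
* 0 for the no-argument condition, main_cond indexes the TESS study's main
* experimental conditions, and weight is the survey weight.
svyset [pweight=weight]
svy: regress dp_support i.dp_arg i.main_cond
```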

Moreover, for the prior item on the survey, on a 0-to-1 scale, support for three strikes laws was 0.013 units higher (p=0.403, n=2,025) in the condition in which participants were told that "...critics argue that these laws are unfair because they are especially likely to affect black people" than in the condition in which participants did not receive that statement, controlling for the TESS study's main experimental conditions, which appeared earlier in the survey. This lack of statistical significance remained when the weighted sample was limited to liberals and extreme liberals; to slight liberals, liberals, and extreme liberals; to conservatives and extreme conservatives; and to slight conservatives, conservatives, and extreme conservatives. There was also no statistically significant difference between conditions in my analysis of the unweighted data. Regarding missing data, 6 of 986 participants in the control condition and 3 of 1,048 participants in the experimental condition did not provide a response.
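
The ideology subgroup analyses mentioned above can be run with Stata's subpopulation option rather than by dropping cases. The sketch below again uses hypothetical variable names, including ts_support and ts_arg for the three strikes outcome and treatment indicator and an ideo variable assumed to be coded from 1 for extreme liberal to 7 for extreme conservative.

```stata
* Sketch of the ideology subgroup analyses; all variable names and the 1-to-7
* coding of ideo (1 = extreme liberal ... 7 = extreme conservative) are assumptions.
svyset [pweight=weight]
svy, subpop(if inlist(ideo, 1, 2)):    regress ts_support i.ts_arg i.main_cond
svy, subpop(if inlist(ideo, 1, 2, 3)): regress ts_support i.ts_arg i.main_cond
svy, subpop(if inlist(ideo, 6, 7)):    regress ts_support i.ts_arg i.main_cond
svy, subpop(if inlist(ideo, 5, 6, 7)): regress ts_support i.ts_arg i.main_cond
```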

Null results might be attributable to participants not paying attention, so it is worth noting that the main treatment in the TESS experiment was that participants in one of the three conditions were given a passage to read entitled "Genes May Cause Racial Difference in Heart Disease" and participants in another of the three conditions were given a passage to read entitled "Social Conditions May Cause Racial Difference in Heart Disease". There was a statistically significant difference between these conditions in responses to an item about whether there are biological differences between blacks and whites (p=0.008, n=2,006), with responses in the Genes condition indicating greater estimates of biological differences between blacks and whites.
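
A sketch of that manipulation check, with the same caveat that the variable names and coding are assumptions (biodiff for the biological-differences item and passage for the three-condition treatment, coded 1 for the Genes passage and 2 for the Social Conditions passage):

```stata
* Sketch of the manipulation check comparing the Genes and Social Conditions
* passages on the biological-differences item; biodiff, passage, and the
* 1 = Genes / 2 = Social Conditions coding are assumptions.
svyset [pweight=weight]
svy, subpop(if inlist(passage, 1, 2)): regress biodiff i.passage
```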

---

NOTE:

Data for the TESS study are available here. My Stata code is available here.

---

This periodically-updated page acknowledges researchers who have shared data or code or who have answered questions about their research. I have tried to acknowledge everyone who provided data, code, or information, but let me know if I missed anyone who should be on the list. The list is chronological, based on the date on which I first received data, code, or information.

Aneeta Rattan for answering questions about and providing data used in "Race and the Fragility of the Legal Distinction between Juveniles and Adults" by Aneeta Rattan, Cynthia S. Levine, Carol S. Dweck, and Jennifer L. Eberhardt.

Maureen Craig for code for "More Diverse Yet Less Tolerant? How the Increasingly Diverse Racial Landscape Affects White Americans' Racial Attitudes" and for "On the Precipice of a 'Majority-Minority' America", both by Maureen A. Craig and Jennifer A. Richeson.

Michael Bailey for answering questions about his ideal point estimates.

Jeremy Freese for answering questions and conducting research about past studies of the Time-sharing Experiments for the Social Sciences program.

Antoine Banks and AJPS editor William Jacoby for posting data for "Emotional Substrates of White Racial Attitudes" by Antoine J. Banks and Nicholas A. Valentino.

Gábor Simonovits for data for "Publication Bias in the Social Sciences: Unlocking the File Drawer" by Annie Franco, Neil Malhotra, and Gábor Simonovits.

Ryan Powers for posting and sending data and code for "The Gender Citation Gap in International Relations" by Daniel Maliniak, Ryan Powers, and Barbara F. Walter. Thanks also to Daniel Maliniak for answering questions about the analysis.

Maya Sen for data and code for "How Judicial Qualification Ratings May Disadvantage Minority and Female Candidates" by Maya Sen.

Antoine Banks for data and code for "The Public's Anger: White Racial Attitudes and Opinions Toward Health Care Reform" by Antoine J. Banks.

Travis L. Dixon for the codebook for and for answering questions about "The Changing Misrepresentation of Race and Crime on Network and Cable News" by Travis L. Dixon and Charlotte L. Williams.

Adam Driscoll for providing summary statistics for "What's in a Name: Exposing Gender Bias in Student Ratings of Teaching" by Lillian MacNell, Adam Driscoll, and Andrea N. Hunt.

Andrei Cimpian for answering questions and providing more detailed data than available online for "Expectations of Brilliance Underlie Gender Distributions across Academic Disciplines" by Sarah-Jane Leslie, Andrei Cimpian, Meredith Meyer, and Edward Freeland.

Vicki L. Claypool Hesli for providing data and the questionnaire for "Predicting Rank Attainment in Political Science" by Vicki L. Hesli, Jae Mook Lee, and Sara McLaughlin Mitchell.

Jo Phelan for directing me to data for "The Genomic Revolution and Beliefs about Essential Racial Differences: A Backdoor to Eugenics?" by Jo C. Phelan, Bruce G. Link, and Naumi M. Feldman.

Spencer Piston for answering questions about "Accentuating the Negative: Candidate Race and Campaign Strategy" by Yanna Krupnikov and Spencer Piston.

Amanda Koch for answering questions and providing information about "A Meta-Analysis of Gender Stereotypes and Bias in Experimental Simulations of Employment Decision Making" by Amanda J. Koch, Susan D. D'Mello, and Paul R. Sackett.

Kevin Wallsten and Tatishe M. Nteta for answering questions about "Racial Prejudice Is Driving Opposition to Paying College Athletes. Here's the Evidence" by Kevin Wallsten, Tatishe M. Nteta, and Lauren A. McCarthy.

Hannah-Hanh D. Nguyen for answering questions and providing data for "Does Stereotype Threat Affect Test Performance of Minorities and Women? A Meta-Analysis of Experimental Evidence" by Hannah-Hanh D. Nguyen and Ann Marie Ryan.

Solomon Messing for posting data and code for "Bias in the Flesh: Skin Complexion and Stereotype Consistency in Political Campaigns" by Solomon Messing, Maria Jabon, and Ethan Plaut.

Sean J. Westwood for data and code for "Fear and Loathing across Party Lines: New Evidence on Group Polarization" by Sean J. Westwood and Shanto Iyengar.

Charlotte Cavaillé for code and for answering questions for the Monkey Cage post "No, Trump won't win votes from disaffected Democrats in the fall" by Charlotte Cavaillé.

Kris Byron for data for "Women on Boards and Firm Financial Performance: A Meta-Analysis" by Corrine Post and Kris Byron.

Hans van Dijk for data for "Defying Conventional Wisdom: A Meta-Analytical Examination of the Differences between Demographic and Job-Related Diversity Relationships with Performance" by Hans van Dijk, Marloes L. van Engen, and Daan van Knippenberg.

Alexandra Filindra for answering questions about "Racial Resentment and Whites' Gun Policy Preferences in Contemporary America" by Alexandra Filindra and Noah J. Kaplan.

---

I have been trying to reproduce several studies and have noticed that the reporting of results from these studies often presents a much stronger impression of results than I get from an investigation of the data itself. I plan to report some of these reproduction attempts, so I have been reading literature on researcher degrees of freedom and the file drawer problem. Below I'll post and comment on some interesting passages that I have happened upon.

---

To put it another way: without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology. (Gelman and Loken, 2013, 14-15, emphasis in the original)

I wonder how many people in the general population take seriously general claims based only on small MTurk and college student samples, provided that these people are informed that the claims rest on small unrepresentative samples; I suspect that some of the "taking seriously" that leads to publication in leading psychology journals reflects professional courtesy among peer researchers whose work is also largely based on small unrepresentative samples.

---

Maybe it's because I haven't done much work with small unrepresentative samples, but I feel cheated when investing time in an article framed in general language that has conclusions based on small unrepresentative samples. Here's an article that I recently happened upon: "White Americans' opposition to affirmative action: Group interest and the harm to beneficiaries objection." The abstract:

We focused on a powerful objection to affirmative action – that affirmative action harms its intended beneficiaries by undermining their self-esteem. We tested whether White Americans would raise the harm to beneficiaries objection particularly when it is in their group interest. When led to believe that affirmative action harmed Whites, participants endorsed the harm to beneficiaries objection more than when led to believe that affirmative action did not harm Whites. Endorsement of a merit-based objection to affirmative action did not differ as a function of the policy’s impact on Whites. White Americans used a concern for the intended beneficiaries of affirmative action in a way that seems to further the interest of their own group.

So who were these white Americans?

Sixty White American students (37% female, mean age = 19.6) at the University of Kansas participated in exchange for partial course credit. One participant did not complete the dependent measure, leaving 59 participants in the final sample. (p. 898)

I won't argue that this sort of research should not be done, but I'd like to see this sort of exploratory research replicated with a more representative sample. One of the four co-authors listed her institutional affiliation as California State University San Bernardino, and two other co-authors listed their institutional affiliation as Tulane University, so I would have liked to see a second study conducted on a different sample of students. At the very least, I'd like the restricted nature of the sample to be described in the abstract, to let me and other readers make a more informed judgment about the value of investing time in the article.

---

The Gelman and Loken (2013) passage cited above reminded me of a recent controversy regarding a replication attempt of Schnall et al. (2008). I read about the controversy in a Nicole Janz post at Political Science Replication. The result of the replication (a perceived failure to replicate) was not shocking because Schnall et al. (2008) had reported only two experiments based on data from 40 and 43 University of Plymouth undergraduates.

---

Schnall in a post on the replication attempt:

My graduate students are worried about publishing their work out of fear that data detectives might come after them and try to find something wrong in their work. Doing research now involves anticipating a potential ethics or even criminal investigation.

I like the term "data detectives" a bit better than "replication police" (h/t Nicole Janz), so I think that I might adopt the label "data detective" for myself.

I can sympathize with the graduate students' fear, given that someone might target my own work and try to find an error in it, but that is a necessary occupational hazard for a scientist.

The best way to protect research from data detectives is to produce research that is reproducible and that is perceived to be replicable; one of the worst ways is to publish low-powered studies in a high-profile journal, because the high profile draws attention and the low power increases suspicion that the finding was due to the non-reporting of failed experiments.

---

From McBee and Matthews (2014):

Researchers who try to serve the interests of science are going to find themselves out-competed by those who elect to “play the game,” because the ethical researcher will conduct a number of studies that will prove unpublishable because they lack statistically significant findings, whereas the careerist will find ways to achieve significance far more frequently. (p. 77)

This reflects part of the benefit produced by data detectives and the replication police: a more even playing field for researchers reluctant to take advantage of researcher degrees of freedom.

---

This Francis (2012) article is an example of a data detective targeting an article to detect non-reporting of experiments. Balcetis and Dunning (2010) reported five experiments rejecting the null hypothesis; the Ns, effect sizes, and powers of these experiments are listed in the table on p. 176 of Francis (2012).

Francis summed the powers to get 3.11, which indicates the number of times that we should expect the null hypothesis to be rejected, given the observed effect sizes and powers of the five experiments; Francis multiplied the powers to get 0.076, which indicates the probability that the null hypothesis would be rejected in all five experiments.
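
The arithmetic behind those two numbers is simple enough to sketch in Stata. The five power values below are placeholders that I chose only so that they reproduce the reported sum (3.11) and product (roughly 0.076); they are not the values from Francis's table.

```stata
* Placeholder power values chosen only to reproduce the reported sum (3.11) and
* product (~0.076); these are not the actual values from Francis (2012).
local powers 0.35 0.55 0.65 0.75 0.81
local expected = 0
local all_reject = 1
foreach p of local powers {
    local expected = `expected' + `p'
    local all_reject = `all_reject' * `p'
}
display "Expected number of rejections across the five experiments: " `expected'
display "Probability that all five experiments reject the null: " `all_reject'
```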

---

Here is Francis again detecting more improbable results. And again. Here's a back-and-forth between Simonsohn and Francis on Francis' publication bias studies.

---

Here's the Galak and Meyvis (2012) reply to another Francis article, which claimed to have detected non-reporting of experiments in Galak and Meyvis (2011). Galak and Meyvis admit to the non-reporting:

We reported eight successful demonstrations of this phenomenon in our paper, but we also conducted five additional studies whose results either did not reach conventional levels of significance or did reach significance but ended up being rhetorically redundant. (p. 595)

...but argue that it's not a problem because they weren't interested in effect sizes:

However, as is the case for many papers in experimental psychology, the goal was never to assess the exact size of the effect, but rather to test between competing theoretical predictions. (p. 595)

Even if it is true that the authors were unconcerned with effect size, I do not understand how that justifies not reporting results that fail to reach conventional levels of statistical significance.

So what about readers who *are* interested in effect sizes? Galak and Meyvis write:

If a researcher is interested in estimating the size of an effect reported in a published paper, we recommend asking the authors for their file drawer and conducting a meta-analysis. (p. 595-596)

That's an interesting solution: if you are reading an article and wonder about the effect size, put down the article, email the researchers, hope that the researchers respond, hope that the researchers send the data, and then -- if you receive the data -- conduct your own meta-analysis.

---

Jeremy Freese recently linked to a Jason Mitchell essay that discussed perceived problems with replications. Mitchell discussed many facets of replication, but I will restrict this post to Mitchell's claim that "[r]ecent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value."

Mitchell's claim appears to be based on a perceived asymmetry between positive and negative findings: "When an experiment succeeds, we can celebrate that the phenomenon survived these all-too-frequent shortcomings. But when an experiment fails, we can only wallow in uncertainty about whether a phenomenon simply does not exist or, rather, whether we were just a bit too human that time around."

Mitchell is correct that a null finding can be caused by experimental error, but Mitchell appears to overlook the fact that positive findings can also be caused by experimental error.

---

Mitchell also appears to confront only the possible "ex post" value of replications, but there is a possible "ex ante" value to replications.

Ward Farnsworth discussed ex post and ex ante thinking using the example of a person who accidentally builds a house that extends onto a neighbor's property: ex post thinking concerns how to best resolve the situation at hand, but ex ante thinking concerns how to make this problem less likely to occur in the future; tearing down the house is a wasteful decision through the perspective of ex post thinking, but it is a good decision from the ex ante perspective because it incentivizes more careful construction in the future.

In a similar way, the threat of replication incentivizes more careful social science. Rational replicators should gravitate toward research for which the evidence appears to be relatively fragile: all else equal, the value of a replication is higher for a study based on 83 undergraduates at one particular college than for a study based on a nationally-representative sample of 1,000 persons; likewise, all else equal, a replicator should pass on replicating a stereotype threat study in which the dependent variable is percent correct, in favor of replicating a study in which the stereotype threat effect was detected only with the more unusual measure of percent accuracy, that is, the percent correct among only the problems that the respondent attempted.

Mitchell is correct that there is a real possibility that a researcher's positive finding will not be replicated because of error on the part of the replicator, but, as a silver lining, this negative possibility incentivizes researchers concerned about failed replications to produce higher-quality research that reduces the chance that a replicator targets their research in the first place.
