Researchers select 2 of 16,369 combinations to report

The Public Opinion Quarterly article "Bias in the Flesh" provided evidence of "an evaluative penalty for darker skin" (quote from the abstract). Study 2 of the article was an MTurk survey. Some respondents were shown an image of Barack Obama with darkened skin, and some respondents were shown an image of Barack Obama with lightened skin. Both sets of respondents received the text: "We are interested in how people evaluate images of political figures. Consider the following image:"

Immediately following the image and text, respondents received 14 items that could be used to assess this evaluative penalty for darker skin; these items are listed in the boxes below. The first 11 items could be used to measure whether, compared to respondents in one of the conditions, respondents in the other condition completed more word fragments with words associated with negative stereotypes, such as LAZY or CRIME.

Please complete the following word fragments. Make sure to type out the entire word.
1. L A _ _
2. C R _ _ _
3. _ _ O R
4. R _ _
5. W E L _ _ _ _
6. _ _ C E
7. D _ _ _ Y
8. B R _ _ _ _ _
9. _ _ A C K
10. M I _ _ _ _ _ _
11. D R _ _
How competent is Barrack Obama?
1. Very competent
2. Somewhat competent
3. Neither competent nor incompetent
4. Somewhat incompetent
5. Very incompetent
How trustworthy is Barrack Obama?
1. Very trustworthy
2. Somewhat trustworthy
3. Neither trustworthy nor untrustworthy
4. Somewhat untrustworthy
5. Very untrustworthy
On a scale from 0 (coldest) to 100 (warmest) how do you feel about Barack Obama?

Items 1, 3, and 7 above (LAZY, POOR, and DIRTY) are the only three items for which results were reported in the article and in the corresponding Monkey Cage post. In other words, the researchers selected 3 of 14 items to assess the evaluative penalty for darker skin. [Update: Footnote 16 in the article reported results for the combination of lazy, black, poor, welfare, crime, and dirty (p=0.078).]

If I'm using the correct formula, there are 16,369 different combinations of 14 items that could have been reported, not counting the null set and not counting reporting on only one item. Hopefully, I don't need a formula or calculation to convince you that there is a pretty good chance that random assignment variation alone would produce an associated two-tailed p-value less than 0.05 in at least one of those 16,369 combinations. The fact that the study reported one of these combinations doesn't provide much information about the evaluative penalty for darker skin.
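The count above can be checked with a few lines of Python. This is a minimal sketch, assuming that a "combination" means any subset of the 14 items with at least two members; the variable names are mine and are not from the article or its replication files.

```python
from math import comb

N_ITEMS = 14  # 11 word fragments + competence, trustworthiness, thermometer

# Count subsets of the 14 items by size, skipping size 0 (the null set)
# and size 1 (reporting on only one item).
by_size = {k: comb(N_ITEMS, k) for k in range(2, N_ITEMS + 1)}
total = sum(by_size.values())

# Cross-check against the closed form 2^n - 1 - n.
assert total == 2 ** N_ITEMS - 1 - N_ITEMS

print(total)       # 16369
print(by_size[3])  # 364 three-item combinations
```

The closed form is 2^14 - 1 - 14: all subsets, minus the null set, minus the 14 single-item subsets.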

The really discomforting part of this selective reporting is how transparently it was done: the main text of the article noted that only 3 of the 11 puzzle-type items were selected, and the supplemental file included the items about Obama's competency, Obama's trustworthiness, and the Obama feeling thermometer. There was nothing hidden about this selective reporting, from what I can tell.

---

Notes:

1. For what it's worth, the survey had an item asking whether Obama's race is white, black, or mixed. But that doesn't seem to be useful for measuring an evaluative penalty for darker skin, so I didn't count it.

2. It's possible that the number of combinations that the peer reviewers would have permitted is less than 16,369. But that's an open question, given that the peer reviewers permitted 3 of 14 potential outcome variables to be reported [Update: ...in the main text of the article].

3. The data are not publicly available to analyze, so maybe the selective reporting in this instance didn't matter. I put in a request last week for the data, so hopefully we'll find out.

---

UPDATE (Jan 12, 2016)

1. I changed the title of the post from "Researchers select 1 of 16,369 combinations to report" to "Researchers select 2 of 16,369 combinations to report", because I overlooked footnote 16 in the article. Thanks to Solomon Messing for the pointer.

2. Omar Wasow noted that two of the items had a misspelling of Barack Obama's first name. Those misspellings appear in the questionnaire in the supplemental file for the article.

---

UPDATE (Jan 13, 2016)

1. Solomon Messing noted that data for the article are now available at the Dataverse. I followed as best I could the posted R code to reproduce the analysis in Stata, and I came close to the results reported in the article. I got the same percentages for the three word puzzles as the percentages that appear in the article: 33% for the lightened photo, and 45% for the darkened photo, with a small difference in t-scores (t=2.74 to t=2.64). Estimates and t-scores were also close for the reported result in footnote 16: estimates of 0.98 and 1.11 for me, and estimates of 0.97 and 1.11 in the article, with respective t-scores of 1.79 and 1.77. Compared to the 630 unexcluded respondents for the article, I had 5 extra respondents after exclusions (635 total).

The table below reports results from t-tests that I conducted. The Stata code is available here.

Bias in the Flesh Table 1

Let me note a few things from the table:

First, I reproduced the finding that, when the word puzzles were limited to the combination of lazy, dirty, and poor, unexcluded respondents in the darkened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the lightened photo condition.

However, if I combine the word puzzles for race, minority, and rap, the finding is that unexcluded respondents in the lightened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the darkened photo condition: the opposite inference. Same thing when I combine race, minority, rap, and welfare. And same thing when I combine race, minority, rap, welfare, and crime.

Sure, as a group, these five stereotypes -- race, minority, rap, welfare, and crime -- don't have the highest face validity of the 11 stereotypes for being the most negative stereotypes, but there doesn't appear to be anyone in political science enforcing a rule that researchers must report all potential or intended outcome variables.

2. Estimates for 5 of the 11 stereotype items fell to the negative side of zero, indicating that unexcluded respondents in the lightened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the darkened photo condition. And estimates for 6 of the 11 stereotype items fell to the positive side of zero, indicating that unexcluded respondents in the darkened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the lightened photo condition.

A 5-to-6 split like that is what we'd expect if there were truly no effect, so -- in that sense -- this experiment doesn't provide much evidence for the relative effect of the darkened photo. That isn't a statement that the true relative effect of the darkened photo is exactly zero, but it is a statement about the evidence that this experiment has provided.
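One rough way to put a number on that 5-to-6 split is an exact two-sided sign test. This is only a sketch under a strong assumption: it treats the 11 item-level estimates as independent coin flips under the null, which they are not, because the same respondents completed all 11 word puzzles. The function name is mine, and this test is not part of the article's analysis or of my Stata code.

```python
from math import comb

def sign_test_two_sided(n_positive: int, n_total: int) -> float:
    """Exact two-sided sign test p-value under H0: P(positive sign) = 0.5."""
    # Probability of each possible count of positive signs under H0.
    probs = [comb(n_total, k) / 2 ** n_total for k in range(n_total + 1)]
    observed = probs[n_positive]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-12))
    return min(1.0, tail)

# 6 of the 11 item estimates were positive, 5 were negative.
p_value = sign_test_two_sided(n_positive=6, n_total=11)
print(p_value)  # 1.0
```

A 6-of-11 split is the most likely outcome under the null, so the two-sided p-value is 1. By comparison, all 11 estimates falling on the same side would give 2/2048, or about 0.001.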

For what it's worth, the effect size is 0.118 and the p-value is 0.060 for the combination of word puzzles that I think has the most face validity for being the most negative stereotypes (lazy, poor, welfare, crime, drug, and dirty); the effect size is -0.032 and the p-value is 0.560 for the combination of word puzzles that I think has the least face validity for being the most negative stereotypes (race, black, brother, minority, and rap). So I'm not going to make any bets that the true effect is zero or that the lightened photo fosters relatively more activation of negative stereotypes.

3. Results for the competence, trustworthiness, and feeling thermometer items are pretty much what would be expected if the photo manipulation had no true effect on these items, with respective p-values of 0.904, 0.962, and 0.737. Solomon Messing noted that there is no expectation from the literature of an effect for these items, but now that I think of it, I'm not sure why showing a darkened photo of Obama would be expected to [1] make people more likely to call to mind negative racial stereotypes such as lazy and dirty but [2] have no effect on perceptions of Obama. In any event, I think that readers should have been told about the results for the competence, trustworthiness, and feeling thermometer items.

4. The report on these data suggested that the true effect is that the darkened photo increased stereotype activation. But I could have used the same data to argue for the inference that the darkened photo had no effect at all or at best only a negligible effect on stereotype activation and on attitudes toward Obama, had I reported the combination of all 11 word puzzles, plus the competence, trustworthiness, and feeling thermometer items. Moreover, had I selectively reported results and failed to inform peer reviewers of all the items, it might even have been possible to have published an argument that the true effect was that the lightened photo caused an increase in stereotype activation. I don't know why I should trust non-preregistered research if researchers have that much influence over inferences.

5. Feel free to check my code for errors or to report better ways to analyze the data.

4 Comments on “Researchers select 2 of 16,369 combinations to report”

  1. Thanks for your interest in this work. It sounds like the suggestion here is that we might have tried every possible combination of three stereotype word completions and chose one that was significant. In fact, we chose these three items in a principled way.

    We knew from the start that we wanted to focus on the most negative stereotypes, because past studies clearly show that darker images activate the most negative stereotypes about Blacks (Maddox and Gray 2002; Blair et al. 2002), and because those negative stereotypes are arguably the most detrimental.

    Why did we choose to analyze the three negative stereotype completions we did? In addition to being the most negative in our judgement, those three items maximized interclass correlation, a commonly used technique to form an index that corresponds to a singular concept.

    And the effect is robust to alternative design choices. If you use another technique, maximizing alpha reliability, the effect persists, though it is noisier.

    Regardless of the way you slice the sample, e.g., if you include participants whose IP addresses do not resolve to the U.S. or who failed an attention check, the effect persists, though it is again noisier.

    Note that we posted replication materials on January 10 here: http://dx.doi.org/10.7910/DVN/F0NDJP.

    Of course, we did not pre-register this study and we certainly aren’t saying that the scientific community should accept these findings as the last word on the matter. We would encourage you and other interested scholars to perform independent replications to produce even better evidence on this question.

    • Thanks for the comment, Solomon.

      I'm not suggesting that your team tried 16 thousand combinations. Except as reported in the article or as revealed in the reproduction code or elsewhere, I have no way of knowing what was analyzed and what was not analyzed.

      I think that the combinations that your team reported were justified, but I think that there are justifiable tests that were not reported that I as a reader would have wanted to be informed of. In my opinion, CRIME is at least as negative as DIRTY, and WELFARE is more negative than POOR, so there are unreported 3-item combinations that have reasonable face validity as being the most negative stereotypes. Maybe there are also alternate justifiable statistical techniques that could have been used to narrow down the 11 word puzzles into different combinations.

      And I was certainly interested in results for the three items about President Obama. If the darkened photo fosters stereotype activation, and these stereotypes are detrimental, then it would seem that the items about Obama can be used to assess whether these stereotypes really are detrimental in the particular context of evaluations of Obama.

      You are correct that the darkened photo effect is robust to alternate research designs, but -- as I indicated in the update to the post from today -- my analysis indicates that the effect is not robust to all reasonable research designs. For example, it appears that the results could have been reported as indicating that the darkened photo had at most a negligible effect: in my analysis, the p-value for the scale of the 11 stereotypes was 0.344, the p-value for the competence item was 0.904, the p-value for the trustworthiness item was 0.962, and the p-value for the feeling thermometer was 0.737. I'm not arguing that this would be the best way to report the data, but I think that -- given the completeness of reporting on all 14 potential outcome variables -- this reporting is at least as reasonable as what was reported.

      Based on my analysis, I think that a fair description of the experiment's evidence for the relative effect of the darkened photo of Obama is: [1] good evidence for activation of the most negative stereotypes, [2] not much evidence for activation of stereotypes that are not as negative (RACE) or are not as valid (e.g., BROTHER), and [3] not much evidence of an influence on evaluations of Obama himself.

      • After thinking about this, I agree we might expect T --> X --> Y, where T is our treatment and X is stereotype activation and Y is trust, competence, or the feeling thermometer.

        And indeed, if you model stereotype activation as a moderator variable, it looks like there could be interactions between darker complexion and stereotype activation for the thermometer ratings and for the trust variable.

        If you use the whole sample for additional statistical power, those interactions become significant at .10 two-tailed/.05 one-tailed.

        This should be taken with the caveat that there could be estimation problems in this specification (as T and X are correlated) and we are conditioning on a post-treatment variable. The estimates of the treatment effect could be biased downward (hence biasing other estimates in the models as well).

        I thought about an IV setup, but that strikes me as inappropriate since the treatment could certainly affect those outcomes via other channels besides just stereotype activation.

        Note that I did some work trying to get at something like this back in 2009, which suggested that darkening complexion might affect outcomes like thermometer ratings, trust, and competence via implicit channels (e.g., study two here: http://pcl.stanford.edu/research/2010/iyengar-racial-candidate.pdf), which is more or less consistent with a moderation relationship.

        As usual, more research is necessary.

        • Thanks for the additional analysis and for the link, Solomon. I think we can agree on that causal chain.

          I wonder whether there might be a dilution or a reversal of the darker skin effect for white conservatives evaluating black conservatives, compared to evaluating black liberals. Most black conservatives who come to my mind have relatively dark skin and seem to have been embraced by white conservatives. Maybe [a] outgroup members who agree with the ingroup are perceived favorably, and [b] darker skin is perceived as a stronger signal of racial outgroup membership for white conservatives.
