The Public Opinion Quarterly article "Bias in the Flesh" provided evidence of "an evaluative penalty for darker skin" (quote from the abstract). Study 2 of the article was an MTurk survey. Some respondents were shown an image of Barack Obama with darkened skin, and some respondents were shown an image of Barack Obama with lightened skin. Both sets of respondents received the text: "We are interested in how people evaluate images of political figures. Consider the following image:"

Immediately following the image and text, respondents received 14 items that could be used to assess this evaluative penalty for darker skin; these items are listed in the boxes below. The first 11 items could be used to measure whether, compared to respondents in one of the conditions, respondents in the other condition completed more word fragments with words associated with negative stereotypes, such as LAZY or CRIME.

Please complete the following word fragments. Make sure to type out the entire word.
1. L A _ _
2. C R _ _ _
3. _ _ O R
4. R _ _
5. W E L _ _ _ _
6. _ _ C E
7. D _ _ _ Y
8. B R _ _ _ _ _
9. _ _ A C K
10. M I _ _ _ _ _ _
11. D R _ _
How competent is Barrack Obama?
1. Very competent
2. Somewhat competent
3. Neither competent nor incompetent
4. Somewhat incompetent
5. Very incompetent
How trustworthy is Barrack Obama?
1. Very trustworthy
2. Somewhat trustworthy
3. Neither trustworthy nor untrustworthy
4. Somewhat untrustworthy
5. Very untrustworthy
On a scale from 0 (coldest) to 100 (warmest) how do you feel about Barack Obama?

Items 1, 3, and 7 above are the only three items for which results were reported in the article and in the corresponding Monkey Cage post. In other words, the researchers selected 3 of 14 items to assess the evaluative penalty for darker skin. [Update: Footnote 16 in the article reported results for the combination of lazy, black, poor, welfare, crime, and dirty (p=0.078).]

If I'm using the correct formula, there are 16,369 different combinations of 14 items that could have been reported, not counting the null set and not counting reporting on only one item. Hopefully, I don't need a formula or calculation to convince you that there is a pretty good chance that random assignment variation alone would produce an associated two-tailed p-value less than 0.05 in at least one of those 16,369 combinations. The fact that the study reported one of these combinations doesn't provide much information about the evaluative penalty for darker skin.
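The count can be checked with a short calculation. Here is a minimal sketch in Python, assuming that each reportable outcome is an unordered subset of the 14 items with at least two members. The last quantity is an illustration under an added assumption (14 independent single-item tests), not something reported in the article:

```python
from math import comb

N_ITEMS = 14  # 11 word fragments + competence + trustworthiness + thermometer

# Count subsets of every size from 2 through 14 items.
by_size = sum(comb(N_ITEMS, k) for k in range(2, N_ITEMS + 1))

# Same count in closed form: all subsets, minus the empty set,
# minus the 14 single-item subsets.
closed_form = 2 ** N_ITEMS - 1 - N_ITEMS

# Illustration only (an assumption): if the 14 items were tested one at a
# time and the tests were independent, the chance of at least one p < 0.05
# under no true effect would be:
p_any_of_14 = 1 - 0.95 ** N_ITEMS

print(by_size, closed_form, round(p_any_of_14, 3))
```

Both counts come out to 16,369. The combinations overlap heavily, so their tests are far from independent, and the last number should be read only as a rough sense of scale for the 14 single items.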

The really discomforting part of this selective reporting is how transparently it was done: the main text of the article noted that only 3 of 11 puzzle-type items were selected, and the supplemental file included the items about Obama's competency, Obama's trustworthiness, and the Obama feeling thermometer. There was nothing hidden about this selective reporting, from what I can tell.

---

Notes:

1. For what it's worth, the survey had an item asking whether Obama's race is white, black, or mixed. But that doesn't seem to be useful for measuring an evaluative penalty for darker skin, so I didn't count it.

2. It's possible that the number of combinations that the peer reviewers would have permitted is less than 16,369. But that's an open question, given that the peer reviewers permitted 3 of 14 potential outcome variables to be reported [Update: ...in the main text of the article].

3. The data are not publicly available to analyze, so maybe the selective reporting in this instance didn't matter. I put in a request last week for the data, so hopefully we'll find out.

---

UPDATE (Jan 12, 2016)

1. I changed the title of the post from "Researchers select 1 of 16,369 combinations to report" to "Researchers select 2 of 16,369 combinations to report", because I overlooked footnote 16 in the article. Thanks to Solomon Messing for the pointer.

2. Omar Wasow noted that two of the items had a misspelling of Barack Obama's first name. Those misspellings appear in the questionnaire in the supplemental file for the article.

---

UPDATE (Jan 13, 2016)

1. Solomon Messing noted that data for the article are now available at the Dataverse. I followed as best I could the posted R code to reproduce the analysis in Stata, and I came close to the results reported in the article. I got the same percentages for the three word puzzles as the percentages that appear in the article: 33% for the lightened photo, and 45% for the darkened photo, with a small difference in t-scores (t=2.74 to t=2.64). Estimates and t-scores were also close for the reported result in footnote 16: estimates of 0.98 and 1.11 for me, and estimates of 0.97 and 1.11 in the article, with respective t-scores of 1.79 and 1.77. Compared to the 630 unexcluded respondents for the article, I had 5 extra respondents after exclusions (635 total).

The table below reports results from t-tests that I conducted. The Stata code is available here.

Bias in the Flesh Table 1

Let me note a few things from the table:

First, I reproduced the finding that, when the word puzzles were limited to the combination of lazy, dirty, and poor, unexcluded respondents in the darkened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the lightened photo condition.

However, if I combine the word puzzles for race, minority, and rap, the finding is that unexcluded respondents in the lightened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the darkened photo condition: the opposite inference. Same thing when I combine race, minority, rap, and welfare. And same thing when I combine race, minority, rap, welfare, and crime.

Sure, as a group, these five stereotypes -- race, minority, rap, welfare, and crime -- don't have the highest face validity of the 11 stereotypes for being the most negative stereotypes, but there doesn't appear to be anyone in political science enforcing a rule that researchers must report all potential or intended outcome variables.

2. Estimates for 5 of the 11 stereotype items fell to the negative side of zero, indicating that unexcluded respondents in the lightened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the darkened photo condition. And estimates for 6 of the 11 stereotype items fell to the positive side of zero, indicating that unexcluded respondents in the darkened photo condition completed more word puzzles in a stereotype-congruent way than unexcluded respondents in the lightened photo condition.

A 5-to-6 split like that is what we'd expect if there were truly no effect, so -- in that sense -- this experiment doesn't provide much evidence for the relative effect of the darkened photo. That isn't a statement that the true relative effect of the darkened photo is exactly zero, but it is a statement about the evidence that this experiment has provided.
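One way to put a number on "what we'd expect if there were truly no effect" is a sign test. Here is a rough sketch in Python, assuming for illustration that under the null each item's estimate is equally likely to fall on either side of zero and that the 11 items are independent. The independence assumption is not strictly true, since the same respondents completed all 11 items:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n_items = 11     # word-fragment items
n_positive = 6   # items whose estimate fell on the positive side of zero

# Probability of 6 or more positive estimates under the null.
p_upper = sum(binom_pmf(k, n_items) for k in range(n_positive, n_items + 1))
# Probability of 6 or fewer positive estimates under the null.
p_lower = sum(binom_pmf(k, n_items) for k in range(0, n_positive + 1))

# Two-sided sign-test p-value, capped at 1.
p_two_sided = min(1.0, 2 * min(p_upper, p_lower))

print(p_upper, p_two_sided)
```

Under these assumptions, a 6-to-5 split is as unremarkable as a split can be with 11 items: the one-sided probability is 0.5 and the two-sided p-value is 1.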

For what it's worth, the effect size is 0.118 and the p-value is 0.060 for the combination of word puzzles that I think has the most face validity for being the most negative stereotypes (lazy, poor, welfare, crime, drug, and dirty); the effect size is -0.032 and the p-value is 0.560 for the combination of word puzzles that I think has the least face validity for being the most negative stereotypes (race, black, brother, minority, and rap). So I'm not going to make any bets that the true effect is zero or that the lightened photo fosters relatively more activation of negative stereotypes.

3. Results for the competence, trustworthiness, and feeling thermometer items are pretty much what would be expected if the photo manipulation had no true effect on these items, with respective p-values of 0.904, 0.962, and 0.737. Solomon Messing noted that there is no expectation from the literature of an effect for these items, but now that I think of it, I'm not sure why showing a darkened photo of Obama would be expected to [1] make people more likely to call to mind negative racial stereotypes such as lazy and dirty but [2] have no effect on perceptions of Obama. In any event, I think that readers should have been told about the results for the competence, trustworthiness, and feeling thermometer items.

4. The report on these data suggested that the true effect is that the darkened photo increased stereotype activation. But I could have used the same data to argue for the inference that the darkened photo had no effect at all or at most only a negligible effect on stereotype activation and on attitudes toward Obama, had I reported the combination of all 11 word puzzles, plus the competence, trustworthiness, and feeling thermometer items. Moreover, had I selectively reported results and failed to inform peer reviewers of all the items, it might even have been possible to have published an argument that the true effect was that the lightened photo caused an increase in stereotype activation. I don't know why I should trust non-preregistered research if researchers have that much influence over inferences.

5. Feel free to check my code for errors or to report better ways to analyze the data.


The Washington Post police shootings database as of January 4, 2016, indicated that on-duty police officers in the United States shot dead 91 unarmed persons in 2015: 31 whites, 37 blacks, 18 Hispanics, and 5 persons of another race or ethnicity. The database updates; the screen shot below is the data as of January 4, 2016.

WaPo UM

The New York Times search engine restricted to dates in 2015 returned 1,281 hits for "unarmed black", 4 hits for "unarmed white", 0 hits for "unarmed Hispanic", and 0 hits for "unarmed Asian":

nytimesUnarmedBlack

nytimesUnarmedWhite

nytimesUnarmedHispanic

nytimesUnarmedAsian


The Monkey Cage published a post, "Racial prejudice is driving opposition to paying college athletes. Here's the evidence." I tweeted about this post in several threads, but I'm posting the information here for possible future reference and for anyone who reads the blog.

Here's the key figure from the post. The left side of the figure indicates that white respondents expressed more opposition to paying college athletes after exposure to a picture of black athletes than in a control condition with no picture.

After reading the post, I noted two oddities about the figure. First, based on the logic of an experiment -- change one thing only to assess the effect of that thing -- the proper comparison for assessing racial bias among white respondents would have been comparing the effect of a photo of black athletes to the effect of a photo of white athletes; that comparison would have removed the alternative explanations that respondents expressed more opposition because a photo was shown or because a photo of athletes was shown, and not necessarily because a photo of *black* athletes was shown. Second, the data were from the CCES, which typically has team samples of 1,000 respondents; these samples are presumably intended to be representative of the national population, so there should be more than 411 whites in a 1,000-respondent sample.

Putting two and two together suggested that there was an unreported condition in which respondents were shown a photo of white athletes. I emailed the three authors of the blog post, and to their credit I received substantive replies to my questions about the experiment. Based on the team's responses, the experiment did have a condition in which respondents were shown a photo of white athletes, and opposition to paying college athletes in this "white athletes" photo condition did not differ at p<0.05 (two-tailed test) from opposition to paying college athletes in the "black athletes" photo condition.


There is a common practice of discussing inequality in the United States without reference to Asian Americans, which permits the suggestion that the inequality is due to race or racial bias. Here's a recent example:

The graph reported results for Hispanics disaggregated into Cubans, Puerto Ricans, Mexicans, and other Hispanics, but the graph omitted results for Asians and Pacific Islanders, even though the note for the graph indicates that Asians/Pacific Islanders were included in the model. Here are data on Asian American poverty rates (source):

ACS

The omission of Asian Americans from discussions of inequality is a common enough practice [1, 2, 3, 4, 5] that it deserves a name. The Asian American Exclusion is as good as any.


Here is the manuscript that I plan to present at the 2015 American Political Science Association conference in September: revised version here. The manuscript contains links to locations of the data; a file of the reproduction code for the revised manuscript is here.

Comments are welcome!

Abstract and the key figure are below:

Racial bias is a persistent concern in the United States, but polls have indicated that whites and blacks on average report very different perceptions of the extent and aggregate direction of this bias. Meta-analyses of results from a population of sixteen federally-funded survey experiments, many of which have never been reported on in a journal or academic book, indicate the presence of a moderate aggregate black bias against whites but no aggregate white bias against blacks.

Metan w mc

NOTE:

I made a few changes since submitting the manuscript: [1] removing all cases in which the target was not black or white (e.g., Hispanics, Asians, control conditions in which the target did not have a race); [2] estimating meta-analyses without removing cases based on a racial manipulation check; and [3] estimating meta-analyses without the Cottrell and Neuberg 2004 survey experiment, given that that survey experiment was more about perceptions of racial groups than a test for racial bias against particular targets.

Numeric values in the figure are for a meta-analysis that reflects [1] above:

* For white respondents: the effect size point estimate was 0.039 (p=0.375), with a 95% confidence interval of [-0.047, 0.124].
* For black respondents: the effect size point estimate was 0.281 (p=0.016), with a 95% confidence interval of [0.053, 0.509].
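A point estimate, its 95% confidence interval, and its p-value should be roughly consistent with each other, and that can be checked by hand. Here is a minimal sketch in Python, assuming a symmetric interval built from a normal approximation with a critical value of 1.96. The function name `wald_check` is hypothetical, and the meta-analysis software may have used a slightly different method:

```python
from math import erfc, sqrt

def wald_check(estimate, ci_low, ci_high, z_crit=1.96):
    """Back out the standard error from a symmetric 95% interval, then
    compute the z-score and two-tailed p-value (normal approximation)."""
    se = (ci_high - ci_low) / (2 * z_crit)
    z = estimate / se
    p_two_tailed = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return se, z, p_two_tailed

# Reported values for the two meta-analyses above.
white = wald_check(0.039, -0.047, 0.124)
black = wald_check(0.281, 0.053, 0.509)

print("white: se=%.3f z=%.2f p=%.3f" % white)
print("black: se=%.3f z=%.2f p=%.3f" % black)
```

The recomputed p-values should land close to the reported 0.375 and 0.016; small gaps are expected because the inputs are rounded to three decimals.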

---

The meta-analysis graph includes five studies for which a racial manipulation check was used to restrict the sample: Pager 2006, Rattan 2010, Stephens 2011, Pedulla 2011, and Powroznik 2014. Inferences from the meta-analysis were the same when these five studies included respondents who failed the racial manipulation checks:

* For white respondents: the effect size point estimate was 0.027 (p=0.499), with a 95% confidence interval of [-0.051, 0.105].
* For black respondents: the effect size point estimate was 0.268 (p=0.017), with a 95% confidence interval of [0.047, 0.488].

---

Inferences from the meta-analysis were the same when the Cottrell and Neuberg 2004 survey experiment was removed from the meta-analysis. For the residual 15 studies using the racial manipulation check restriction:

* For white respondents: the effect size point estimate was 0.063 (p=0.114), with a 95% confidence interval of [-0.015, 0.142].
* For black respondents: the effect size point estimate was 0.210 (p=0.010), with a 95% confidence interval of [0.050, 0.369].

---

For the residual 15 studies not using the racial manipulation check restriction:

* For white respondents: the effect size point estimate was 0.049 (p=0.174), with a 95% confidence interval of [-0.022, 0.121].
* For black respondents: the effect size point estimate was 0.194 (p=0.012), with a 95% confidence interval of [0.044, 0.345].


Here is a passage from Pigliucci 2013.

Steele and Aronson (1995), among others, looked at IQ tests and at ETS tests (e.g. SATs, GREs, etc.) to see whether human intellectual performance can be manipulated with simple psychological tricks priming negative stereotypes about a group that the subjects self-identify with. Notoriously, the trick worked, and as a result we can explain almost all of the gap between whites and blacks on intelligence tests as an artifact of stereotype threat, a previously unknown testing situation bias.

Racial gaps are a common and perennial concern in public education, but this passage suggests that such gaps are an artifact. However, when I looked up Steele and Aronson (1995) to discover the evidence for this result, I discovered that the black participants and the white participants in the study were all Stanford undergraduates and that the students' test performances were adjusted by the students' SAT scores. Given that the analysis contained both sample selection bias and statistical control, it does not seem reasonable to make an inference about populations based on that analysis. This error in reporting results for Steele and Aronson (1995) is apparently common enough to deserve its own article.

---

Here's a related passage from Brian at Dynamic Ecology:

A neat example on the importance of nomination criteria for gender equity is buried in this post about winning Jeopardy (an American television quiz show). For a long time only 1/3 of the winners were women. This might lead Larry Summers to conclude men are just better at recalling facts (or clicking the button to answer faster). But a natural experiment (scroll down to the middle of the post to The Challenger Pool Has Gotten Bigger) shows that nomination criteria were the real problem. In 2006 Jeopardy changed how they selected the contestants. Before 2006 you had to self-fund a trip to Los Angeles to participate in try-outs to get on the show. This required a certain chutzpah/cockiness to lay out several hundred dollars with no guarantee of even being selected. And 2/3 of the winners were male because more males were making the choice to take this risk. Then they switched to an online test. And suddenly more participants were female and suddenly half the winners were female. [emphasis added]

I looked up the 538 post linked to in the passage, which reported: "Almost half of returning champions this season have been women. In the year before Jennings's streak, fewer than 1 in 3 winners were female." That passage provides two data points: this season appears to be 2015 (the year of the 538 post), and the year before Jennings's streak appears to be 2003 (the 538 post noted that Jennings's streak occurred in 2004). The 538 post reported that the rule change for the online test occurred in 2006.

So here's the relevant information from the 538 post:

  • In 2003, fewer than 1 in 3 Jeopardy winners were women.
  • In 2006, the selection process was changed to an online test.
  • Presumably in 2015, through early May, almost half of Jeopardy winners have been women.

It does not seem that comparison of a data point from 2003 to a partial data point from 2015 permits use of the descriptive term "suddenly."

It's entirely possible -- and perhaps probable -- that the switch to an online test for qualification reduced gender inequality in Jeopardy winners. But that inference needs more support than the minimal data reported in the 538 post.


I left this as a comment here.

For what it's worth, here are questions that I ask when evaluating research:

1. Did the researchers preregister their research design choices so that we can be sure that the research design choices were not made based on the data? If not, are the research design choices consistent with the choices that the researchers have previously made in other research?

2. Have the researchers publicly posted documentation and all the data that were collected, so that other researchers can check the analysis for errors and assess the robustness of the reported results?

3. Did the researchers declare that there are no unreported file drawer studies, unreported manipulations, and unreported variables that were measured?

4. Were the data collected by an independent third party?

5. Is the sample representative of the population of interest?
