The Monkey Cage tweeted a link to a post (Gift et al 2022), claiming that "Just seeing a Fox News logo prompts racial bias, new research suggests".

This new research is Bell et al 2022, which reported on an experiment that manipulated the logo on a news story provided to participants (no logo, a CNN logo, or a Fox News logo) and manipulated the name of the U.S. Army Ranger in the news story who was accused of killing a wounded Taliban detainee, with the name signaling race or ethnicity (no name, Tyrone Washington, Mustafa Husain, Santiago Gonzalez, or Todd Becker).

The Appendix to Bell et al 2022 reports some results for all respondents, but Bell et al 2022 indicates (footnotes and citations omitted):

Research on racial attitudes in America largely theorizes about the proclivities and nuances of racial animus harbored by Whites, so we follow conventions in the literature by restricting our analysis to 1149 White respondents.

Prior relevant post.

---

1.

From the Gift et al 2022 Monkey Cage post (emphasis added):

The result wasn't what we necessarily expected. We didn't anticipate that the Fox News logo might negatively affect attitudes toward the Black service member any more than soldiers of other races. So what could explain this outcome?

The regression results reported in Bell et al 2022 have the "no name" condition as the omitted category, so the 0.180 coefficient and 0.0705 standard error for the [Black X Fox News] interaction term for the "convicted" outcome indicates the effect of the Fox News logo in the Black Ranger condition relative to the effect of the Fox News logo in the no-name condition.

But, for assessing anti-Black bias among White participants, it seems preferable to compare the effect of the Fox News logo in the Black Ranger condition to the effect of the Fox News logo in the White Ranger condition. Otherwise, the Black name / no-name comparison might conflate the effect of a Black name for the Ranger with the general effect of naming the Ranger. Moreover, a Black name / White name comparison would better fit the claim about "any more than soldiers of other races".

---

For the "convicted" outcome, the coefficient and standard error for the [White X Fox News] interaction term are 0.0917 and 0.0701. I don't think that there is sufficient evidence that the 0.180 [Black X Fox News] coefficient differs from the 0.0917 [White X Fox News] coefficient, given that the difference between the interaction coefficients is only about 0.09 and each has a standard error of about 0.07.

A similar concern applies to the "justified" outcome, which had respective coefficients (and standard errors) of −0.142 (0.0693) for [Black X Fox News] and −0.0841 (0.0692) for [White X Fox News]. I didn't see the replication materials for Bell et al 2022 in the journal's Dataverse, or else I would have tried to calculate the p-values.
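As a rough check, assuming zero covariance between the two interaction estimates (the actual covariance can't be computed without the replication materials), the differences for both outcomes appear to be nowhere near statistical significance:

# Back-of-the-envelope comparison of the reported [Black X Fox News] and
# [White X Fox News] coefficients, ignoring the covariance between the two estimates
b_black <- 0.180;  se_black <- 0.0705
b_white <- 0.0917; se_white <- 0.0701
z_c <- (b_black - b_white) / sqrt(se_black^2 + se_white^2)
2 * pnorm(-abs(z_c))   # roughly p = 0.37 for the "convicted" outcome
z_j <- (-0.142 - -0.0841) / sqrt(0.0693^2 + 0.0692^2)
2 * pnorm(-abs(z_j))   # roughly p = 0.55 for the "justified" outcome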

---

2.

From the Gift et al 2022 Monkey Cage post:

Of course one study is hardly definitive. Our analysis points to the need for more research into how Fox News and other media may or may not prime racial attitudes across a range of political and social issues.

Yes, one study is not definitive, so it might have been a good idea for the Gift et al 2022 Monkey Cage post to have mentioned the replication attempt *published in Bell et al 2022* in which the [Black X Fox News] interaction term did not replicate in statistical significance or even in the direction of the coefficients, with a −0.00371 coefficient for the "convicted" outcome and a 0.0199 coefficient for the "justified" outcome.

I can't see a good reason for the Gift et al 2022 Monkey Cage post to not report results for the preregistered replication attempt, or for the Monkey Cage editors to have been unaware of the replication attempt or to have permitted publication of the post without mention of the lack of replication for the [Black X Fox News] interaction term.

The preregistration suggests that the replication attempt was due to the journal (Research & Politics), so it seems that we can thank a peer reviewer or editor for the replication attempt.

---

3.

Below is the first sentence from the preregistration's entry for the main question for Study 2:

White Americans who see a story about a non-white soldier will be more likely to say the soldier should be punished for their alleged crime than either an unnamed soldier or a white soldier.

Bell et al 2022 Appendix Table A2 indicates that means for the "convicted" outcome in Study 2 were, from high to low and by condition:

No logo news source
0.725 White name
0.697 Latin name
0.692 MEast name
0.680 No name 
0.655 Black name

CNN logo
0.705 No name 
0.698 Latin name
0.695 Black name
0.694 White name
0.688 MEast name

Fox News logo
0.730 No name 
0.703 White name
0.702 Black name
0.695 MEast name
0.688 Latin name

So, in the Fox News condition of this *preregistered* experiment, the highest point estimate for a named Ranger on the "convicted" outcome (which seems like a better measure of punishment than the "justified" outcome) was for the White Ranger.

The gap between the highest mean "convicted" outcome for a named Ranger (0.703) and the lowest mean "convicted" outcome for a named Ranger (0.688) was 0.015 units on a 0-to-1 scale. That seems small enough to be consistent with random assignment error and to be inconsistent with the title of the Monkey Cage post of "Just seeing a Fox News logo prompts racial bias, new research suggests".
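To give a sense of scale, here is a purely illustrative calculation under assumed numbers (roughly 100 White respondents per cell and an outcome standard deviation of about 0.3, neither of which is taken from Bell et al 2022):

# Approximate standard error of a difference between two cell means under the
# assumed cell size and standard deviation
sqrt(0.3^2/100 + 0.3^2/100)   # about 0.042, much larger than the observed 0.015 gap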

---

NOTES

1. Tweet question to authors of Bell et al 2022.

2. The constant in the Bell et al 2022 OLS regressions represents the no-name Ranger in the no-logo news story.

In Study 1, this constant indicates that the Ranger in the no-name no-logo condition was rated on a 0-to-1 scale as 0.627 for the "convicted" outcome and as 0.389 for the "justified" outcome. This balance makes sense: on net, participants in the no-name no-logo condition agreed that the Ranger should be convicted and disagreed that the Ranger's actions were justified. Appendix Table A1 indicates that the mean "convicted" rating was above 0.50 and the mean "justified" rating was below 0.50 for each of the 15 conditions for Study 1.

But the constants in Study 2 were 0.680 for the "convicted" outcome and 0.711 for the "justified" outcome, which means that, on net, participants in the no-name no-logo condition agreed that the Ranger should be convicted and agreed that the Ranger's actions were justified. Appendix Table A2 indicates that the mean for both outcomes was above 0.50 for each of the 15 conditions for Study 2.

3. I think that Bell et al 2022 Appendix A1 might report results for all respondents: the sample size in A1 is N=1554, but in the main text Table 2 sample sizes are N=1149 for the convicted outcome and 1140 for the justified outcome. Moreover, I think that the main text Figure 2 might plot these A1 results (presumably for all respondents) and not the Table 2 results that were limited to White respondents.

For example, A1 has the mean "convicted" rating as 0.630 for no-name no-logo, 0.590 for no-name CNN logo, and 0.636 for no-name Fox News logo, which matches the CNN dip in the leftmost panel of Figure 2 and Fox News being a bit above the no-logo estimate in that panel. But the "convicted" constant in Table 1 is 0.630 (for the no-name no-logo condition), with a −0.0303 coefficient for CNN and a −0.0577 coefficient for Fox News, so based on this I think that the no-name Fox News mean should be lower than the no-name CNN mean.
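For reference, the no-name cell means implied by the Table 1 constant and logo coefficients are below, which is why I'd expect the no-name Fox News mean to fall below the no-name CNN mean:

# Implied no-name cell means from the reported constant and logo coefficients
0.630 - 0.0303   # no-name CNN: about 0.600
0.630 - 0.0577   # no-name Fox News: about 0.572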

The bumps in Figure 2 better match with Appendix Table A5 estimates, which are for all respondents.

4. This Bell et al 2022 passage about Study 2 seems misleading or at least easy to misinterpret (emphasis in the original, footnote omitted):

If the soldier was White and the media source was unnamed, respondents judged him to be significantly less justified in his actions, but when the same information was presented under the Fox News logo, respondents found him to be significantly more justified in his actions.

As indicated in the coefficients and Figure 3, the "more justified" isn't more justified relative to the no-name no-logo condition; instead, the Fox News logo offset the bias against the White Ranger relative to the no-name Ranger that appeared in the no-logo condition. Relevant coefficients are −0.131 for "White", which indicates the reduction in the "justified" rating from the no-name no-logo condition to the White-name no-logo condition, and 0.169 for "White X Fox News", which indicates the White-name Fox News advantage relative to the no-name Fox News effect.

So the Fox News bias favoring the White Ranger in the Study 2 "justified" outcome only a little more than offset the bias against the White Ranger in the no-logo condition, with a net bias that I suspect might be small enough to be consistent with random assignment error.


1.

In May, I published a blog post about deviations from the pre-analysis plan for the Stephens-Dougan 2022 APSR letter, and I tweeted a link to the blog post that tagged @LaFleurPhD and asked her directly about the deviations from the pre-analysis plan. I don't recall receiving a response from Stephens-Dougan, and, a few days later, on May 31, I emailed the APSR about my post, listing three concerns:

* The Stephens-Dougan 2022 description of racially prejudiced Whites not matching how the code for Stephens-Dougan 2022 calculated estimates for racially prejudiced Whites.

* The substantial deviations from the pre-analysis plan.

* Figure 1 of the APSR letter reporting weighted estimates, but the evidence being much weaker in unweighted analyses.

Six months later (December 5), the APSR published a correction to Stephens-Dougan 2022. The correction addresses each of my three concerns, but not perfectly, as I'll discuss below, along with other comments about Stephens-Dougan 2022 and its correction. I'll refer to the original APSR letter as "Stephens-Dougan 2022" and the correction as "the correction".

---

2.

The pre-analysis plan associated with Stephens-Dougan 2022 listed four outcomes at the top of its page 4, but only one of these outcomes (referred to as "Individual rights and freedom threatened") was reported on in Stephens-Dougan 2022. However, Table 1 of Stephens-Dougan 2022 reported results for three outcomes that were not mentioned in the pre-analysis plan.

The t-statistics for the key interaction term for the three outcomes included in Table 1 of Stephens-Dougan 2022 but not mentioned in the pre-analysis plan were 2.6, 2.0, and 2.1, all of which indicate sufficient evidence. The t-statistics for the key interaction term for the three outcomes mentioned in the pre-analysis plan but omitted from Stephens-Dougan 2022 were 0.6, 0.6, and 0.6, none of which indicate sufficient evidence.

I calculated the t-statistics of 2.6, 2.0, and 2.1 from Table 1 of Stephens-Dougan 2022, by dividing a coefficient by its standard error. I wasn't able to use the correction to calculate the t-statistics of 0.6, 0.6, and 0.6, because the relevant data for these three omitted pre-analysis plan outcomes are not in the correction but instead are in Table A12 of a "replication-final.pdf" file hosted at the Dataverse.
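For reference, the approximate two-sided p-values implied by those t-statistics (using a normal approximation, which should be close given the sample size) are below:

# Two-sided p-values implied by the reported t-statistics (normal approximation)
2 * pnorm(-abs(c(2.6, 2.0, 2.1)))   # about 0.009, 0.046, and 0.036
2 * pnorm(-abs(0.6))                # about 0.55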

That's part of what I meant about an imperfect correction: a reader cannot use information published in the APSR itself to calculate the evidence provided by the outcomes that were planned to be reported on in the pre-analysis plan, or, for that matter, to see how there is substantially less evidence in the unweighted analysis. Instead, a reader needs to go to the Dataverse and dig through table after table of results.

The correction refers to deviations from the pre-analysis plan, but doesn't indicate the particular deviations and doesn't indicate what happens when these deviations are not made.  The "Supplementary Materials Correction-Final.docx" file at the Dataverse for Stephens-Dougan 2022 has a discussion of deviations from the pre-analysis plan, but, as far as I can tell, the discussion does not provide a reason why the results should not be reported for the three omitted outcomes, which were labeled in Table A12 as "Slow the Spread", "Stay Home", and "Too Long to Loosen Restrictions".

It seems to me to be a bad policy to permit researchers to deviate from a pre-analysis plan without justification and to merely report results from a planned analysis on, say, page 46 of a 68-page file on the Dataverse. But a bigger problem might be that, as far as I can tell, many journals don't even attempt to prevent misleading selective reporting for survey research for which there is no pre-analysis plan. Journals could require researchers reporting on surveys to submit or link to the full questionnaire for the surveys or at least to declare that the main text reports on results for all plausible measured outcomes and moderators.

---

3.

Next, let me discuss a method used in Stephens-Dougan 2022 and the correction, which I think is a bad method.

The code for Stephens-Dougan 2022 used measures of stereotypes about Whites and Blacks on the traits of being hardworking and intelligent to create a variable called "negstereotype_endorsement". The code divided respondents into three categories, coded 0 for respondents who did not endorse a negative stereotype about Blacks relative to Whites, 0.5 for respondents who endorsed exactly one of the two negative stereotypes about Blacks relative to Whites, and 1 for respondents who endorsed both negative stereotypes about Blacks relative to Whites. For both Stephens-Dougan 2022 and the correction, Figure 3 reported, for each reported outcome, an estimate of how much the average treatment effect among prejudiced Whites (defined as those coded 1) differed from the average treatment effect among unprejudiced Whites (defined as those coded 0).

The most straightforward way to estimate this difference in treatment effects is to [1] calculate the treatment effect for prejudiced Whites coded 1, [2] calculate the treatment effect for unprejudiced Whites coded 0, and [3] calculate the difference between these treatment effects. The code for Stephens-Dougan 2022 instead estimated this difference using a logit regression that had three predictors: the treatment, the 0/0.5/1 measure of prejudice, and an interaction of the prior two predictors. But, by this method, the estimated difference in treatment effect between the 1 respondents and the 0 respondents depends on the 0.5 respondents. I can't think of a valid reason why responses from the 0.5 respondents should influence an estimated difference between the 0 respondents and the 1 respondents.
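Below is a minimal sketch of the difference between the two approaches, with hypothetical names (d for the data frame, y for the outcome, treat for the treatment indicator, and prej for the 0/0.5/1 measure); this is not the actual Stephens-Dougan 2022 code:

# [1] and [2]: treatment effects estimated separately among the 1 respondents and
# the 0 respondents, so the 0.5 respondents play no role in the difference [3]
fit1 <- glm(y ~ treat, family = binomial, data = subset(d, prej == 1))
fit0 <- glm(y ~ treat, family = binomial, data = subset(d, prej == 0))
# The interaction approach instead fits one model to all respondents, so the
# 0.5 respondents also influence the estimated treat-by-prejudice interaction
fit_int <- glm(y ~ treat * prej, family = binomial, data = d)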

See my Stata output file for more on that. The influence of the 0.5 respondents might not be major in most or all cases, but an APSR reader won't know, based on Stephens-Dougan 2022 or its correction, the extent to which the 0.5 respondents influenced the estimates for the comparison of the 0 respondents to the 1 respondents.

Now about those 0.5 respondents…

---

4.

Remember that the Stephens-Dougan 2022 "negative stereotype endorsement" variable has three levels: 0 for the 74% of respondents who did not endorse a negative stereotype about Blacks relative to Whites, 0.5 for the 16% of respondents who endorsed exactly one of the two negative stereotypes about Blacks relative to Whites, and 1 for the 10% of respondents who endorsed both negative stereotypes about Blacks relative to Whites.

The correction indicates that "I discovered an error in the description of the variable, negative stereotype endorsement" and that "there was no error in the code used to create the variable". So was the intent for Stephens-Dougan 2022 to measure racial prejudice so that only the 1 respondents are considered prejudiced? Or was the intent to consider the 0.5 respondents and the 1 respondents to be prejudiced?

The pre-analysis plan seems to indicate a different method for measuring the moderator of negative stereotype endorsement:

The difference between the rating of Blacks and Whites is taken on both dimensions (intelligence and hard work) and then averaged.

But the pre-analysis plan also indicates that:

For racial predispositions, we will use two or three bins, depending on their distributions.

So, even ignoring the plan to average the stereotype ratings, the pre-analysis plan is inconclusive about whether the intent was to use two or three bins. Let's try this passage from Stephens-Dougan 2022:

A nontrivial fraction of the nationally representative sample—26%—endorsed either the stereotype that African Americans are less hardworking than whites or that African Americans are less intelligent than whites.

So that puts the 16% of respondents at the 0.5 level of negative stereotype endorsement into the same bin as the 10% at the 1 level of negative stereotype endorsement. Stephens-Dougan 2022 doesn't report the percentage that endorsed both negative stereotypes about Blacks. Reporting the percentage of 26% is what would be expected if the intent was to place into one bin any respondent who endorsed at least one of the negative stereotypes about Blacks, so I'm a bit skeptical of the claim in the correction that the description is in error and the code was correct. Maybe I'm missing something, but I don't see how someone who intends to have three bins reports the 26% and does not report the 10%.

For another thing, Stephens-Dougan 2022 has only three figures: Figure 1 reports results for racially prejudiced Whites, Figure 2 reports results for non-racially prejudiced Whites, and Figure 3 reports on the difference between racially prejudiced Whites and non-racially prejudiced Whites. Did Stephens-Dougan 2022 intend to not report results for the group of respondents who endorsed exactly one of the negative stereotypes about Blacks? Did Stephens-Dougan 2022 intend to suggest that respondents who rate Blacks as lazier in general than Whites aren't racially prejudiced as long as they rate Blacks equal to or higher than Whites in general on intelligence?

---

5.

Stephens-Dougan 2022 and the correction depict 84% confidence intervals in all figures. Stephens-Dougan 2022 indicated (footnote omitted) that:

For ease of interpretation, I plotted the predicted probability of agreeing with each pandemic measure in Figure 1, with 84% confidence intervals, the graphical equivalent to p < 0.05.

The 84% confidence interval is useful for assessing whether two estimates differ from each other at about p=0.05, but not for assessing at p=0.05 whether a single estimate differs from a particular number such as zero. So 84% confidence intervals make sense for Figures 1 and 2, in which the key comparisons are of the control estimate to the treatment estimate. But 84% confidence intervals don't make as much sense for Figure 3, which plots only one estimate per outcome and for which the key assessment is whether the estimate differs from zero (Figure 3 in Stephens-Dougan 2022) or from 1 (the correction).
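The arithmetic behind the 84% convention, as a quick sketch: when two estimates have similar standard errors, non-overlapping 84% confidence intervals correspond roughly to a p<0.05 difference, but the relevant multiple for testing a single estimate against a fixed value is 1.96, not 1.41:

qnorm(0.92)              # about 1.41: half-width multiplier for an 84% confidence interval
2 * qnorm(0.92)          # about 2.81: separation (in SE units) at which two equal-SE 84% intervals stop overlapping
qnorm(0.975) * sqrt(2)   # about 2.77: separation (in SE units) corresponding to p = 0.05 for the difference
qnorm(0.975)             # about 1.96: multiple needed to test one estimate against a fixed value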

---

6.

I didn’t immediately realize why, in Figure 3 in Stephens-Dougan 2022, two of the four estimates cross zero, but in Figure 3 in the correction, none of the four estimates cross zero. Then I realized that the estimates plotted in Figure 3 of the correction (but not Figure 3 in Stephens-Dougan 2022) are odds ratios.

The y-axis for odds ratios in Figure 3 of the correction ranges from 0 to 30-something, using a linear scale. The odds ratio that indicates no effect is 1, and an odds ratio can't be negative, so that is why none of the four estimates cross zero in the corrected Figure 3.

It seems like a good idea for a plot of odds ratios to have a reference line at 1, so that readers can assess whether an odds ratio indicating no effect is a plausible value. And a log scale seems like a good idea for odds ratios, too. A relevant prior post mentions that Fenton and Stephens-Dougan 2021 described a "very small" 0.01 odds ratio as "not substantively meaningful".
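As a sketch of the kind of plot I have in mind, with made-up odds ratios and intervals rather than values from the correction (using ggplot2):

library(ggplot2)
# Made-up odds ratios and 84% confidence intervals, for illustration only
d <- data.frame(outcome = c("A", "B", "C", "D"),
                or = c(2.1, 5.3, 0.8, 12.4),
                lo = c(0.9, 1.6, 0.3, 3.2),
                hi = c(4.9, 17.5, 2.1, 48.0))
ggplot(d, aes(x = outcome, y = or)) +
  geom_pointrange(aes(ymin = lo, ymax = hi)) +
  geom_hline(yintercept = 1, linetype = "dashed") +   # reference line at the no-effect value
  scale_y_log10() +                                   # log scale for odds ratios
  labs(x = NULL, y = "Odds ratio (log scale)")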

None of the 84% confidence intervals in Figure 3 of the correction include the no-effect odds ratio of 1, but an 84% confidence interval in Figure A3 of "Supplementary Materials Correction-Final.docx" does.

---

7.

Often, when I alert an author or journal to an error in a publication, the subsequent correction doesn't credit me for my work. Sometimes the correction even suggests that the authors themselves caught the error, as the correction to Stephens-Dougan 2022 seems to do:

After reviewing my code, I discovered an error in the description of the variable, negative stereotype endorsement.

I guess it's possible that Stephens-Dougan "discovered" the error. For instance, maybe after she submitted page proofs, for some reason she decided to review her code, and just happened to catch the error that she had missed before, and it's a big coincidence that this was the same error that I blogged about and alerted the APSR to.

And maybe Stephens-Dougan also discovered that her APSR letter misleadingly deviated from the relevant pre-analysis plan, so that I don't deserve credit for alerting the APSR to that.


PS: Political Science & Politics recently published Hartnett and Haver 2022 "Unconditional support for Trump's resistance prior to Election Day".

Hartnett and Haver 2022 reported on an experiment conducted in October 2020 in which likely Trump voters were asked to consider the hypothetical of a Biden win in the Electoral College and in the popular vote, with a Biden popular vote percentage point win randomly assigned to be from 1 percentage point through 15 percentage points. These likely Trump voters were then asked whether the Trump campaign should resist or concede.

Data were collected before the election, but Hartnett and Haver 2022 did not report anything about a corresponding experiment involving likely Biden voters. Hartnett and Haver 2022 discussed a Reuters/Ipsos poll that "found that 41% of likely Trump voters would not accept a Biden victory and 16% of all likely Trump voters 'would engage in street protests or even violence' (Kahn 2020)". The Kahn 2020 source indicates that the corresponding percentages for Biden voters for a Trump victory were 43% and 22%, so it didn't seem like there was a good reason to not include a parallel experiment for Biden voters, especially because data on only Trump voters wouldn't permit valid inferences about the characteristics on which Trump voters were distinctive.

---

But text for a somewhat corresponding experiment involving likely Biden voters is hidden in the Hartnett and Haver 2022 codebook under white boxes or something like that. The text of the hidden items can be highlighted, copied, and pasted from the bottom of pages 19 and 20 of the codebook PDF (or more hidden text can be copied using ctrl+A, then ctrl+C, and then pasted with ctrl+V).

The hidden codebook text indicates that the hartnett_haver block of the survey had a "bidenlose" item that asked likely Biden voters whether, if Biden wins the popular vote by the randomized percentage points and Trump wins the electoral college, the Biden campaign should "Resist the results of the election in any way possible" or "Concede defeat".

There might be an innocent explanation for Hartnett and Haver 2022 not reporting the results for those items, but that innocent explanation hasn't been shared with me yet on Twitter. Maybe Hartnett and Haver 2022 have a manuscript in progress about the "bidenlose" item.

---

NOTES

1. Hartnett and Haver 2022 seems to be the survey that Emily Badger at the New York Times referred to as "another recent survey experiment conducted by Brian Schaffner, Alexandra Haver and Brendan Hartnett at Tufts". The copied-and-pasted codebook text indicates that this was for the "2020 Tufts Class Survey".

2. On page 18 of the Hartnett and Haver 2022 codebook, above the hidden item about socialism, part of the text of the "certain advantages" item is missing, which seems to be a should-be-obvious indication that text has been covered.

3. The codebook seems to be missing pages of the full survey: in the copied-and-pasted text, page numbers jump from "Page 21 of 43" to "Page 24 of 43" to "Page 31 of 43" to "Page 33 of 43". Presumably at least some missing items were for other members of the Tufts class, although I'm not sure what happened to page 32, which seems to be part of the hartnett_haver block that started on page 31 and ended on page 33.

4. The dataset for Hartnett and Haver 2022 includes a popular vote percentage point win from 1 percentage point through 15 percentage points assigned to likely Biden voters, but the dataset has no data on a resist-or-concede outcome or on a follow-up open-ended item.


The American Political Science Review recently published a letter: Stephens-Dougan 2022 "White Americans' reactions to racial disparities in COVID-19".

Figure 1 of the Stephens-Dougan 2022 APSR letter reports results for four outcomes among racially prejudiced Whites. The 84% confidence interval in the control condition overlaps with the 84% confidence interval in the treatment condition for only one of the four reported outcomes (zooming in on Figure 1, the confidence intervals for the parks outcome don't seem to overlap, and the code returns 0.1795327 for the upper bound for the control and 0.18800818 for the lower bound for the treatment). And even the result with the most obviously overlapping 84% confidence intervals seems to be interpreted as sufficient evidence of an effect, with all four reported outcomes discussed in the passage below:

When racially prejudiced white Americans were exposed to the racial disparities information, there was an increase in the predicted probability of indicating that they were less supportive of wearing face masks, more likely to feel their individual rights were being threatened, more likely to support visiting parks without any restrictions, and less likely to think African Americans adhere to social distancing guidelines.

---

There are at least three things to keep track of: [1] the APSR letter; [2] the survey questionnaire, located at the OSF site for the Time-sharing Experiments for the Social Sciences project; and [3] the pre-analysis plan, located at the OSF and in the appendix of the APSR article. I'll use the PDF of the pre-analysis plan. The TESS site also has the proposal for the survey experiment, but I won't discuss that in this post.

---

The pre-analysis plan does not mention all potential outcome variables that are in the questionnaire, but the pre-analysis plan section labeled "Hypotheses" includes the passage below:

Specifically, I hypothesize that White Americans with anti-Black attitudes and those White Americans who attribute racial disparities in health to individual behavior (as opposed to structural factors), will be more likely to disagree with the following statements:

The United States should take measures aimed at slowing the spread of the coronavirus while more widespread testing becomes available, even if that means many businesses will have to stay closed.

It is important that people stay home rather than participating in protests and rallies to pressure their governors to reopen their states.

I also hypothesize that White Americans with anti-Black attitudes and who attribute racial health disparities to individual behavior will be more likely to agree with the following statements:

State and local directives that ask people to "shelter in place" or to be "safer at home" are a threat to individual rights and freedom.

The United States will take too long in loosening restrictions and the economic impact will be worse with more jobs being lost

The four outcomes mentioned in the passage above correspond to items Q15, Q18, Q16, and Q21 in the survey questionnaire, but, of these four outcomes, the APSR letter reported on only Q16.

The outcome variables in the APSR letter are described as: "Wearing facemasks is not important", "Individual rights and freedom threatened", "Visit parks without any restrictions", and "Black people rarely follow social distancing guidelines". These outcome variables correspond to survey questionnaire items Q20, Q16, Q23A, and Q22A.

---

The pre-analysis plan PDF mentions moderators, with three moderators about racial dispositions: racial resentment, negative stereotype endorsement, and attributions for health disparities. The plan indicates that:

For racial predispositions, we will use two or three bins, depending on their distributions. For ideology and party, we will use three bins. We will include each bin as a dummy variable, omitting one category as a baseline.

The APSR letter reported on only one racial predispositions moderator: negative stereotype endorsement.

---

I'll post a link in the notes below to some of my analyses of the "Specifically, I hypothesize" outcomes, but I don't want to focus on the results. I wanted this post to focus on deviations from the pre-analysis plan because, regardless of whether the estimates from the analyses in the APSR letter are similar to the estimates from the planned analyses in the pre-analysis plan, I think that it's bad that readers can't trust the APSR to ensure that a pre-analysis plan is followed or at least to provide an explanation about why a pre-analysis plan was not followed, especially given that this APSR letter described itself as reporting on "a preregistered survey experiment" and included the pre-analysis plan in the appendix.

---

NOTES

1. The Stephens-Dougan 2022 APSR letter suggests that the negative stereotype endorsement variable was coded dichotomously ("a variable indicating whether the respondent either endorsed the stereotype that African Americans are less hardworking than whites or the stereotype that African Americans are less intelligent than whites"), but the code and the appendix of the APSR letter indicate that the negative stereotype endorsement variable was measured so that the highest level is for respondents who reported a negative relative stereotype about Blacks for both stereotypes. From Table A7:

(unintelligentstereotype2 + lazystereotype2)/2

In the data after running the code for the APSR letter, the negative stereotype endorsement variable is a three-level variable coded 0 for respondents who did not report a negative relative stereotype about Blacks for either stereotype, 0.5 for respondents who reported a negative stereotype about Blacks for one stereotype, and 1 for respondents who reported a negative relative stereotype about Blacks for both stereotypes.

2. The APSR letter indicated that:

The likelihood of racially prejudiced respondents in the control condition agreeing that shelter-in-place orders threatened their individual rights and freedom was 27%, compared with a likelihood of 55% in the treatment condition (p < 0.05 for a one-tailed test).

My analysis using survey weights got 44% and 29% among participants who reported a negative relative stereotype about Blacks for at least one of the two stereotype items, and my analysis got 55% and 26% among participants who reported negative relative stereotypes about Blacks for both stereotype items, with a trivial overlap in 84% confidence intervals.

But the 55% and 26% in a weighted analysis were 43% and 37% in an unweighted analysis with a large overlap in 84% confidence intervals, suggesting that at least some of the results in the APSR letter might be limited to the weighted analysis. I ran the code for the APSR letter removing the weights from the glm command and got the revised Figure 1 plot below. The error bars in the APSR letter are described as 84% confidence intervals.

I think that it's fine to favor the weighted analysis, but I'd prefer that publications indicate when results from an experiment are not robust to the application or non-application of weights. Relevant publication.

3. Given the results in my notes [1] and [2], maybe the APSR letter's Figure 1 estimates are for only respondents who reported negative relative stereotype about Blacks for both stereotypes. If so, the APSR letter's suggestion that this population is the 26% that reported anti-Black stereotypes for either stereotype might be misleading, if the Figure 1 analyses were estimated for only the 10% that reported negative relative stereotype about Blacks for both stereotypes.

For what it's worth, the R code for the APSR letter has code that doesn't use the 0.5 level of the negative stereotype endorsement variable, such as:

# Below are code for predicted probabilities using logit model

# Predicted probability "individualrights_dichotomous"

# Treatment group, negstereotype_endorsement = 1

p1.1 <- invlogit(coef(glm1)[1] + coef(glm1)[2] * 1 + coef(glm1)[3] * 1 + coef(glm1)[4] * 1)

It's possible to see what happens to the Figure 1 results when the negative stereotype endorsement variable is coded 1 for respondents who endorsed at least one of the stereotypes. Run this at the end of the Stata code for the APSR letter:

replace negstereotype_endorsement = ceil((unintelligentstereotype2 + lazystereotype2)/2)

Then run the R code for the APSR letter. Below is the plot I got for a revised Figure 1, with weights applied and the sample limited to respondents who endorsed at least one of the stereotypes:

Estimates in the figure above were close to estimates in my analysis using these Stata commands after running the Stata code from the APSR letter. Stata output.

4. Data, Stata code, and Stata output for my analysis about the "Specifically, I hypothesize" passage of the Stephens-Dougan pre-analysis plan.

My analysis in the Stata output had seven outcomes: the four outcomes mentioned in the "Specifically, I hypothesize" part of the pre-analysis plan as initially measured (corresponding to questionnaire items Q15, Q18, Q16, and Q21), with no dichotomization of five-point response scales for Q15, Q18, and Q16; two of these outcomes (Q15 and Q16) dichotomized as mentioned in the pre-analysis plan (e.g., "more likely to disagree" was split into disagree / not disagree categories, with the not disagree category including respondent skips); and one outcome (Q18) dichotomized so that one category has "Not Very Important" and "Not At All Important" and the other category has the other responses and skips, given that the pre-analysis plan had this outcome dichotomized as disagree but response options in the survey were not on an agree-to-disagree scale. Q21 was measured as a dichotomous variable.

The analysis was limited to presumed racially prejudiced Whites, because I think that that's what the pre-analysis plan hypotheses quoted above focused on. Moreover, that analysis seems more important than a mere difference between groups of Whites.

Note that, for at least some results, a p<0.05 treatment effect might be in the unintuitive direction, so be careful before interpreting a p<0.05 result as evidence for the hypotheses.

My analyses aren't the only analyses that can be conducted, and it might be a good idea to combine results across outcomes mentioned in the pre-analysis plan or across all outcomes in the questionnaire, given that the questionnaire had at least 12 items that could serve as outcome variables.

For what it's worth, I wouldn't be surprised if a lot of people who respond to survey items in an unfavorable way about Blacks backlashed against a message about how Blacks were more likely than Whites to die from covid-19.

5. The pre-analysis plan included a footnote that:

Given the results from my pilot data, it is also my expectation that partisanship will moderate the effect of the treatment or that the treatment effects will be concentrated among Republican respondents.

Moreover, the pre-analysis plan indicated that:

The condition and treatment will be blocked by party identification so that there are roughly equal numbers of Republicans and Democrats in each condition.

But the lone mention of "Repub-" in the APSR letter is:

The sample was 39% self-identified Democrats (including leaners) and 46% self-identified Republicans (including leaners).

6. Link to tweets about the APSR letter.


According to a 2018-06-18 "survey roundup" blog post by Karthick Ramakrishnan and Janelle Wong (with a link to the blog post tweeted by Jennifer Lee):

Regardless of the question wording, a majority of Asian American respondents express support for affirmative action, including when it is applied specifically to the context of higher education.

However, a majority of Asian American respondents did not express support for affirmative action in data from the National Asian American Survey 2016 Post-Election Survey [data here, dataset citation: Karthick Ramakrishnan, Jennifer Lee, Taeku Lee, and Janelle Wong. National Asian American Survey (NAAS) 2016 Post-Election Survey. Riverside, CA: National Asian American Survey. 2018-03-03.]

Tables below contain item text from the questionnaire. My analysis sample was limited to participants coded 1 for "Asian American" in the dataset's race variable. The three numeric columns in the tables for each item are respectively for: [1] data that are unweighted; [2] data with the nweightnativity weight applied, described in the dataset as "weighted by race/ethnicity and state, nativity, gender, education (raking method"; and [3] data with the pidadjweight weight applied, described in the dataset as "adjusted for partyID variation by ethnicity in re-interview cooperation rate for". See slides 4 and 14 here for more details on the study methodology.
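For reference, weighted proportions of this sort can be computed with the survey package; the sketch below uses a hypothetical data frame name (naas) and a hypothetical item name (Q7_2), with the weight variable names taken from the dataset:

library(survey)
# naas: hypothetical name for the analysis sample (Asian American respondents only)
des_nativity <- svydesign(ids = ~1, weights = ~nweightnativity, data = naas)
des_pid      <- svydesign(ids = ~1, weights = ~pidadjweight,    data = naas)
mean(naas$Q7_2 == 1, na.rm = TRUE)                   # unweighted proportion supporting the policy
svymean(~I(Q7_2 == 1), des_nativity, na.rm = TRUE)   # nweightnativity-weighted proportion
svymean(~I(Q7_2 == 1), des_pid, na.rm = TRUE)        # pidadjweight-weighted proportion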

The table below reports on results for items about opinions of particular racial preferences in hiring and promotion. A majority of Asian American respondents did not support these race-based affirmative action policies:

[Table image: NAAS-Post3]

The next table reports on results for items about opinions of particular uses of race in university admissions decisions. A majority of Asian American respondents did not support these race-based affirmative action policies:

[Table image: NAAS-Post4]

I'm not sure why these post-election data were not included in the 2018-06-18 blog post survey roundup or mentioned in this set of slides. I'm also not sure why the manipulations for the university admissions items include only treatments in which the text suggests that Asian American applicants are advantaged by consideration of race, instead of (or in addition to) treatments in which the text suggests that Asian American applicants are disadvantaged by consideration of race, which would have been perhaps as plausible or more plausible.

---

Notes:

1. Code to reproduce my analyses is here. Including Pacific Islanders and restricting the Asian American sample to U.S. citizens did not produce majority support for any affirmative action item reported on above or for the sex-based affirmative action item (Q7.2).

2. The survey had a sex-based affirmative action item (Q7.2) and had items about whether the participant, a close relative of the participant, or a close personal friend of the participant was advantaged or was disadvantaged by affirmative action (Q7.8 to Q7.11). For the Asian American sample, support for preferential hiring and promotion of women in Q7.2 was at 46% unweighted and at 44% when either weighting variable was applied.

3. This NAAS webpage indicates a 2017-12-05 date for the pre-election survey dataset, and on 2017-12-06 the @naasurvey account tweeted a blurb about these data being available for download. However, that same NAAS webpage lists a 2018-03-03 date for the post-election survey dataset, but I did not see an @naasurvey tweet for that release, and that NAAS webpage did not have a link to the post-election data at least as late as 2018-08-16. I tweeted a question about the availability of the post-election data on 2018-08-31 and then sent in an email and later found the data available at the webpage. I think that this might be the NSF grant for the post-election survey, which indicated that the data were to be publicly released through ICPSR in June 2017.


This post reports on publication bias analyses for the Tara L. Mitchell et al. 2005 meta-analysis: "Racial Bias in Mock Juror Decision-Making: A Meta-Analytic Review of Defendant Treatment" [gated, ungated]. The appendices for the article contained a list of sample sizes and effect sizes, but the list did not match the reported results in at least one case. Dr. Mitchell emailed me a file of the correct data (here).

VERDICTS

Here is the funnel plot for the Mitchell et al. 2005 meta-analysis of verdicts:

[Figure: mitchell-et-al-2005-verdicts-funnel-plot]

Egger's test did not indicate at the conventional level of statistical significance the presence of funnel plot asymmetry in any of the four funnel plots, with p-values of p=0.80 (white participants, published studies), p=0.82 (white participants, all studies), p=0.10 (black participants, published studies), and p=0.63 (black participants, all studies).

Trim-and-fill with the L0 estimator imputed missing studies for all four funnel plots to the side of the funnel plot indicating same-race favoritism:

[Figure: mitchell-et-al-2005-verdicts-tf-l0]

Trim-and-fill with the R0 estimator imputed missing studies for only the funnel plots for published studies with black participants:

[Figure: mitchell-et-al-2005-verdicts-tf-r0]

---

SENTENCES

Here is the funnel plot for the Mitchell et al. 2005 meta-analysis of sentences:

[Figure: mitchell-et-al-2005-sentences-funnel-plot]

Egger's test did not indicate at the conventional level of statistical significance the presence of funnel plot asymmetry in any of the four funnel plots, with p-values of p=0.14 (white participants, published studies), p=0.41 (white participants, all studies), p=0.50 (black participants, published studies), and p=0.53 (black participants, all studies).

Trim-and-fill with the L0 estimator imputed missing studies for the funnel plots with white participants to the side of the funnel plot indicating same-race favoritism:

[Figure: mitchell-et-al-2005-sentences-tf-l0]

Trim-and-fill with the R0 estimator did not impute any missing studies:

[Figure: mitchell-et-al-2005-sentences-tf-r0]

---

I also attempted to retrieve and plot data for the Ojmarrh Mitchell 2005 meta-analysis ("A Meta-Analysis of Race and Sentencing Research: Explaining the Inconsistencies"), but the data were reportedly lost in a computer crash.

---

NOTES:

1. Data and code for the Mitchell et al. 2005 analyses are here: data file for verdicts, data file for sentences, R code for verdicts, and R code for sentences.
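For readers who want to run this kind of analysis themselves, below is a minimal sketch using the metafor package, with a hypothetical data frame dat that has an effect size column yi and a sampling variance column vi; it is not my exact code:

library(metafor)
# dat: one row per study, with effect size yi and sampling variance vi
res <- rma(yi, vi, data = dat)
regtest(res)                      # Egger-type regression test for funnel plot asymmetry
funnel(res)                       # funnel plot
trimfill(res, estimator = "L0")   # trim-and-fill with the L0 estimator
trimfill(res, estimator = "R0")   # trim-and-fill with the R0 estimator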


Researchers often have the flexibility to report only the results they want to report, so an important role for peer reviewers is to request that researchers report results that a reasonable skeptical reader might suspect have been strategically unreported. I'll discuss two publications where obvious peer review requests do not appear to have been made and, presuming these requests were not made, how requests might have helped readers better assess evidence in the publication.

---

Example 1. Ahlquist et al. 2014 "Alien Abduction and Voter Impersonation in the 2012 U.S. General Election: Evidence from a Survey List Experiment"

Ahlquist et al. 2014 reports on two list experiments: one list experiment is from December 2012 and has 1,000 cases, and another list experiment is from September 2013 and has 3,000 cases.

Figure 1 of Ahlquist et al. 2014 reports results for the 1,000-person list experiment estimating the prevalence of voter impersonation in the 2012 U.S. general election; the 95% confidence intervals for the full sample and for each reported subgroup cross zero. Figure 2 reports results for the full sample of the 3,000-person list experiment estimating the prevalence of voter impersonation in the 2012 U.S. general election, but Figure 2 did not include subgroup results. Readers are thus left to wonder why subgroup results were not reported for the larger sample that had more power to detect an effect among subgroups.

Moreover, the main voting irregularity list experiment reported in Ahlquist et al. 2014 concerned voter impersonation, but, in footnote 15, Ahlquist et al. discuss another voting irregularity list experiment that was part of the study, about whether political candidates or activists offered the participant money or a gift for their vote:

The other list experiment focused on vote buying and closely mimicked that described in Gonzalez-Ocantos et al. (2012). Although we did not anticipate discovering much vote buying in the USA we included this question as a check, since a similar question successfully discovered voting irregularities in Nicaragua. As expected we found no evidence of vote buying in the USA. We omit details here for space considerations, though results are available from the authors and in the online replication materials...

The phrasing of the footnote does not make clear whether the inference of "no evidence of vote buying in the USA" is restricted to an analysis of the full sample or also covers analyses of subgroups.

So the article leaves at least two questions unanswered for a skeptical reader:

  1. Why report subgroup analyses for only the smaller sample?
  2. Why not report the overall estimate and subgroup analyses for the vote buying list experiment?

Sure, for question 2, Ahlquist et al. indicate that the details of the vote buying list experiment were omitted for "space considerations"; however, the 16-page Ahlquist et al. 2014 article is shorter than the other two articles in the journal issue, which are 17 pages and 24 pages.

Peer reviewer requests that could have helped readers were to request a detailed report on the vote buying list experiment and to request a report of subgroup analyses for the 3,000-person sample.

---

Example 2. Sen 2014 "How Judicial Qualification Ratings May Disadvantage Minority and Female Candidates"

Sen 2014 reports logit regression results in Table 3 for four models predicting the ABA rating given to U.S. District Court nominees from 1962 to 2002, with ratings dichotomized into (1) well qualified or exceptionally well qualified and (2) not qualified or qualified.

Model 1 includes a set of variables such as the nominee's sex, race, partisanship, and professional experience (e.g., law clerk, state judge). Compared to model 1, model 2 omits the partisanship variable and adds year dummies. Compared to model 2, model 3 adds district dummies and interaction terms for female*African American and female*Hispanic. And compared to model 3, model 4 removes the year dummies and adds a variable for years of practice and a variable for the nominee's estimated ideology.

The first question raised by the table is the omission of the partisanship variable for models 2, 3, and 4, with no indication of the reason for that omission. The partisanship variable is not statistically significant in model 1, and Sen 2014 notes that the partisanship variable "is never statistically significant under any model specification" (p. 44), but it is not clear why the partisanship variable is dropped in the other models because other variables appear in all four models and never reach statistical significance.

The second question raised by the table is why years of practice appears in only the fourth model, in which roughly one-third of cases are lost due to the inclusion of estimated nominee ideology. Sen 2014 Table 2 indicates that male and white nominees had substantially more years of practice than female and black nominees: men (16.87 years), women (11.02 years), whites (16.76 years), and blacks (10.08 years); therefore, any model assessing whether ABA ratings are biased should account for sex and race differences in years of practice, under the reasonable expectation that nominees should receive higher ratings for more experience.

Peer reviewer requests that could have helped readers were to request a discussion of the absence of the partisanship variable from models 2, 3, and 4, and to request that years of experience be included in more of the models.

---

Does it matter?

Data for Ahlquist et al. 2014 are posted here. I reported on my analysis of the data in a manuscript rejected after peer review by the journal that published Ahlquist et al. 2014.

My analysis indicated that the weighted list experiment estimate of vote buying for the 3,000-person sample was 5 percent (p=0.387), with a 95% confidence interval of [-7%, 18%]. I'll echo my earlier criticism and note that a 25-percentage-point-wide confidence interval is not informative about the prevalence of voting irregularities in the United States because all plausible estimates of U.S. voting irregularities fall within 12.5 percentage points of zero.
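For readers unfamiliar with list experiments, the prevalence estimate is the difference in mean item counts between the treatment list (which includes the sensitive item) and the control list; below is a minimal sketch with hypothetical variable names (count for the number of items endorsed, treat for the treatment indicator, and wt for the survey weight):

# Weighted difference-in-means list experiment estimator (hypothetical variable names)
est <- weighted.mean(count[treat == 1], wt[treat == 1]) -
       weighted.mean(count[treat == 0], wt[treat == 0])
est   # estimated prevalence of the sensitive item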

Ahlquist et al. 2014 footnote 14 suggests that imputed data on participant voter registration were available, so a peer reviewer could have requested reporting of the vote buying list experiments restricted to registered voters, given that only registered voters have a vote to trade. I did not see a variable for registration in the dataset for the 1,000-person sample, but the list experiment for the 3,000-person sample produced the weighted point estimate that 12 percent of persons listed as registered to vote were contacted by political candidates or activists around the 2012 U.S. general election with an offer to exchange money or gifts for a vote (p=0.018).

I don't believe that this estimate is close to correct, and, given sufficient subgroup analyses, some subgroup analyses would be expected to produce implausible or impossible results, but peer reviewers requesting these data might have produced a more tentative interpretation of the list experiments.

---

For Sen 2014, my analysis indicated that the estimates and standard errors for the partisanship variable (coded 1 for nomination by a Republican president) inflate unusually high when that variable is included in models 2, 3, and 4: the coefficient and standard error for the partisanship variable are 0.02 and 0.11 in model 1, but inflate to 15.87 and 535.41 in model 2, 17.90 and 1,455.40 in model 3, and 18.21 and 2,399.54 in model 4.

The Sen 2014 dataset had variables named Bench.Years, Trial.Years, and Private.Practice.Years. The years of experience for these variables overlap (e.g., nominee James Gilstrap was born in 1957 and respectively has 13, 30, and 30 years for these variables); therefore, the variables cannot be summed to construct a variable for total years of legal experience that does not include double- or triple-counting for some cases. Bench.Years correlates with Trial.Years at -0.47 and with Private.Practice.Years at -0.39, but Trial.Years and Private.Practice.Years correlate at 0.93, so I'll include only Bench.Years and Trial.Years, given that Trial.Years appears more relevant for judicial ratings than Private.Practice.Years.

My analysis indicated that women and blacks had a higher Bench.Years average than men and whites: men (4.05 years), women (5.02 years), whites (4.02 years), and blacks (5.88 years). Restricting the analysis to nominees with nonmissing nonzero Bench.Years, men had slightly more experience than women (9.19 years to 8.36 years) and blacks had slightly more experience than whites (9.33 years to 9.13 years).

Adding Bench.Years and Trial.Years to the four Table 3 models did not produce any meaningful difference in results for the African American, Hispanic, and Female variables, but the p-value for the Hispanic main effect fell to 0.065 in model 4 with Bench.Years added.

---

I estimated a simplified model with the following variables predicting the dichotomous ABA rating variable for each nominee with available data: African American nominee, Hispanic nominee, female nominee, Republican nominee, nominee age, law clerk experience, law school tier (from 1 to 6), Bench0 and Trial0 (no bench or trial experience respectively), Bench.Years, and Trial.Years. These variables reflect demographics, nominee quality, and nominee experience, with a presumed penalty for nominees who lack bench and/or trial experience. Results are below:

[Figure: aba1 (logit results for the simplified model)]

The female coefficient was not statistically significant in the above model (p=0.789), but the coefficient was much closer to statistical significance when adding a control for the year of the nomination:

[Figure: aba2 (logit results with the nomination-year control)]

District.Court.Nomination.Year was positively related to the dichotomous ABA rating variable (r=0.16) and to the female variable (r=0.29), and the ABA rating increased faster over time for women than for men (but not at a statistically-significant level: p=0.167), so I estimated a model that interacted District.Court.Nomination.Year with Female and with the race/ethnicity variables:

[Figure: aba3 (logit results with the nomination-year interactions)]

The model above provides some evidence for an over-time reduction of the sex gap (p=0.095) and the black/white gap (p=0.099).

The next model is the second model reported above, but with estimated nominee ideology added, coded with higher values indicating higher levels of conservatism:

[Figure: aba4 (logit results with estimated nominee ideology added)]

So there is at least one reasonable model specification that produces evidence of bias against conservative nominees, at least to the extent that the models provide evidence of bias. After all, ABA ratings are based on three criteria (integrity, professional competence, and judicial temperament), but the models include information for only professional competence, so a sex, race, and ideological gap in the models could indicate bias and/or could indicate a sex, race, and ideological gap in nonbiased ABA evaluations of integrity and/or judicial temperament and/or elements of professional competence that are not reflected in the model measures. Sen addressed the possibility of gaps in these other criteria, starting on page 47 of the article.

For what it's worth, evidence of the bias against conservatives is stronger when excluding the partisanship control:

[Figure: aba5 (logit results excluding the partisanship control)]

---

The above models for the Sen reanalysis should be interpreted to reflect the fact that there are many reasonable models that could be reported. My assessment from the models that I estimated is that the black/white gap is extremely if not completely robust, the Hispanic/white gap is less robust but still very robust, the female/male gap is less robust but still somewhat robust, and the ideology gap is the least robust of the group.

I'd have liked for the peer reviewers on Sen 2014 to have requested results for the peer reviewers' preferred model, with requested models based only on available data and results reported in at least an online supplement. This would provide reasonable robustness checks for an analysis for which there are many reasonable model specifications. Maybe that happened: the appendix table in the working paper version of Sen 2014 is somewhat different than the published logit regression table. In any event, indicating which models were suggested by peer reviewers might help reduce skepticism about the robustness of reported models, to the extent that models suggested by a peer reviewer have not been volunteered by the researchers.

---

NOTES FOR AHLQUIST ET AL. 2014:

1. Subgroup analyses might have been reported for only the smaller 1,000-person sample because the smaller sample was collected first. However, that does not mean that the earlier sample should be the only sample for which subgroup analyses are reported.

2. Non-disaggregated results for the 3,000-person vote buying list experiment and disaggregated results for the 1,000-person vote buying list experiment were reported in a prior version of Ahlquist et al. 2014, which Dr. Ahlquist sent me. However, a reader of Ahlquist et al. 2014 might not be aware of these results, so Ahlquist et al. 2014 might have been improved by including these results.

---

NOTES FOR SEN 2014:

1. Ideally, models would include a control for having at least twelve years of experience, given that the ABA Standing Committee on the Federal Judiciary "...believes that a prospective nominee to the federal bench ordinarily should have at least twelve years' experience in the practice of law" (p. 3, here). Sen 2014 reports results for a matching analysis that reflects the 12-year threshold, at least for the Trial.Years variable, but I'm less confident in matching results, given the loss of cases (e.g., from 304 women in Table 1 to 65 women in Table 4) and the loss of information (e.g., cases appear to be matched so that nominees with anywhere from 0 to 12 years on Trial.Years are matched on Trial.Years).

2. I contacted the ABA and sent at least one email to the ABA liaison for the ABA committee that handles ratings for federal judicial nominations, asking whether data could be made available for nominee integrity and judicial temperament, such as a dichotomous indication whether an interviewee had raised concerns about the nominee's integrity or judicial temperament. The ABA Standing Committee on the Federal Judiciary prepares a written statement (e.g., here) that describes such concerns for nominees rated as not qualified, if the ABA committee is asked to testify at a Senate Judiciary Committee hearing for the nominee (see p. 8 here). I have not yet received a reply to my inquiries.

---

GENERAL NOTES

1. Data for Ahlquist et al. 2014 are here. Code for my additional analyses is here.

2. Dr. Sen sent me data and R code, but the Sen 2014 data and code do not appear to be online now. Maya Sen's Dataverse is available here. R code for the supplemental Sen models described above is here.

Tagged with: , , , , ,