1.
In May, I published a blog post about deviations from the pre-analysis plan for the Stephens-Dougan 2022 APSR letter, and I tweeted a link to the blog post that tagged @LaFleurPhD and asked her directly about the deviations from the pre-analysis plan. I don't recall receiving a response from Stephens-Dougan, and, a few days later, on May 31, I emailed the APSR about my post, listing three concerns:
* The Stephens-Dougan 2022 description of racially prejudiced Whites not matching how the code for Stephens-Dougan 2022 calculated estimates for racially prejudiced Whites.
* The substantial deviations from the pre-analysis plan.
* Figure 1 of the APSR letter reporting weighted estimates, but the evidence being much weaker in unweighted analyses.
Six months later (December 5), the APSR has published a correction to Stephens-Dougan 2022. The correction addresses each of my three concerns, but not perfectly, which I'll discuss below, along with other discussion about Stephens-Dougan 2022 and its correction. I'll refer to the original APSR letter as "Stephens-Dougan 2022" and the correction as "the correction".
---
2.
The pre-analysis plan associated with Stephens-Dougan 2022 listed four outcomes at the top of its page 4, but only one of these outcomes (referred to as "Individual rights and freedom threatened") was reported on in Stephens-Dougan 2022. However, Table 1 of Stephens-Dougan 2022 reported results for three outcomes that were not mentioned in the pre-analysis plan.
The t-statistics for the key interaction term for the three outcomes included in Table 1 of Stephens-Dougan 2022 but not mentioned in pre-analysis plan were 2.6, 2.0, and 2.1, all of which indicate sufficient evidence. The t-statistics for the key interaction term mentioned in pre-analysis plan but omitted from Stephens-Dougan 2022 were 0.6, 0.6, and 0.6, none of which indicate sufficient evidence.
I calculated the t-statistics of 2.6, 2.0, and 2.1 from Table 1 of Stephens-Dougan 2022, by dividing a coefficient by its standard error. I wasn't able to use the correction to calculate the t-statistics of 0.6, 0.6, and 0.6, because the relevant data for these three omitted pre-analysis plan outcomes are not in the correction but instead are in Table A12 of a "replication-final.pdf" file hosted at the Dataverse.
That's part of what I meant about an imperfect correction: a reader cannot use information published in the APSR itself to calculate the evidence provided by the outcomes that were planned to be reported on in the pre-analysis plan, or, for that matter, to see how there is substantially less evidence in the unweighted analysis. Instead, a reader needs to go to the Dataverse and dig through table after table of results.
The correction refers to deviations from the pre-analysis plan, but doesn't indicate the particular deviations and doesn't indicate what happens when these deviations are not made. The "Supplementary Materials Correction-Final.docx" file at the Dataverse for Stephens-Dougan 2022 has a discussion of deviations from the pre-analysis plan, but, as far as I can tell, the discussion does not provide a reason why the results should not be reported for the three omitted outcomes, which were labeled in Table A12 as "Slow the Spread", "Stay Home", and "Too Long to Loosen Restrictions".
It seems to me to be a bad policy to permit researchers to deviate from a pre-analysis plan without justification and to merely report results from a planned analysis on, say, page 46 of a 68-page file on the Dataverse. But a bigger problem might be that, as far as I can tell, many journals don't even attempt to prevent misleading selective reporting for survey research for which there is no pre-analysis plan. Journals could require researchers reporting on surveys to submit or link to the full questionnaire for the surveys or at least to declare that the main text reports on results for all plausible measured outcomes and moderators.
---
3.
Next, let me discuss a method used in Stephens-Dougan 2022 and the correction, which I think is a bad method.
The code for Stephens-Dougan 2022 used measures of stereotypes about Whites and Blacks on the traits of hard working and intelligent, to create a variable called "negstereotype_endorsement". The code divided respondents into three categories, coded 0 for respondents who did not endorse a negative stereotype about Blacks relative to Whites, 0.5 for respondents who endorsed exactly one of the two negative stereotypes about Blacks relative to Whites, and 1 for respondents who endorsed both negative stereotypes about Blacks relative to Whites. For both Stephens-Dougan 2022 and the correction, Figure 3 reported for each reported outcome an estimate of how much the average treatment effect among prejudiced Whites (defined as those coded 1) differed from the average treatment effect among unprejudiced Whites (defined as those coded 0).
The most straightforward way to estimate this difference in treatment effects is to [1] calculate the treatment effect for prejudiced Whites coded 1, [2] calculate the treatment effect for unprejudiced Whites coded 0, and [3] calculate the difference between these treatment effects. The code for Stephens-Dougan 2022 instead estimated this difference using a logit regression that had three predictors: the treatment, the 0/0.5/1 measure of prejudice, and an interaction of the prior two predictors. But, by this method, the estimated difference in treatment effect between the 1 respondents and the 0 respondents depends on the 0.5 respondents. I can't think of a valid reason why responses from the 0.5 respondents should influence an estimated difference between the 0 respondents and the 1 respondents.
See my Stata output file for more on that. The influence of the 0.5 respondents might not be major in most or all cases, but an APSR reader won't know, based on Stephens-Dougan 2022 or its correction, the extent to which the 0.5 respondents influenced the estimates for the comparison of the 0 respondents to the 1 respondents.
Now about those 0.5 respondents…
---
4.
Remember that the Stephens-Dougan 2022 "negative stereotype endorsement" variable has three levels: 0 for the 74% of respondents who did not endorse a negative stereotype about Blacks relative to Whites, 0.5 for the 16% of respondents who endorsed exactly one of the two negative stereotypes about Blacks relative to Whites, and 1 for the 10% of respondents who endorsed both negative stereotypes about Blacks relative to Whites.
The correction indicates that "I discovered an error in the description of the variable, negative stereotype endorsement" and that "there was no error in the code used to create the variable". So was the intent for Stephens-Dougan 2022 to measure racial prejudice so that only the 1 respondents are considered prejudiced? Or was the intent to consider the 0.5 respondents and the 1 respondents to be prejudiced?
The pre-analysis plan seems to indicate a different method for measuring the moderator of negative stereotype endorsement:
The difference between the rating of Blacks and Whites is taken on both dimensions (intelligence and hard work) and then averaged.
But the pre-analysis plan also indicates that:
For racial predispositions, we will use two or three bins, depending on their distributions.
So, even ignoring the plan to average the stereotype ratings, the pre-analysis plan is inconclusive about whether the intent was to use two or three bins. Let's try this passage from Stephens-Dougan 2022:
A nontrivial fraction of the nationally representative sample—26%—endorsed either the stereotype that African Americans are less hardworking than whites or that African Americans are less intelligent than whites.
So that puts the 16% of respondents at the 0.5 level of negative stereotype endorsement into the same bin as the 10% at the 1 level of negative stereotype endorsement. Stephens-Dougan 2022 doesn't report the percentage that endorsed both negative stereotypes about Blacks. Reporting the percentage of 26% is what would be expected if the intent was to place into one bin any respondent who endorsed at least one of the negative stereotypes about Blacks, so I'm a bit skeptical of the claim in the correction that the description is in error and the code was correct. Maybe I'm missing something, but I don't see how someone who intends to have three bins reports the 26% and does not report the 10%.
For another thing, Stephens-Dougan 2022 has only three figures: Figure 1 reports results for racially prejudiced Whites, Figure 2 reports results for non-racially prejudiced Whites, and Figure 3 reports on the difference between racially prejudiced Whites and non-racially prejudiced Whites. Did Stephens-Dougan 2022 intend to not report results for the group of respondents who endorsed exactly one of the negative stereotypes about Blacks? Did Stephens-Dougan 2022 intend to suggest that respondents who rate Blacks as lazier in general than Whites aren't racially prejudiced as long as they rate Blacks equal to or higher than Whites in general on intelligence?
---
5.
Stephens-Dougan 2022 and the correction depict 84% confidence intervals in all figures. Stephens-Dougan 2022 indicated (footnote omitted) that:
For ease of interpretation, I plotted the predicted probability of agreeing with each pandemic measure in Figure 1, with 84% confidence intervals, the graphical equivalent to p < 0.05.
The 84% confidence interval is good for assessing a p=0.05 difference between estimates, but not for assessing at p=0.05 whether an estimate differs from a particular number such as zero. So 84% confidence intervals make sense for Figures 1 and 2, in which the key comparisons are of the control estimate to the treatment estimate. But 84% confidence intervals don't make as much sense for Figure 3, which plot only one estimate and for which the key assessment is whether the estimate differs from zero (Figure 3 in Stephens-Dougan 2022) or from 1 (the correction).
---
6.
I didn’t immediately realize why, in Figure 3 in Stephens-Dougan 2022, two of the four estimates cross zero, but in Figure 3 in the correction, none of the four estimates cross zero. Then I realized that the estimates plotted in Figure 3 of the correction (but not Figure 3 in Stephens-Dougan 2022) are odds ratios.
The y-axis for odds ratios for Figure 3 of the correction ranges from 0 to 30-something, using a linear scale. The odds ratio that indicates no effect is 1, and an odds ratio can't be negative, so that it why none of the four estimates cross zero in the corrected Figure 3.
It seems like a good idea for a plot of odds ratios to have a guideline for 1, so that readers can assess whether an odds ratio indicating no effect is a plausible value. And a log scale seems like a good idea for odds ratios, too. Relevant prior post that mentions that Fenton and Stephens-Dougan 2021 described a "very small" 0.01 odds ratio as "not substantively meaningful".
None of the 84% confidence intervals for Figure 3 capture an odds ratio that crosses 1, but an 84% confidence interval for Figure A3 in "Supplementary Materials Correction-Final.docx" does.
---
7.
Often, when I alert an author or journal to an error in a publication, the subsequent correction doesn't credit me for my work. Sometimes the correction even suggests that the authors themselves caught the error, like the correction to Stephens-Dougan 2022 seems to do:
After reviewing my code, I discovered an error in the description of the variable, negative stereotype endorsement.
I guess it's possible that Stephens-Dougan "discovered" the error. For instance, maybe after she submitted page proofs, for some reason she decided to review her code, and just happened to catch the error that she had missed before, and it's a big coincidence that this was the same error that I blogged about and alerted the APSR to.
And maybe Stephens-Dougan also discovered that her APSR letter misleadingly deviated from the relevant pre-analysis plan, so that I don't deserve credit for alerting the APSR to that.