PS: Political Science & Politics published Dietrich and Hayes 2022 "Race and Symbolic Politics in the US Congress" as part of a "Research on Race and Ethnicity in Legislative Studies" section with guest editors Tiffany D. Barnes and Christopher J. Clark.

---

1.

Dietrich and Hayes 2022 reported on an experiment in which a representative was randomized to be White or Black, the representative's speech was randomized to be about civil rights or renewable energy, and the representative's speech was randomized to include or not include symbolic references to the Civil Rights Movement. Dietrich and Hayes 2022 noted (p. 283) that:

When those same symbols were used outside of the domain of civil rights, however, white representatives received a significant punishment. That is, Black respondents were significantly more negative in their evaluations of white representatives who (mis-)used civil rights symbolism to advance renewable energy than in any other experimental condition.

The only numeric results that Dietrich and Hayes 2022 reported for this in the main text are in Figure 1, for an approval rating outcome. But the data file seems to have at least four potential outcomes: the symbolic_approval outcome (strongly disapprove to strongly approve), and the next three listed variables: symbolic_vote (extremely likely to extremely unlikely), symbolic_care (none to a lot), and symbolic_thermometer (0 to 100). The supplemental files have a figure that reports results for a dv_therm variable, but that figure doesn't report results for renewable energy separately by symbolic and non-symbolic.

---

2.

Another result reported in Dietrich and Hayes 2022 involved Civil Rights Movement symbolism in U.S. House of Representatives floor speeches that mentioned civil rights:

In addition to influencing African Americans' evaluation of representatives, our research shows that symbolic references to the civil rights struggle are linked to Black voter turnout. Using an analysis of validated voter turnout from the 2006–2018 Cooperative Election Study, our analyses suggest that increases in the number of symbolic speeches given by a member of Congress during a given session are associated with an increase in Black turnout in the subsequent congressional election. Our model predicts that increasing from the minimum of symbolic speeches in the previous Congress to the maximum in the current Congress is associated with a 65.67-percentage-point increase in Black voter turnout compared to the previous year.

This estimated 66 percentage point increase is at the congressional district level. Dietrich and Hayes 2022 calculated this estimate using a linear regression that predicted the change in Black turnout in a congressional district with a lagged symbolism percentage of 0 and a symbolism percentage of 1. From their code:

mod1<-lm(I(black_turnout-lag_black_turnout)~I(symbolic_percent-lag_symbolic_percent),data=cces)

print(round(predict(mod1,data.frame(symbolic_percent=1,lag_symbolic_percent=0))*100,2))

The estimated change in Black turnout was 85 percentage points when I modified the code to have a lagged symbolism percentage of 1 and a symbolism percentage of 0.
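
Because the regression has a single predictor (the change in symbolic_percent), the two predictions above are enough to back out an approximate intercept and slope. Below is a minimal Python sketch of that arithmetic, not the authors' code. It assumes the rounded 85 figure from my modified run, so the results are approximate, and the names `intercept` and `slope` are labels introduced here for illustration.

```python
# Predictions (in percentage points, after the *100 in the quoted R code) from
#   predicted change = intercept + slope * (symbolic_percent - lag_symbolic_percent)
pred_min_to_max = 65.67  # symbolic_percent=1, lag=0, so the predictor is +1
pred_max_to_min = 85.0   # symbolic_percent=0, lag=1, so the predictor is -1 (rounded)

# Two equations, two unknowns:
#   intercept + slope = 65.67
#   intercept - slope = 85
intercept = (pred_min_to_max + pred_max_to_min) / 2
slope = (pred_min_to_max - pred_max_to_min) / 2

print(round(intercept, 1))  # 75.3
print(round(slope, 1))      # -9.7
```

If the 85 figure is right, then most of the 65.67 would come from the intercept (the predicted change in Black turnout when symbolic_percent does not change), and the implied slope would be negative.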

---

These estimated changes in Black turnout of 66 and 85 percentage points seemed implausible as causal estimates, and I'm not even sure that these are correct correlational estimates, based on the data in the "cces_turnout_results.csv" dataset in the hayes_dietrich_replication.zip file.

For one thing, the dataset lists symbolic_percent values for Alabama's fourth congressional district by row as 0.017857143, 0.047619048, 0.047619048, 0.013157895, 0.013157895, 0.004608295, 0.004608295, 0.00990099, 0.00990099, 1, 1, 1, and 1. For speeches that mentioned civil rights, that's a relatively large jump in the percentage of such speeches that used Civil Rights Movement symbolism, from several values under 5% all the way to 100%. And this large jump to 100% is not limited to this congressional district: the mean symbolic_percent values across the full dataset were 0.14 (109th Congress), 0.02 (110th), 0.02 (111th), 0.03 (112th), 0.09 (113th), 1 (114th), and 1 (115th).

Moreover, the repetition in symbolic_percent within a congressional district is consistent across the data that I checked. For the above district, 0.017857143 is for the 109th Congress, the first 0.047619048 is for one year of the 110th Congress, the second 0.047619048 is for the other year of the 110th Congress, the two 0.013157895 values are for the two years of the 111th Congress, and so forth. From what I can tell, each dataset case is for a given district-year, but symbolic_percent is calculated only once per two-year Congress. As a result, a large percentage of the "I(symbolic_percent-lag_symbolic_percent)" predictors are zero because of a research design decision to calculate the percentage of symbolic speeches per Congress and not per year. From what I can tell, these zeros might not be true zeros in which the percentage of symbolic speeches was the same in the given year and the lagged year.
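
The pattern of zeros can be seen directly from the Alabama values listed above. This is a minimal Python sketch, assuming that the lag is simply the previous row for the same district (the replication code might construct the lag differently).

```python
# Alabama's 4th district symbolic_percent values, in row order, as listed above.
symbolic_percent = [
    0.017857143, 0.047619048, 0.047619048, 0.013157895, 0.013157895,
    0.004608295, 0.004608295, 0.00990099, 0.00990099, 1, 1, 1, 1,
]

# Row-to-row change, mirroring I(symbolic_percent - lag_symbolic_percent).
# Assumption: the lag is the previous row for the same district.
changes = [cur - prev for prev, cur in zip(symbolic_percent, symbolic_percent[1:])]

zero_changes = sum(1 for c in changes if c == 0)
print(len(changes), zero_changes)  # 12 7
```

Under that assumption, 7 of the 12 year-to-year changes for this district are exactly zero, purely because the same per-Congress value is repeated across years.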

---

For another thing, the "inline_calculations.R" file in the Dietrich and Hayes 2022 replication materials indicates that Black turnout values were based on CCES surveys and indicates that survey sample sizes might be very low for some congressional districts. The file describes a bootstrapping process that was used to produce the Black turnout values, which were then standardized to range from 0 to 1, but, from the description, I'm not sure how that standardization process works.

For instance, if, in one year, the CCES had 2 Black participants for a certain congressional district and neither voted (0% turnout), and the next year was a presidential election year in which the CCES had 3 Black participants in that district and all three voted (100% turnout), I'm not sure what the bootstrapping process would do to adjust that to get these congressional district Black turnout estimates to be closer to their true values, which presumably are between 0% and 100%. For what it's worth, of the 4,373 rows in the dataset, black_turnout is NA in 545 rows (12%), is 0 in 281 rows (6%), and is 1 in 1,764 rows (40%).

So I'm not sure how the described bootstrapping process adequately addresses the concern that the range of Black turnout values for a congressional district in the dataset is more extreme than the range of true Black turnout values for the congressional district. Maybe the standardization process addresses this in a way that I don't understand, so that 0 and 1 for black_turnout don't represent 0% turnout and 100% turnout, but, if that's the case, then I'm not sure how it would be justified for Dietrich and Hayes 2022 to interpret the aforementioned output of 65.67 as a 65.67 percentage-point increase.
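
To illustrate the concern, here is a minimal Python sketch of a generic nonparametric bootstrap applied to the hypothetical district above. This is an assumption about what a bootstrap of this kind could look like, not the procedure in "inline_calculations.R"; the function name and parameters are made up for illustration.

```python
import random

def bootstrap_turnout(votes, n_boot=1000, seed=0):
    """Average turnout across bootstrap resamples for one district-year.

    votes: list of 0/1 indicators for validated turnout (hypothetical data).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample respondents with replacement, keeping the sample size fixed.
        resample = [rng.choice(votes) for _ in votes]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)

print(bootstrap_turnout([0, 0]))     # 0.0
print(bootstrap_turnout([1, 1, 1]))  # 1.0
```

Resampling from a sample in which everyone has the same value can only return that value, so a bootstrap of this kind would leave the 0% and 100% estimates unchanged.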

---

NOTES

1. Dietrich and Hayes 2022 indicated that, in the survey experiment, participants were asked "to evaluate a representative on the basis of his or her floor speech", and Dietrich and Hayes 2022 indicated that the experimental manipulation for the representative's race involved "accompanying images of either a white or a Black representative". But the use of "his or her" makes me curious if the representative's gender was also experimentally manipulated.

2. Dietrich and Hayes 2022 Figure 1 reports [approval of the representative in the condition involving Civil Rights Movement symbolism in a speech about civil rights] in the same panel as [approval of the representative in the condition involving Civil Rights symbolism in a speech about renewable energy]. However, for assessing a penalty for use of Civil Rights Movement symbolism in the renewable energy speech, I think that it is more appropriate to compare [approval of the representative in the condition in which the renewable energy speech used Civil Rights Movement symbolism] to [approval of the representative in the condition in which the renewable energy speech did not use Civil Rights Movement symbolism].

If there is a penalty for using Civil Rights Movement symbolism in the speech about renewable energy, that penalty can be compared to the difference in approval between using and not using Civil Rights Movement symbolism in the speech about civil rights, to see whether the penalty in the renewable energy speech condition reflects a generalized penalty for the use of Civil Rights Movement symbolism.

3. On June 27, I emailed Dr. Dietrich and Dr. Hayes a draft of this blog post with an indication that "I thought that, as a courtesy, I would send the draft to you, if you would like to indicate anything in the draft that is unfair or incorrect". I have not yet received a reply, although it's possible that I used incorrect email addresses or my email went to a spam box.

I'll hopefully at some point write a summary that refers to a lot of my "comments" posts. But I have at least a few more to release before then, so here goes...

---

Politics, Groups, and Identities recently published Peay and McNair II 2022 "Concurrent pressures of mass protests: The dual influences of #BlackLivesMatter on state-level policing reform adoption". Peay and McNair II 2022 reported regressions that predicted a count of the number of police reform policies enacted by a state from August 2014 through 2020, using a key predictor of the number of Black Lives Matter protests in a state in the year after the killing of Michael Brown in August 2014.

An obvious concern is that the number of protests in a state is capturing the population size of the state. That's a concern because it's plausible that higher-population states have legislatures that are more active than lower-population states, so that we would expect these higher-population states to tend to enact more policies in general, and not merely to enact more police reform policies. But the Peay and McNair II 2022 analysis does not control for the population size of the state.

I checked the correlation between [1] the number of Black Lives Matter protests in a state in the year after the killing of Michael Brown in August 2014 (data from Trump et al. 2018) and [2] the first list of the number of bills enacted by a state that I happened upon, which was the number of bills a state enacted from 2006 to 2009 relating to childhood obesity. The R-squared was 0.22 for a bivariate OLS regression using the state-level count of BLM protests to predict the state-level count of childhood obesity bills enacted. In comparison, Peay and McNair II 2022 Table 2 indicated that the R-squared was 0.19 in a bivariate OLS regression that used the state-level count of BLM protests to predict the state-level count of police reform policies enacted. So the concern about population size seems at least plausible.
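
The mechanism behind this concern can be sketched with simulated data. The Python snippet below uses made-up states and made-up rates, not the Trump et al. 2018 data or the childhood obesity data: two counts that each scale with population, but are otherwise unrelated, still produce a sizable bivariate R-squared.

```python
import random

random.seed(1)

def r_squared(x, y):
    """R-squared from a bivariate OLS regression of y on x (the squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical states: population in millions.
population = [random.uniform(0.5, 40) for _ in range(50)]

# Both counts scale with population but have independent noise,
# so neither count has any direct effect on the other.
protests = [max(0, round(2.0 * p + random.gauss(0, 5))) for p in population]
bills = [max(0, round(1.5 * p + random.gauss(0, 5))) for p in population]

print(round(r_squared(protests, bills), 2))
```

The simulated R-squared is much higher than the 0.22 and 0.19 discussed above because the made-up noise is small; the point is only that a shared dependence on population is enough to generate a correlation between two counts.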

---

This is a separate concern, but Figure 6 of Peay and McNair II 2022 reports predicted probabilities, with an x-axis of the number of protests. My analysis indicated that the number of protests ranged from 0 to 87, with only three states having more than 40 protests: New York at 67, Missouri at 74, and California at 87. Yet the widest the 95% confidence interval gets in Figure 6 is about 1 percentage point, at 87, which is a pretty precise estimate given data for only 50 states and only one state past 74.

Maybe the tight 95% confidence interval is a function of the network analysis for Figure 6, if the analysis, say, treats each potential connection between California and the other 49 states as 49 independent observations. Table 2 of Peay and McNair II 2022 doesn't have a sample size for this analysis, but reports 50 as the number of observations for the other analyses in that table.

---

NOTES

1. Data for my analysis.

2. No reply yet from the authors on Twitter.

Homicide Studies recently published Schildkraut and Turanovic 2022 "A New Wave of Mass Shootings? Exploring the Potential Impact of COVID-19". From the abstract:

Results show that total, private, and public mass shootings increased following the declaration of COVID-19 as a national emergency in March of 2020.

I was curious how Schildkraut and Turanovic 2022 addressed the possible confound of the 25 May 2020 killing of George Floyd.

---

Below is my plot of data used in Schildkraut and Turanovic 2022, for total mass shootings:

My read of the plot is that, until after the killing of George Floyd, there is insufficient evidence that mass shootings were higher in 2020 than in 2019.

Table 1 of Schildkraut and Turanovic 2022 reports an interrupted time series analysis that does not address the killing of George Floyd, with a key estimate of 0.409 and a standard error of 0.072. Schildkraut and Turanovic 2022 reports a separate analysis about George Floyd...

However, since George Floyd's murder occurred after the onset of the COVID-19 declaration, we conducted ITSA using only the post-COVID time period (n = 53 weeks) and used the week of May 25, 2020 as the point of interruption in each time series. These results indicated that George Floyd's murder had no impact on changes in overall mass shootings (b = 0.354, 95% CI [−0.074, 0.781], p = .105) or private mass shootings (b = 0.125, 95% CI [−0.419, 0.669], p = .652), but that Floyd's murder was linked to increases in public mass shootings (b = 0.772, 95% CI [0.062, 1.483], p = .033).

...but Schildkraut and Turanovic 2022 does not report any attempt to assess whether there is sufficient evidence to attribute the increase in mass shootings to covid once the 0.354 estimate for Floyd is addressed. The lack of statistical significance for the 0.354 Floyd estimate can't be used to conclude "no impact", especially given that the analysis for the covid declaration had data for 52 weeks pre-declaration and 53 weeks post-declaration, but the analysis for Floyd had data for only 11 weeks pre-Floyd and 42 weeks post-Floyd.
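
The difference in precision between the two analyses can be made concrete with a little arithmetic. This Python sketch assumes that the reported 95% confidence interval for the Floyd estimate is a symmetric interval of roughly the estimate plus or minus 1.96 standard errors, which is an assumption; the article might have used a slightly different critical value.

```python
Z_95 = 1.96  # normal critical value for a two-sided 95% interval (assumed)

# Floyd interruption, overall mass shootings: 95% CI [-0.074, 0.781]
floyd_se = (0.781 - (-0.074)) / (2 * Z_95)

# Covid declaration, Table 1: standard error reported directly
covid_se = 0.072

print(round(floyd_se, 3))             # 0.218
print(round(floyd_se / covid_se, 1))  # 3.0
```

Under that assumption, the implied standard error for the Floyd estimate is about three times the standard error for the covid declaration estimate, so a non-significant 0.354 is weak evidence for "no impact".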

Schildkraut and Turanovic 2022 also disaggregated mass shootings into public mass shootings and private mass shootings. Corresponding plots by me are below. It doesn't look like the red line for the covid declaration is the break point for the increase in 2020 relative to 2019.

Astral Codex Ten discussed methods used to try to disentangle the effect of covid from the effect of Floyd, such as using for reference prior protests and other countries.

---

NOTES

1. In the Schildkraut and Turanovic 2022 data, some dates appeared in different weeks, such as 2019 Week 11 running from March 11 to March 17, but 2020 Week 11 running from March 9 to March 15.

2. The 13 March 2020 covid declaration occurred in the middle of Week 11, but the Floyd killing occurred at the start of Week 22, which ran from 25 May 2020 to 31 May 2020.

3. Data. R code for the "total" plot above.
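
The week boundaries in notes 1 and 2 above match ISO 8601 week numbering (weeks running Monday to Sunday). Whether the dataset was actually built with ISO weeks is an assumption, but the Python sketch below reproduces the dates mentioned in those notes; `iso_week_range` is a helper name introduced here.

```python
from datetime import date

def iso_week_range(year, week):
    """Return the (Monday, Sunday) dates of an ISO 8601 week."""
    return date.fromisocalendar(year, week, 1), date.fromisocalendar(year, week, 7)

for year, week in [(2019, 11), (2020, 11), (2020, 22)]:
    start, end = iso_week_range(year, week)
    print(year, week, start.isoformat(), end.isoformat())
# 2019 11 2019-03-11 2019-03-17
# 2020 11 2020-03-09 2020-03-15
# 2020 22 2020-05-25 2020-05-31

# The covid declaration: ISO week 11, weekday 5 (Friday).
print(date(2020, 3, 13).isocalendar()[1], date(2020, 3, 13).isoweekday())  # 11 5
# The Floyd killing: ISO week 22, weekday 1 (Monday).
print(date(2020, 5, 25).isocalendar()[1], date(2020, 5, 25).isoweekday())  # 22 1
```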

Suppose that Bob at time 1 believes that Jewish people are better than every other group, but Bob at time 2 changes his belief to be that Jewish people are no better or worse than every other group, and Bob at time 3 changes his belief to be that Jewish people are worse than every other group.

Suppose also that these changes in Bob's belief about Jewish people have a causal effect on his vote choices. Bob at time 1 will vote 100% of the time for a Jewish candidate running against a non-Jewish candidate, no matter the relative qualifications of the candidates. At time 2, a candidate's Jewish identity is irrelevant to Bob's vote choice, so that, if given a choice between a Jewish candidate and an all-else-equal non-Jewish candidate, Bob will flip a coin and vote for the Jewish candidate only 50% of the time. Bob at time 3 will vote 0% of the time for a Jewish candidate running against a non-Jewish candidate, no matter the relative qualifications of the candidates.

Based on this setup, what is your estimate of the influence of antisemitism on Bob's voting decisions?

---

I think that the effect of antisemitism is properly understood as the effect of negative attitudes about Jewish people, so that the effect can be estimated in the above setup as the difference between Bob's voting decisions at time 2, when Bob is indifferent to a candidate's Jewish identity, and Bob's voting decisions at time 3, when Bob has negative attitudes about Jewish people. Thus, the effect of antisemitism on Bob's voting decisions is a 50 percentage point decrease, from 50% to 0%.

For the first decrease, from 100% to 50%, neither belief -- the belief that Jewish people are better than every other group, or the belief that Jewish people are no better or worse than every other group -- is antisemitic, so none of this decrease should be attributed to antisemitism. Generally, I think that this means that respondents who have positive attitudes about a group should not be used to estimate the effect of negative attitudes about that group.

---

So let's discuss the Race and Social Problems article: Sharrow et al 2021 "What's in a Name? Symbolic Racism, Public Opinion, and the Controversy over the NFL's Washington Football Team Name". The key predictor was a measure of resentment against Native Americans, built from responses to the statements below, measured on a 5-point scale from "strongly agree" to "strongly disagree":

Most Native Americans work hard to make a living just like everyone else.

Most Native Americans take unfair advantage of privileges given to them by the government.

My analysis indicates that 39% of the 1,500 participants (N=582) provided consistently positive responses about Native Americans on both items, agreeing or strongly agreeing with the first statement and disagreeing or strongly disagreeing with the second statement. I don't see why these 582 respondents should be included in an analysis that attempts to estimate the effect of negative attitudes about Native Americans, if these participants do not fall along the indifferent-to-negative-attitudes continuum about Native Americans.
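
The classification described above can be sketched in Python. The numeric coding (1 for "strongly agree" through 5 for "strongly disagree") and the averaging of the two items into a 0-to-1 score are assumptions made for illustration, not the documented coding in the Sharrow et al 2021 data, and the function names are hypothetical.

```python
# Assumed coding for both items: 1 = strongly agree ... 5 = strongly disagree.

def consistently_positive(work_hard, unfair_advantage):
    """True if the respondent agreed (1 or 2) that most Native Americans work hard
    and disagreed (4 or 5) that most take unfair advantage of privileges."""
    return work_hard <= 2 and unfair_advantage >= 4

def resentment_score(work_hard, unfair_advantage):
    """0 = most positive response on both items, 1 = most negative on both.
    Averaging the two items is an assumption."""
    return ((work_hard - 1) / 4 + (5 - unfair_advantage) / 4) / 2

# Hypothetical respondents as (work_hard, unfair_advantage) pairs.
respondents = [(1, 5), (2, 4), (1, 3), (3, 3), (5, 1)]
kept = [r for r in respondents if not consistently_positive(*r)]

print(kept)                               # [(1, 3), (3, 3), (5, 1)]
print([resentment_score(*r) for r in kept])  # [0.25, 0.5, 1.0]
```

Under these assumptions, the lowest score left after the exclusion belongs to a respondent who was neutral on one item and gave the most positive response on the other.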

So let's check what happens after removing these respondents from the analysis.

---

I first conducted an unweighted OLS regression using the full sample and controls to predict the summary Team Name Index outcome, which measured support for the Washington football team's name placed on a 0-to-1 scale. For this regression (N=1024), the measure of resentment against Native Americans ranged from 0 for respondents who selected the most positive responses to both resentment items to 1 for respondents who selected the most negative responses to both resentment items. In this regression, the coefficient was 0.26 (t=6.31) for resentment against Native Americans.

I then removed respondents who provided positive responses about Native Americans for both resentment items. For this next unweighted OLS regression (N=572), the measure of resentment against Native Americans still had a value of 1 for respondents who provided the most negative responses to both resentment items; however, 0 was for participants who were neutral on one resentment item but provided the most positive response on the other resentment item, such as strongly agreeing that "Most Native Americans work hard to make a living just like everyone else" but neither agreeing nor disagreeing that "Most Native Americans take unfair advantage of privileges given to them by the government". In this regression, the coefficient was 0.12 (t=2.23) for resentment against Native Americans.

The drop is similar when the regressions include only the measure of resentment against Native Americans and no other predictors: the coefficient is 0.44 for the full sample, but is 0.22 after dropping respondents who provided positive responses about Native Americans for both resentment items.

---

So I think that Sharrow et al 2021 might report substantial overestimates of the effect of resentment of Native Americans, because the estimates in Sharrow et al 2021 about the effect of negative attitudes about Native Americans included the effect of positive attitudes about Native Americans.

---

NOTES

1. About 20% of the Sharrow et al 2021 sample reported a negative attitude on at least one of the two measures of resentment against Native Americans. About 6% of the sample reported a negative attitude on both measures of resentment against Native Americans.

2. Sharrow et al 2021 indicated that "Our conclusions illustrate that symbolic racism toward Native Americans is central to interpreting the public's resistance toward changing the name, in sharp contrast to Snyder's claim that the name is about 'respect.'" (p. 111).

For what it's worth, the Sharrow et al 2021 data indicate that a nontrivial percentage of respondents with positive views of Native Americans somewhat or strongly disagreed with the claim that the Washington football team name is offensive (in an item that reported the name of the team at the time): 47% of respondents who provided positive responses about Native Americans for both resentment items, 47% of respondents who rated Native Americans at 100 on a 0-to-100 feeling thermometer, 40% of respondents who provided positive responses about Native Americans for both resentment items and rated Native Americans at 100 on a 0-to-100 feeling thermometer, and 32% of respondents who provided the most positive responses about Native Americans for both resentment items and rated Native Americans at 100 on a 0-to-100 feeling thermometer (although this 32% was only 22% in unweighted analyses).

3. Sharrow et al 2021 indicated a module sample of 1,500, but the sample size fell to 1,024 in model 3 of Table 1. My analysis indicates that this is largely due to missing values on the outcome variable (N=1,362), the NFL sophistication index (N=1,364), and the measure of resentment of Native Americans (N=1,329).

4. Data for my analysis. Stata code and output.

5. Social Science Quarterly recently published Levin et al 2022 "Validating and testing a measure of anti-semitism on support for QAnon and vote intention for Trump in 2020", which also has the phenomenon of estimating the effect of negative attitudes about a target group but not excluding participants who favor the target group.

The American Political Science Review recently published a letter: Stephens-Dougan 2022 "White Americans' reactions to racial disparities in COVID-19".

Figure 1 of the Stephens-Dougan 2022 APSR letter reports results for four outcomes among racially prejudiced Whites, with the 84% confidence interval in the control overlapping with the 84% confidence interval in the treatment for only one of the four reported outcomes (zooming in on Figure 1, the confidence intervals for the parks outcome don't seem to overlap, and the code returns 0.1795327 for the upper bound for the control and 0.18800818 for the lower bound for the treatment). And results for the most obviously overlapping 84% confidence intervals seem to be interpreted as sufficient evidence of an effect, with all four reported outcomes discussed in the passage below:

When racially prejudiced white Americans were exposed to the racial disparities information, there was an increase in the predicted probability of indicating that they were less supportive of wearing face masks, more likely to feel their individual rights were being threatened, more likely to support visiting parks without any restrictions, and less likely to think African Americans adhere to social distancing guidelines.
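
For context on why 84% confidence intervals are used for this kind of comparison, here is a minimal Python sketch of the usual rationale. It assumes two independent estimates with equal standard errors, which is a simplification that need not hold for the estimates in Figure 1.

```python
from math import sqrt
from statistics import NormalDist

# A two-sided 84% interval leaves 8% in each tail.
z84 = NormalDist().inv_cdf(0.92)

# With equal standard errors (SE), two 84% intervals just touch when the
# estimates differ by 2 * z84 * SE. The SE of the difference is sqrt(2) * SE,
# so the z statistic for the difference at that point is:
z_diff = 2 * z84 / sqrt(2)

print(round(z84, 3))     # 1.405
print(round(z_diff, 2))  # 1.99

# The parks outcome bounds returned by the code for the APSR letter:
control_upper = 0.1795327
treatment_lower = 0.18800818
print(treatment_lower > control_upper)  # True: the intervals do not overlap
```

So, under those assumptions, non-overlapping 84% confidence intervals correspond roughly to a two-tailed p<0.05 difference, and overlapping 84% confidence intervals correspond roughly to a difference that does not reach that threshold.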

---

There are at least three things to keep track of: [1] the APSR letter; [2] the survey questionnaire, located at the OSF site for the Time-sharing Experiments for the Social Sciences project; and [3] the pre-analysis plan, located at the OSF and in the appendix of the APSR article. I'll use the PDF of the pre-analysis plan. The TESS site also has the proposal for the survey experiment, but I won't discuss that in this post.

---

The pre-analysis plan does not mention all potential outcome variables that are in the questionnaire, but the pre-analysis plan section labeled "Hypotheses" includes the passage below:

Specifically, I hypothesize that White Americans with anti-Black attitudes and those White Americans who attribute racial disparities in health to individual behavior (as opposed to structural factors), will be more likely to disagree with the following statements:

The United States should take measures aimed at slowing the spread of the coronavirus while more widespread testing becomes available, even if that means many businesses will have to stay closed.

It is important that people stay home rather than participating in protests and rallies to pressure their governors to reopen their states.

I also hypothesize that White Americans with anti-Black attitudes and who attribute racial health disparities to individual behavior will be more likely to agree with the following statements:

State and local directives that ask people to "shelter in place" or to be "safer at home" are a threat to individual rights and freedom.

The United States will take too long in loosening restrictions and the economic impact will be worse with more jobs being lost

The four outcomes mentioned in the passage above correspond to items Q15, Q18, Q16, and Q21 in the survey questionnaire, but, of these four outcomes, the APSR letter reported on only Q16.

The outcome variables in the APSR letter are described as: "Wearing facemasks is not important", "Individual rights and freedom threatened", "Visit parks without any restrictions", and "Black people rarely follow social distancing guidelines". These outcome variables correspond to survey questionnaire items Q20, Q16, Q23A, and Q22A.
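
The overlap between the two sets of questionnaire items listed above can be tallied with a short Python sketch, using only the item labels from this post:

```python
# Items named in the "Specifically, I hypothesize" passage of the pre-analysis plan.
preregistered = {"Q15", "Q18", "Q16", "Q21"}
# Items behind the four outcomes reported in the APSR letter.
reported = {"Q20", "Q16", "Q23A", "Q22A"}

print(sorted(preregistered & reported))  # ['Q16']
print(sorted(preregistered - reported))  # ['Q15', 'Q18', 'Q21']
print(sorted(reported - preregistered))  # ['Q20', 'Q22A', 'Q23A']
```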

---

The pre-analysis plan PDF mentions moderators, with three moderators about racial dispositions: racial resentment, negative stereotype endorsement, and attributions for health disparities. The plan indicates that:

For racial predispositions, we will use two or three bins, depending on their distributions. For ideology and party, we will use three bins. We will include each bin as a dummy variable, omitting one category as a baseline.

The APSR letter reported on only one racial predispositions moderator: negative stereotype endorsement.

---

I'll post a link in the notes below to some of my analyses about the "Specifically, I hypothesize" outcomes, but I don't want to focus on the results, because I wanted this post to focus on deviations from the pre-analysis plan. Regardless of whether the estimates from the analyses in the APSR letter are similar to the estimates from the planned analyses in the pre-analysis plan, I think that it's bad that readers can't trust the APSR to ensure that a pre-analysis plan is followed or at least to provide an explanation about why a pre-analysis plan was not followed. That's especially so given that this APSR letter described itself as reporting on "a preregistered survey experiment" and included the pre-analysis plan in the appendix.

---

NOTES

1. The Stephens-Dougan 2022 APSR letter suggests that the negative stereotype endorsement variable was coded dichotomously ("a variable indicating whether the respondent either endorsed the stereotype that African Americans are less hardworking than whites or the stereotype that African Americans are less intelligent than whites"), but the code and the appendix of the APSR letter indicate that the negative stereotype endorsement variable was measured so that the highest level is for respondents who reported a negative relative stereotype about Blacks for both stereotypes. From Table A7:

(unintelligentstereotype2 + lazystereotype2)/2

In the data after running the code for the APSR letter, the negative stereotype endorsement variable is a three-level variable coded 0 for respondents who did not report a negative relative stereotype about Blacks for either stereotype, 0.5 for respondents who reported a negative stereotype about Blacks for one stereotype, and 1 for respondents who reported a negative relative stereotype about Blacks for both stereotypes.

2. The APSR letter indicated that:

The likelihood of racially prejudiced respondents in the control condition agreeing that shelter-in-place orders threatened their individual rights and freedom was 27%, compared with a likelihood of 55% in the treatment condition (p < 0.05 for a one-tailed test).

My analysis using survey weights got 44% and 29% among participants who reported a negative relative stereotype about Blacks for at least one of the two stereotype items, and my analysis got 55% and 26% among participants who reported negative relative stereotypes about Blacks for both stereotype items, with a trivial overlap in 84% confidence intervals.

But the 55% and 26% in a weighted analysis were 43% and 37% in an unweighted analysis with a large overlap in 84% confidence intervals, suggesting that at least some of the results in the APSR letter might be limited to the weighted analysis. I ran the code for the APSR letter removing the weights from the glm command and got the revised Figure 1 plot below. The error bars in the APSR letter are described as 84% confidence intervals.

I think that it's fine to favor the weighted analysis, but I'd prefer that publications indicate when results from an experiment are not robust to the application or non-application of weights. Relevant publication.

3. Given the results in my notes [1] and [2], maybe the APSR letter's Figure 1 estimates are for only respondents who reported a negative relative stereotype about Blacks for both stereotypes. If so, the APSR letter's suggestion that this population is the 26% that reported anti-Black stereotypes for either stereotype might be misleading, if the Figure 1 analyses were estimated for only the 10% that reported a negative relative stereotype about Blacks for both stereotypes.

For what it's worth, the R code for the APSR letter has code that doesn't use the 0.5 level of the negative stereotype endorsement variable, such as:

# Below are code for predicted probabilities using logit model

# Predicted probability "individualrights_dichotomous"

# Treatment group, negstereotype_endorsement = 1

p1.1 <- invlogit(coef(glm1)[1] + coef(glm1)[2] * 1 + coef(glm1)[3] * 1 + coef(glm1)[4] * 1)

It's possible to see what happens to the Figure 1 results when the negative stereotype endorsement variable is coded 1 for respondents who endorsed at least one of the stereotypes. Run this at the end of the Stata code for the APSR letter:

replace negstereotype_endorsement = ceil((unintelligentstereotype2 + lazystereotype2)/2)

Then run the R code for the APSR letter. Below is the plot I got for a revised Figure 1, with weights applied and the sample limited to respondents who endorsed at least one of the stereotypes:

Estimates in the figure above were close to estimates in my analysis using these Stata commands after running the Stata code from the APSR letter. Stata output.

4. Data, Stata code, and Stata output for my analysis about the "Specifically, I hypothesize" passage of the Stephens-Dougan pre-analysis plan.

My analysis in the Stata output had seven outcomes: the four outcomes mentioned in the "Specifically, I hypothesize" part of the pre-analysis plan as initially measured (corresponding to questionnaire items Q15, Q18, Q16, and Q21), with no dichotomization of five-point response scales for Q15, Q18, and Q16; two of these outcomes (Q15 and Q16) dichotomized as mentioned in the pre-analysis plan (e.g., "more likely to disagree" was split into disagree / not disagree categories, with the not disagree category including respondent skips); and one outcome (Q18) dichotomized so that one category has "Not Very Important" and "Not At All Important" and the other category has the other responses and skips, given that the pre-analysis plan had this outcome dichotomized as disagree but response options in the survey were not on an agree-to-disagree scale. Q21 was measured as a dichotomous variable.

The analysis was limited to presumed racially prejudiced Whites, because I think that that's what the pre-analysis plan hypotheses quoted above focused on. Moreover, that analysis seems more important than a mere difference between groups of Whites.

Note that, for at least some results, a p<0.05 treatment effect might be in the unintuitive direction, so be careful before interpreting a p<0.05 result as evidence for the hypotheses.

My analyses aren't the only analyses that can be conducted, and it might be a good idea to combine results across outcomes mentioned in the pre-analysis plan or across all outcomes in the questionnaire, given that the questionnaire had at least 12 items that could serve as outcome variables.

For what it's worth, I wouldn't be surprised if a lot of people who respond to survey items in an unfavorable way about Blacks backlashed against a message about how Blacks were more likely than Whites to die from covid-19.

5. The pre-analysis plan included a footnote that:

Given the results from my pilot data, it is also my expectation that partisanship will moderate the effect of the treatment or that the treatment effects will be concentrated among Republican respondents.

Moreover, the pre-analysis plan indicated that:

The condition and treatment will be blocked by party identification so that there are roughly equal numbers of Republicans and Democrats in each condition.

But the lone mention of "Repub-" in the APSR letter is:

The sample was 39% self-identified Democrats (including leaners) and 46% self-identified Republicans (including leaners).

6. Link to tweets about the APSR letter.

Broockman 2013 "Black politicians are more intrinsically motivated to advance Blacks' interests: A field experiment manipulating political incentives" reported results from an experiment in which U.S. state legislators were sent an email from "Tyrone Washington", which is a name that suggests that the email sender is Black. The experimental manipulation was that "Tyrone" indicated that the city that he lived in was a city in the legislator's district or was a well-known city far from the legislator's district.

Based on Table 2 column 2, response percentages were:

  • 56.1% from in-district non-Black legislators
  • 46.4% from in-district Black legislators (= 0.561 - 0.097)
  • 28.6% from out-of-district non-Black legislators (= 0.561 - 0.275)
  • 41.4% from out-of-district Black legislators (= 0.561 - 0.275 + 0.128)

---

Broockman 2013 lacked another emailer to serve as comparison for response rates to Tyrone, such as an emailer with a stereotypical White name. Broockman 2013 discusses this:

One challenge in designing the experiment was that there were so few black legislators in the United States (as of November 2010) that a set of white letter placebo conditions could not be implemented due to a lack of adequate sample size.

So all emails in the Broockman 2013 experiment were signed "Tyrone Washington".

---

But here is how Broockman 2013 was cited by Rhinehart 2020 in American Politics Research:

A majority of this work has explored legislator responsiveness by varying the race or ethnicity of the email sender (Broockman, 2013;...

---

Costa 2017 in the Journal of Experimental Political Science:

As for variables that do have a statistically significant effect, minority constituents are almost 10 percentage points less likely to receive a response than non-minority constituents (p < 0.05). This is consistent with many individual studies that have shown requests from racial and ethnic minorities are given less attention overall, and particularly when the recipient official does not share their race (Broockman, 2013;...

But Broockman 2013 didn't vary the race of the requester, so I'm not sure of the basis for the suggestion that Broockman 2013 provided evidence that requests from racial and ethnic minorities are given less attention overall.

---

Mendez and Grose 2018 in Legislative Studies Quarterly:

Others argue or show, through experimental audit studies, that political elites have biases toward minority constituents when engaging in nonpolicy representation (e.g.,Broockman 2013...

I'm not sure how Broockman 2013 permits an inference of political elite bias toward minority constituents, when the only constituent was Tyrone.

---

Lajevardi 2018 in Politics, Groups, and Identities:

Audit studies have previously found that public officials are racially biased in whether and how they respond to constituent communications (e.g., Butler and Broockman 2011; Butler, Karpowitz, and Pope 2012; Broockman 2013;...

---

Dinesen et al 2021 in the American Political Science Review:

In the absence of any extrinsic motivations, legislators still favor in-group constituents (Broockman 2013), thereby indicating a role for intrinsic motivations in unequal responsiveness.

Again, Tyrone was the only constituent in Broockman 2013.

---

Hemker and Rink 2017 in the American Journal of Political Science:

White officials in both the United States and South Africa are more likely to respond to requests from putative whites, whereas black politicians favor putative blacks (Broockman 2013, ...

---

McClendon 2016 in the Journal of Experimental Political Science:

Politicians may seek to favor members of their own group and to discriminate against members of out-groups (Broockman, 2013...

---

Gell-Redman et al 2018 in American Politics Research:

Studies that explore other means of citizen and legislator interaction have found more consistent evidence of bias against minority constituents. Notably, Broockman (2013) finds that white legislators are significantly less likely to respond to black constituents when the political benefits of doing so were diminished.

But the only constituent was Tyrone, so you can't properly infer bias against Tyrone or minority constituents more generally, because the experiment didn't indicate whether the out-of-district drop-off for Tyrone differed from the out-of-district drop-off for a putative non-Black emailer.

---

Broockman 2014 in the American Journal of Political Science:

Outright racial favoritism among politicians themselves is no doubt real (e.g., Broockman 2013b;...

But who was Tyrone favored more than or less than?

---

Driscoll et al 2018 in the American Journal of Political Science:

Broockman (2013) finds that African American state legislators expend more effort to improve the welfare of black voters than white state legislators, irrespective of whether said voters reside in their districts.

Even ignoring the added description of the emailer as a "voter", response rates to Tyrone were not "irrespective" of district residence. Broockman 2013 even plotted data for the matched case analysis, in which the bar for in-district Black legislators was not longer than the bar for in-district non-Black legislators:

---

Shoub et al 2020 in the Journal of Race, Ethnicity, and Politics:

Black politicians are more likely to listen and respond to black constituents (Broockman 2013),...

The prior context in Shoub et al 2020 suggests that the "more likely" comparison is to non-Black politicians, but this description loses the complication in which Black legislators were not more likely than non-Black legislators to respond to in-district Tyrone, which is especially important if we reasonably assume that in-district Tyrone was perceived to be a constituent and out-of-district Tyrone wasn't. Same problem with Christiani et al 2021 in Politics, Groups, and Identities:

Black politicians are more likely to listen and respond to black constituents than white politicians (Broockman 2013)...

The similar phrasing for the above two passages might be due to the publications having the same group of authors: Shoub, Epp, Baumgartner, Christiani, and Roach for the first, and Christiani, Shoub, Baumgartner, Epp, and Roach for the second.

---

Gleason and Stout 2014 in the Journal of Black Studies:

Recent experimental studies conducted by Butler and Broockman (2011) and Broockman (2013) confirm these findings. These studies show that Black elected officials are more likely to help co-racial constituents in and outside of their districts gain access to the ballot more than White elected officials.

This passage, from what I can tell, describes both citations incorrectly. In Broockman 2013, Tyrone was asking for help getting unemployment benefits, not for help gaining access to the ballot. And I'm not sure what the basis is for the "in...their districts" claim, given that in-district response rates were 56.1% from non-Black legislators and 46.4% from Black legislators. The Butler and Broockman 2011 appendix reports results such as DeShawn receiving responses from 41.9%, 22.4%, and 44.0% of Black Democrat legislators when DeShawn respectively asked about a primary, a Republican primary, and a Democratic primary and, respectively, from 54.3%, 56.1%, and 62.1% of White Democrat legislators.

But checking citations to Butler and Broockman 2011 would be another post.

---

NOTES

1. The above isn't a systematic analysis of citations of Broockman 2013, so no strong inferences should be made about the percentage of times Broockman 2013 was cited incorrectly, other than maybe too often, especially in these journals.

2. I think that, for the Broockman 2013 experiment, a different email could have been sent from a putative White person, without sample size concerns. Imagine that "Billy Bob" emailed each legislator asking for help with, say, welfare benefits. If, like with Tyrone, Black legislator response rates were similar for in-district Billy Bob and for out-of-district Billy Bob, that would provide a strong signal to not attribute the similar rates to an intrinsic motivation to advance Blacks' interests. But if the out-of-district drop-off in Black legislator response rates was much larger for Billy Bob than for Tyrone, that would provide a strong signal to attribute the similar Black legislator response rates for in-district Tyrone and out-of-district Tyrone to an intrinsic motivation to advance Blacks' interests.

3. I think that the error bars in Figure 1 above might be 50% confidence intervals, given that the error bars seem to match the Stata command "reg code_some treat_out treatXblack leg_black [iweight=cem_weights], level(50)" that I ran on the Broockman 2013 data after line 17 in the Stata do file.

4. I shared this post with David Broockman, who provided the following comments:

Hi LJ,

I think you're right that some of these citations are describing my paper incorrectly and probably meant to cite my 2011 paper with Butler. (FWIW, in that study, we find legislators of all races seem to just discriminate in favor of their race, across both parties, so some of the citations don't really capture that either....)

The experiment would definitely be better with a white control, there was just a bias-variance trade-off here -- adding a putative race of constituent factor in the experiment would mean less bias but more variance. I did the power calculations and didn't think the experiment would be well-powered enough if I made the cells that small and were looking for a triple interaction between legislator race X letter writer putative race X in vs. out of district. In the paper I discuss a few alternative explanations that the lack of a white letter introduces and do some tests for them (see the 3 or 4 paragraphs starting with "One challenge..."). Essentially, I didn't see any reason why we should expect black legislators to just be generically less sensitive to whether a person is in their district, especially given in our previous paper we found they reacted pretty strongly to the race of the email sender (so it's not like the black legislators who do respond to emails just don't read emails carefully). Still, I definitely still agree with what I wrote then that this is a weakness of the study. It would be nice for someone to replicate this study, and I like the idea you have in footnote 2 for doing this. Someone should do that study!

Research involves a lot of decisions, which in turn provides a lot of opportunities for research to be incorrect or substandard, such as mistakes in recoding a variable, not using the proper statistical method, or not knowing unintuitive elements of statistical software such as how Stata treats missing values in logical expressions.

Peer and editorial review provides opportunities to catch flaws in research, but some journals that publish political science don't seem to be consistently doing a good enough job at this. Below, I'll provide a few examples that I happened upon recently and then discuss potential ways to help address this.

---

Feinberg et al 2022

PS: Political Science & Politics published Feinberg et al 2022 "The Trump Effect: How 2016 campaign rallies explain spikes in hate", which claims that:

Specifically, we established that the words of Donald Trump, as measured by the occurrence and location of his campaign rallies, significantly increased the level of hateful actions directed toward marginalized groups in the counties where his rallies were held.

After Feinberg et al published a similar claim in the Monkey Cage in 2019, I asked the lead author about the results when the predictor of hosting a Trump rally is replaced with a predictor of hosting a Hillary Clinton rally.

I didn't get a response from Ayal Feinberg, but Lilley and Wheaton 2019 reported that the point estimate for the effect on the count of hate-motivated events is larger for hosting a Hillary Clinton rally than for hosting a Donald Trump rally. Remarkably, the Feinberg et al 2022 PS article does not address the Lilley and Wheaton 2019 claim about Clinton rallies, even though the supplemental file for the Feinberg et al 2022 PS article discusses a different criticism from Lilley and Wheaton 2019.

The Clinton rally counterfactual is an obvious way to assess the claim that something about Trump increased hate events. Even if the reviewers and editors for PS didn't think to ask about the Clinton rally counterfactual, that counterfactual analysis appears in the Reason magazine criticism that Feinberg et al 2022 discusses in its supplemental files, so the analysis was presumably available to the reviewers and editors.

Will May has published a PubPeer comment discussing other flaws of the Feinberg et al 2022 PS article.

---

Christley 2021

The impossible "p < .000" appears eight times in Christley 2021 "Traditional gender attitudes, nativism, and support for the Radical Right", published in Politics & Gender.

Moreover, Christley 2021 indicates that (emphasis added):

It is also worth mentioning that in these data, respondent sex does not moderate the relationship between gender attitudes and radical right support. In the full model (Appendix B, Table B1), respondent sex is correlated with a higher likelihood of supporting the radical right. However, this finding disappears when respondent sex is interacted with the gender attitudes scale (Table B2). Although the average marginal effect of gender attitudes on support is 1.4 percentage points higher for men (7.3) than it is for women (5.9), there is no significant difference between the two (Figure 5).

Table B2 of Christley 2021 has 0.64 and 0.250 for the logit coefficient and standard error for the "Male*Gender Scale" interaction term, with no statistical significance asterisks; the 0.64 is the only table estimate without results reported to three decimal places, so it's not clear to me from the table if the asterisks are missing or if the estimate should be, say, 0.064 instead of 0.64. The sample size for the Table B2 regression is 19,587, so a statistically significant 1.4-percentage-point difference isn't obviously out of the question, from what I can tell.
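The two readings of the Table B2 interaction term imply very different test results. The sketch below is a rough check using a normal approximation and only the coefficient and standard error quoted above; the function name is hypothetical, and the 0.064 value is only the "say" alternative mentioned above, not a figure from the article.

```python
from statistics import NormalDist

def two_sided_p(estimate, std_error):
    """Two-sided p-value for estimate / std_error under a normal approximation."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

STD_ERROR = 0.250  # standard error as printed in Table B2

p_as_printed = two_sided_p(0.64, STD_ERROR)   # coefficient as printed
p_if_typo = two_sided_p(0.064, STD_ERROR)     # hypothetical corrected coefficient

print(f"0.64 / 0.250:  p = {p_as_printed:.3f}")
print(f"0.064 / 0.250: p = {p_if_typo:.3f}")
```

Under these assumptions, the coefficient as printed would fall below p=0.05, and the hypothetical 0.064 would not, which is why the missing asterisks matter.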

---

Hua and Jamieson 2022

Politics, Groups, and Identities published Hua and Jamieson 2022 "Whose lives matter? Race, public opinion, and military conflict".

Participants were assigned to a control condition with no treatment, to a placebo condition with an article about baseball gloves, or to an article about a U.S. service member being killed in combat. The experimental manipulation was the name of the service member, intended to signal race: Connor Miller, Tyrone Washington, Javier Juarez, Duc Nguyen, and Misbah Ul-Haq.

Inferences from Hua and Jamieson 2022 include:

When faced with a decision about whether to escalate a conflict that would potentially risk even more US casualties, our findings suggest that participants are more supportive of escalation when the casualties are of Pakistani and African American soldiers than they are when the deaths are soldiers from other racial–ethnic groups.

But, from what I can tell, this inference of participants being "more supportive" depending on the race of the casualties is based on differences in statistical significance when each racial condition is compared to the control condition. Figure 5 indicates a large enough overlap between confidence intervals for the racial conditions for this escalation outcome to prevent a confident claim of "more supportive" when comparing racial conditions to each other.

Figure 5 seems to plot estimates from the first column in Table C.7. The largest racial gap in estimates is between the Duc Nguyen condition (0.196 estimate and 0.133 standard error) and the Tyrone Washington condition (0.348 estimate and 0.137 standard error). So this difference in means is 0.152, and I don't think that there is sufficient evidence to infer that these estimates differ from each other. 83.4% confidence intervals would be about [0.01, 0.38] and [0.15, 0.54].
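Here is a minimal Python sketch of that comparison, using only the rounded Table C.7 values quoted above. It assumes a normal approximation and zero covariance between the two estimates. The zero-covariance assumption is mine and might not hold, because both coefficients are contrasts against the same control condition. Because the inputs are rounded, the bounds can differ slightly in the second decimal from the figures quoted above.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def interval(estimate, std_error, level):
    """Normal-approximation confidence interval at the given level (e.g. 0.834)."""
    z = norm.inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error

# (estimate, standard error) as quoted from Table C.7, column 1
nguyen = (0.196, 0.133)
washington = (0.348, 0.137)

# Direct test of the difference, assuming zero covariance (an assumption).
diff = washington[0] - nguyen[0]
se_diff = sqrt(nguyen[1] ** 2 + washington[1] ** 2)
p_diff = 2 * (1 - norm.cdf(abs(diff / se_diff)))

ci_nguyen = interval(*nguyen, level=0.834)
ci_washington = interval(*washington, level=0.834)

print(f"difference = {diff:.3f}, p = {p_diff:.2f}")
print("83.4% CI, Nguyen:    ", [round(x, 2) for x in ci_nguyen])
print("83.4% CI, Washington:", [round(x, 2) for x in ci_washington])
```

The 83.4% intervals overlap substantially and the direct test is far from p=0.05 under these assumptions, so both approaches point the same way.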

---

Walker et al 2022

PS: Political Science & Politics published Walker et al 2022 "Choosing reviewers: Predictors of undergraduate manuscript evaluations", which, for the regression predicting reviewer ratings of manuscript originality, interpreted a statistically significant -0.288 OLS coefficient for "White" as indicating that "nonwhite reviewers gave significantly higher originality ratings than white reviewers". But the table note indicates that the "originality" outcome variable is coded 1 for yes, 2 for maybe, and 3 for no, so that the "higher" originality ratings actually indicate lower ratings of originality.

Moreover, Walker et al 2022 claims that:

There is no empirical linkage between reviewers' year in school and major and their assessment of originality.

But Table 2 indicates p<0.01 evidence that reviewer major associates with assessments of originality.

And the "a", "b", and "c" notes for Table 2 are incorrectly matched to the descriptions; for example, the "b" note about the coding of the originality outcome is attached to the other outcome.

The "higher originality ratings" error has been corrected, but not the other errors. I mentioned only the "higher" error in this tweet, so maybe that explains that. It'll be interesting to see if PS issues anything like a corrigendum about "Trump rally / hate" Feinberg et al 2022, given that the flaw in Feinberg et al 2022 seems a lot more important.

---

Fattore et al 2022

Social Science Quarterly published Fattore et al 2022 "'Post-election stress disorder?' Examining the increased stress of sexual harassment survivors after the 2016 election". For a sample of women participants, the analysis uses reported experience being sexually harassed to predict a dichotomous measure of stress due to the 2016 election, net of controls.

Fattore et al 2022 Table 1 reports the standard deviation for a presumably multilevel categorical race variable that ranges from 0 to 4 and for a presumably multilevel categorical marital status variable that ranges from 0 to 2. Fattore et al 2022 elsewhere indicates that the race variable was coded 0 for white and 1 for minority, but indicates that the marital status variable is coded 0 for single, 1 for married/coupled, and 2 for separated/divorced/widowed, so I'm not sure how to interpret regression results for the marital status predictor.

And Fattore et al 2022 has this passage:

With 95 percent confidence, the sample mean for women who experienced sexual harassment is between 0.554 and 0.559, based on 228 observations. Since the dependent variable is dichotomous, the probability of a survivor experiencing increased stress symptoms in the post-election period is almost certain.

I'm not sure how to interpret that passage: Is the 95% confidence interval that thin (0.554, 0.559) based on 228 observations? Is the mean estimate of about 0.554 to 0.559 being interpreted as almost certain? Here is the paragraph that that passage is from.
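For a sense of scale, here is a rough Python sketch of the 95% confidence interval that would be expected for a proportion of about 0.556 with 228 observations. It assumes a simple random sample, no weights, and a Wald interval, and it takes the midpoint of the reported bounds as the sample mean; none of these details are confirmed by the article.

```python
from math import sqrt
from statistics import NormalDist

n = 228
p_hat = (0.554 + 0.559) / 2          # midpoint of the reported bounds (assumption)

std_error = sqrt(p_hat * (1 - p_hat) / n)
z = NormalDist().inv_cdf(0.975)      # critical value for a 95% interval

lower = p_hat - z * std_error
upper = p_hat + z * std_error
print(f"95% CI: [{lower:.3f}, {upper:.3f}], width = {upper - lower:.3f}")
```

Under these assumptions the interval is more than 0.12 wide, compared to the 0.005 width of the interval reported in the passage.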

---

Hansen and Dolan 2022

Political Behavior published Hansen and Dolan 2022 "Cross‑pressures on political attitudes: Gender, party, and the #MeToo movement in the United States".

Table 1 of Hansen and Dolan 2022 reported results from a regression limited to 694 Republican respondents in a 2018 ANES survey, which indicated that the predicted feeling thermometer rating about the #MeToo movement was 5.44 units higher among women than among men, net of controls, with a corresponding standard error of 2.31 and a statistical significance asterisk. However, Hansen and Dolan 2022 interpreted this to not provide sufficient evidence of a gender gap:

In 2018, we see evidence that women Democrats are more supportive of #MeToo than their male co-partisans. However, there was no significant gender gap among Republicans, which could signal that both women and men Republican identifiers were moved to stand with their party on this issue in the aftermath of the Kavanaugh hearings.

Hansen and Dolan 2022 indicated that this inference of no significant gender gap is because, in Figure 1, the relevant 95% confidence interval for Republican men overlapped with the corresponding 95% confidence interval for Republican women.

Footnote 9 of Hansen and Dolan 2022 noted that assessing statistical significance using overlap of 95% confidence intervals is a "more rigorous standard" than using a p-value threshold of p=0.05 in a regression model. But Footnote 9 also claimed that "Research suggests that using non-overlapping 95% confidence intervals is equivalent to using a p < .06 standard in the regression model (Schenker & Gentleman, 2001)", and I think that this "p < .06" claim is incorrect or at least misleading.

My Stata analysis of the data for Hansen and Dolan 2022 indicated that the p-value for the gender gap among Republicans on this item is p=0.019, which is about what would be expected given data in Table 1 of a t-statistic of 5.44/2.31 and more than 600 degrees of freedom. From what I can tell, the key evidence from Schenker and Gentleman 2001 is Figure 3, which indicates that the probability of a Type 1 error using the overlap method is about equivalent to p=0.06 only when the ratio of the two standard errors is about 20 or higher.
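The p-value implied by the Table 1 coefficient and standard error can be approximated without the data. This minimal Python sketch uses a normal approximation, which should be close to a t distribution with more than 600 degrees of freedom; it is a check on the arithmetic, not a rerun of the Stata analysis.

```python
from statistics import NormalDist

coefficient = 5.44   # gender gap among Republicans, from Table 1
std_error = 2.31     # corresponding standard error, from Table 1

t_stat = coefficient / std_error
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}, two-sided p = {p_value:.3f}")
```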

This discrepancy in inferences might have been avoided if 83.4% confidence intervals were more commonly taught and recommended by editors and reviewers, for visualizations in which the key comparison is between two estimates.
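To illustrate why the interval level matters, the Python sketch below computes the two-sided alpha that is implied by declaring a difference only when two confidence intervals fail to overlap. It assumes two independent, normally distributed estimates; the function name and the se_ratio parameter are hypothetical, and this is textbook arithmetic rather than a reproduction of Schenker and Gentleman 2001.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def implied_alpha(level, se_ratio=1.0):
    """Alpha implied by the non-overlap rule for two independent estimates.

    Intervals fail to overlap when |difference| > z * (se1 + se2), and the
    standard error of the difference is sqrt(se1**2 + se2**2).
    """
    z = norm.inv_cdf(0.5 + level / 2)
    z_diff = z * (1 + se_ratio) / sqrt(1 + se_ratio ** 2)
    return 2 * (1 - norm.cdf(z_diff))

print(f"95% intervals, equal SEs:   alpha = {implied_alpha(0.95):.4f}")
print(f"83.4% intervals, equal SEs: alpha = {implied_alpha(0.834):.4f}")
```

With equal standard errors, non-overlap of 95% intervals corresponds to an alpha of roughly 0.006, and non-overlap of 83.4% intervals corresponds to an alpha of roughly 0.05.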

---

Footnote 10 of Hansen and Dolan 2022 states:

While Fig. 1 appears to show that Republicans have become more positive towards #MeToo in 2020 when compared to 2018, the confidence bounds overlap when comparing the 2 years.

I'm not sure what that refers to. Figure 1 of Hansen and Dolan 2022 reports estimates for Republican men in 2018, Republican women in 2018, Republican men in 2020, and Republican women in 2020, with point estimates increasing in that order. Neither 95% confidence interval for Republicans in 2020 overlaps with either 95% confidence interval for Republicans in 2018.

---

Other potential errors in Hansen and Dolan 2022:

[1] The code for the 2020 analysis uses V200010a, which is a weight variable for the pre-election survey, even though the key outcome variable (V202183) was on the post-election survey.

[2] Appendix B Table 3 indicates that 47.3% of the 2018 sample was Republican and 35.3% was Democrat, but the sample sizes for the 2018 analysis in Table 1 are 694 for the Republican only analysis and 1001 for the Democrat only analysis.

[3] Hansen and Dolan 2022 refers multiple times to predictions of feeling thermometer ratings as predicted probabilities, and notes for Tables 1 and 2 indicate that the statistical significance asterisk is for "statistical significance at p > 0.05".
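The mismatch in item [2] can be made concrete with a short Python sketch that compares the Republican share among partisans implied by each source. It uses only the four numbers quoted above. The Appendix B percentages might be weighted or might define party identification differently than the Table 1 samples do, which I can't confirm from the article.

```python
# Sample sizes for the 2018 party-specific regressions in Table 1
n_republican, n_democrat = 694, 1001

# Percentages of the 2018 sample reported in Appendix B Table 3
pct_republican, pct_democrat = 47.3, 35.3

share_rep_table1 = n_republican / (n_republican + n_democrat)
share_rep_appendix = pct_republican / (pct_republican + pct_democrat)

print(f"Republican share of partisans, Table 1 samples: {share_rep_table1:.1%}")
print(f"Republican share of partisans, Appendix B:      {share_rep_appendix:.1%}")
```

Republicans are a minority of partisans in the Table 1 samples but a majority of partisans in the Appendix B percentages.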

---

Conclusion

I sometimes make mistakes, such as misspelling an author's name in a prior post. In 2017, I preregistered an analysis that used overlap of 95% confidence intervals to assess evidence for the difference between estimates, instead of a preferable direct test for a difference. So some of the flaws discussed above are understandable. But I'm not sure why all of these flaws got past review at respectable journals.

Some of the flaws discussed above are, I think, substantial, such as the political bias in Feinberg et al 2022 not reporting a parallel analysis for Hillary Clinton rallies, especially with the Trump rally result being prominent enough to get a fact check from PolitiFact in 2019. Some of the flaws discussed above are trivial, such as "p < .000". But even trivial flaws might justifiably be interpreted as reflecting a review process that is less rigorous than it should be.

---

I think that peer review is valuable at least for its potential to correct errors in analyses and to get researchers to report results that they otherwise wouldn't report, such as a robustness check suggested by a reviewer that undercuts the manuscript's claims. But peer review as currently practiced doesn't seem to do that well enough.

Part of the problem might be that peer review at a lot of political science journals combines [1] assessment of the contribution of the manuscript and [2] assessment of the quality of the analyses, often for manuscripts that are likely to be rejected. Some journals might benefit from having a (or having another) "final boss" who carefully reads conditionally accepted manuscripts only for assessment [2], to catch minor "p < .000" types of flaws, to catch more important "no Clinton rally analysis" types of flaws, and to suggest robustness checks and additional analyses.

But even better might be opening peer review to volunteers, who collectively could plausibly do a better job than a final boss could do alone. I discussed the peer review volunteer idea in this symposium entry. The idea isn't original to me; for example, Meta-Psychology offers open peer review. The modal number of peer review volunteers for a publication might be zero, but there is a good chance that I would have raised the "no Clinton rally analysis" criticism had PS posted a conditionally accepted version of Feinberg et al 2022.

---

Another potentially good idea would be for journals or an organization such as APSA to post at least a small set of generally useful advice, such as reporting results for a test for differences between estimates if the manuscript suggests a difference between estimates. More specific advice could be posted by topic, such as, for count analyses, advice about predicting counts in which the opportunity varies by observation: Lilley and Wheaton 2019 discussed this page, but I think that this page has an explanation that is easier to understand.

---

NOTES

1. It might be debatable whether this is a flaw per se, but Long 2022 "White identity, Donald Trump, and the mobilization of extremism" reported correlational results from a survey experiment but, from what I can tell, didn't indicate whether any outcomes differed by treatment.

2. Data for Hansen and Dolan 2022. Stata code for my analysis:

desc V200010a V202183

svyset [pw=weight]

svy: reg metoo education age Gender race income ideology2 interest media if partyid2=="Republican"

svy: mean metoo if partyid2=="Republican" & women==1

3. The journal Psychological Science is now publishing peer reviews. Peer reviews are also available for the journal Meta-Psychology.

4. Regarding the prior post about Lacina 2022 "Nearly all NFL head coaches are White. What are the odds?", Bethany Lacina discussed that with me on Twitter. I have published an update at that post.

5. I emailed or tweeted to at least some authors of the aforementioned publications discussing the planned comments or indicating at least some of the criticism. I received some feedback from one of the authors, but the author didn't indicate that I had permission to acknowledge the author.