Political Research Quarterly published Huber and Gunderson 2022 "Putting a fresh face forward: Does the gender of a police chief affect public perceptions?". Huber and Gunderson 2022 reports on a survey experiment in which, for one of the manipulations, a police chief was described as female (Christine Carlson or Jada Washington) or male (Ethan Carlson or Kareem Washington).

---

Huber and Gunderson 2022 has a section called "Heterogeneous Responses to Treatment" that reports on results that divided the sample into "high sexism" respondents and "low sexism" respondents. For example, the mean overall support for the female police chief was 3.49 among "low sexism" respondents and was 3.41 among "high sexism" respondents, with p=0.05 for the difference. Huber and Gunderson 2022 (p. 8) claims that [sic on the absence of a "to"]:

These results indicate that respondents' sexism significantly moderates their support for a female police chief and supports role congruity theory, as individuals that are more sexist should react more negatively [sic] violations of gender roles.

But, for all we know from the results reported in Huber and Gunderson 2022, "high sexism" respondents might merely rate police chiefs lower relative to how "low sexism" respondents rate police chiefs, regardless of the gender of the police chief.

Instead of the method in Huber and Gunderson 2022, a better method to test whether "individuals that are more sexist...react more negatively [to] violations of gender roles" is to estimate the effect of the male/female treatment on ratings of the police chief among "high sexism" respondents. And, to test whether "respondents' sexism significantly moderates their support for a female police chief", we can compare the results of that test to results from a corresponding test among "low sexism" respondents.
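
Here is a minimal Stata sketch of those two subgroup tests, using hypothetical variable names (a police chief rating "support", a 0/1 "femchief" treatment indicator, and a 0/1 "highsexism" indicator) rather than the actual Huber and Gunderson 2022 variable names:

* male/female treatment effect among "high sexism" respondents
ttest support if highsexism == 1, by(femchief)

* corresponding test among "low sexism" respondents, for comparison
ttest support if highsexism == 0, by(femchief)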

---

Using the data and code for Huber and Gunderson 2022, I ran the code up to the section for Table 4, which is the table about sexism. I then ran my modified version of the Huber and Gunderson 2022 code for Table 4, first among respondents whom Huber and Gunderson 2022 labeled "high sexism" (a score above 0.35 on the sexism measure) and then among respondents labeled "low sexism" (a score below 0.35).

Results are below, indicating a lack of p<0.05 evidence of a male/female treatment effect among these "high sexism" respondents, along with p<0.05 evidence of a pro-female bias among the "low sexism" respondents on all but one of the Table 4 items.

HIGH SEXISM RESPONDENTS------------------
                     Female Male
                     Chief  Chief
Domestic Violence    3.23   3.16  p=0.16
Sexual Assault       3.20   3.16  p=0.45
Violent Crime Rate   3.20   3.23  p=0.45
Corruption           3.21   3.18  p=0.40
Police Brutality     3.17   3.17  p=0.94
Community Leaders    3.33   3.31  p=0.49
Police Chief Support 3.41   3.39  p=0.52

LOW SEXISM RESPONDENTS------------------
                     Female Male
                     Chief  Chief
Domestic Violence    3.40   3.21  p<0.01
Sexual Assault       3.44   3.22  p<0.01
Violent Crime Rate   3.40   3.33  p=0.10
Corruption           3.21   3.07  p=0.01
Police Brutality     3.24   3.11  p=0.01
Community Leaders    3.40   3.32  p=0.02
Police Chief Support 3.49   3.37  p<0.01

---

There might be more of interest here, such as calculating a p-value for the difference between the treatment effect among "low sexism" respondents and the treatment effect among "high sexism" respondents (see the sketch below), and assessing whether evidence of a treatment effect is stronger among respondents further up the sexism scale than the 0.35 threshold used in Huber and Gunderson 2022.
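
For the first of those, a minimal Stata sketch using the same hypothetical variable names as above, in which the interaction term provides a p-value for the difference between the two treatment effects:

* the femchief#highsexism coefficient estimates how much the male/female
* treatment effect differs between "low sexism" and "high sexism" respondents,
* and its p-value tests that difference
regress support i.femchief##i.highsexism

* treatment effect within each sexism group, from the same model
margins, dydx(femchief) over(highsexism)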

But I at least wanted to document another example of a pro-female bias among "low sexism" respondents.


The Journal of Politics recently published Butler et al 2022 "Constituents ask female legislators to do more".

---

1. PREREGISTRATION

The relevant preregistration plan for Butler et al 2022 has an outcome that the main article does not mention, for the "Lower Approval for Women" hypothesis. Believe it or not, the Butler et al 2022 analysis didn’t find sufficient evidence in its "Lower Approval for Women" tests. So instead of reporting that in the JOP article or its abstract or its title, Butler et al mentioned the insufficient evidence in appendix C of the online supplement to Butler et al 2022.

---

2. POSSIBLE ERROR FOR THE APPROVAL HYPOTHESIS

The Butler et al 2022 online appendix indicates that the dependent variable for Table C2 is a four-point scale that was predicted using ordered probit. Table C2 reports results for four cut points, even though a four-point dependent variable should have only three cut points. The dependent variable was drawn from a 5-point scale in which the fifth point was "Not sure", so I think that someone forgot to recode the "Not sure" responses to missing.

Butler et al 2022 online appendix C indicates that:

Constituents chose among 5 response options for the question: Strongly approve, Somewhat approve, Somewhat disapprove, Strongly disapprove, Not sure.

So I think that the "Not sure" responses were coded as if being not sure was super strongly disapprove.
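
If that is what happened, the fix would be straightforward. A minimal Stata sketch, assuming a hypothetical "approval" variable coded 1 through 4 for the approval categories and 5 for "Not sure", and a hypothetical "fem_treatment" indicator:

* recode "Not sure" to missing so that it is not treated as a fifth,
* most-extreme disapproval category
recode approval (5 = .)

* ordered probit on the remaining four-point scale: the output should now
* report three cut points instead of four
oprobit approval fem_treatment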

---

3. PREREGISTRATION + RESEARCH METHOD

The image below has a tabulation of the dependent variable for the preregistered hypothesis of Butler et al 2022 that is reported in the main text, the abstract, and the title:

That's a very large percentage of zeros.

The Butler et al 2022 experiment involved male legislators and female legislators sending letters to constituents asking the constituents to complete an online survey, and, in that online survey, the legislator asked "What policy issues do you think I should work on during the current session?".

Here is a relevant passage from the Butler et al 2022 preregistration reported in the online appendix, with my emphasis added and [sic] for "...condition the code...":

Coding the Dependent Variable. This would be an open-ended question where voters could list multiple issues. We will have RAs who are blind to the hypothesis and treatment condition the code the number of issues given in the open response. We will use that number as the dependent variable. We will then an OLS regression where the DV is the number of issues and the IV is the gender treatment.

That passage seems to indicate that the dependent variable was preregistered to be a measure about what constituents provided in the open response. From what I can tell based on the original coding of the "NumberIssues" dependent variable, the RAs coded 14 zeros based on what respondents provided in the open response, out of a total of 1,203 observations. I ran the analysis on only these 1,203 observations, and the coefficient for the gender of the legislator (fem_treatment) was p=0.29 without controls and p=0.29 with controls.

But Butler et al 2022 coded the dependent variable to be zero for the 29,386 people who didn't respond to the survey at all or at least didn't respond in the open response. Converting these 29,386 observations to zero policy issues asked about produces corresponding p-values of p=0.06 and p=0.09. But it seems potentially misleading to focus on a dependent variable that conflates [1] the number of issues that a constituent asked about and [2] the probability that the constituent responded to the survey.
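
A minimal Stata sketch of the two codings, using the reported NumberIssues and fem_treatment names plus a hypothetical open_response indicator for whether the constituent provided an open response (the actual Butler et al 2022 models also include controls):

* dependent variable as preregistered: issues coded from the open responses
regress NumberIssues fem_treatment if open_response == 1

* dependent variable as analyzed, as I understand it: constituents without an
* open response coded as asking about zero issues
replace NumberIssues = 0 if open_response == 0
regress NumberIssues fem_treatment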

Table D2 of Butler et al 2022 indicates that constituents were more likely to respond to the female legislators' request to complete the online survey (p<0.05). Butler et al 2022 indicates that "Women are thus contacted more often but do not receive more requests per contact" (p. 2281). But it doesn't seem correct to describe a higher chance of responding to a female legislator's request to complete a survey as contacting female legislators more, especially if the suggestion is that the experimental results about contact initiated by the legislator apply to contact that is not initiated by the legislator.

If anything, constituents being more likely to respond to female legislator requests than male legislator requests seems like a constituent bias in favor of female legislators.

---

NOTE

1. To date, no responses to tweets about the potential error or the research method.


In a prior post, I criticized the questionnaire for the ANES 2020 Time Series Study, so I want to use this post to praise the questionnaire for the ANES 2022 Pilot Study, plus add some other comments.

---

1. The pilot questionnaire has items that ask participants to rate men and women on 0-to-100 feeling thermometers, which will permit assessment of the association between negative attitudes about women and negative attitudes about men, presuming that some of the planned 1,500 respondents express such negative attitudes.

2. The pilot questionnaire has items in which response options permit underestimation of the frequency of certain types of vote fraud, with a "Never" option for items about how often in the respondent's state [1] a voter casts more than one ballot and [2] votes are cast on behalf of dead people. That happened at least once recently in Arizona (see also https://www.heritage.org/voterfraud), so a "Never" response is a misperception, and I suspect that this misperception is currently more common on the political left.

But it doesn't seem like a good idea to phrase the vote fraud items in terms of the respondent's state, because coding a response as a misperception then requires checking evidence in 50 states. And I don't think there is an obvious threshold for overestimating how often, say, a voter casts more than one ballot: "Rarely" seems like an appropriate response for Arizona residents, but is "Occasionally" incorrect?

3. The pilot questionnaire has an item about the genuineness of emails on Hunter Biden's laptop in which Hunter Biden "contacted representatives of foreign governments about business deals". So I guess that can be a misinformation item that liberals are more likely to be misinformed about.

4. The pilot questionnaire has items about whether being White/Black/Hispanic/Asian "comes with advantages, disadvantages, or doesn't it matter". Based on the follow up item, these items might not permit respondents to select both "advantages" and "disadvantages", and, if so, it might be better to differentiate respondents who think that, for instance, being White has only advantages from respondents who think that being White has on net more advantages than disadvantages.

5. The pilot questionnaire permits respondents to report the belief that Black and Hispanic Americans have lower socioeconomic status than White Americans because of biological differences, but respondents can't report the belief that particular less positive outcomes for White Americans relative to another group are due to biological differences (e.g., average White American K12 student math performance relative to average Asian American K12 student math performance).

---

Overall, the 2022 pilot seems like an improvement. For one thing, the pilot questionnaire, as is common for the ANES, has feeling thermometers about Whites, Blacks, Hispanics, and Asians, so that it's possible to construct a measure of negative attitudes about each included racial/ethnic group. And the feeling thermometers for men and women permit construction of a measure of negative attitudes about men and about women. For another thing, respondents can report misperceptions that are presumably more common among persons on the political left. That's more than what is permitted by a lot of similar surveys.


Politics & Gender published Deckman and Cassese 2021 "Gendered nationalism and the 2016 US presidential election", which, in 2022, shared an award for the best article published in Politics & Gender the prior year.

---

1.

So what is gendered nationalism? From Deckman and Cassese 2021 (p. 281):

Rather than focus on voters' sense of their own masculinity and femininity, we consider whether voters characterized American society as masculine or feminine and whether this macro-level gendering, or gendered nationalism as we call it, had political implications in the 2016 presidential election.

So how is this characterization of American society as masculine or feminine measured? The Deckman and Cassese 2021 online appendix indicates that gendered nationalism is...

Measured with a single survey item asking whether "Society as a whole has become too soft and feminine." Responses were provided on a four-point Likert scale ranging from strongly disagree to strongly agree.

So the measure of "whether voters characterized American society as masculine or feminine" (p. 281) ranged from the characterization that American society is (too) feminine to the characterization that American society is...not (too) feminine. The "(too)" is because I suspect that respondents might interpret the "too" in "too soft and feminine" as also applying to "feminine", but I'm not sure it matters much.

Regardless, there are at least three potential relevant characterizations: American society is feminine, masculine, or neither feminine nor masculine. It seems like a poor research design to combine two of these characterizations.

---

2.

Deckman and Cassese 2021 also described gendered nationalism as (p. 278):

Our project diverges from this work by focusing on beliefs about the gendered nature of American society as a whole—a sense of whether society is 'appropriately' masculine or has grown too soft and feminine.

But disagreement with the characterization that "Society as a whole has become too soft and feminine" doesn't necessarily indicate a characterization that society is "appropriately" masculine, because a respondent could believe that society is too masculine or that society is neither feminine nor masculine.

Omission of a response option indicating a belief that American society is (too) masculine might have made it easier for Deckman and Cassese 2021 to claim that "we suppose that those who rejected gendered nationalism were likely more inclined to vote for Hillary Clinton" (p. 282), as if only the measured "too soft and feminine" characterization is acceptance of "gendered nationalism" and not the unmeasured characterization that American society is (too) masculine.

---

3.

Regression results in Table 2 of Deckman and Cassese 2021 indicate that gendered nationalism predicts a vote for Trump over Clinton in 2016, net of controls for political party, a single measure of political ideology, and demographics such as class, race, and education.

Gendered nationalism is the only specific belief in the regression, and Deckman and Cassese 2021 reports no evidence about whether "beliefs about the gendered nature of American society as a whole" have any explanatory power beyond other beliefs about gender, such as beliefs about gender roles and animus toward particular genders.

---

4.

Deckman and Cassese 2021 reported on four categories of class: lower class, working class, middle class, and upper class. Deckman and Cassese 2021 hypothesis H2 is that:

Gendered nationalism is more common among working-class men and women than among men and women with other socioeconomic class identifications.

For such situations, in which the hypothesis is that one of four categories is distinctive, the most straightforward approach is to omit the hypothesized distinctive category from the regressions, because then the p-value and coefficient for each of the three included categories indicate whether and how much that included category differs from the omitted category.

But the regressions in Deckman and Cassese 2021 omitted middle class, and, based on the middle model in Table 1, Deckman and Cassese 2021 concluded that:

Working-class Democrats were significantly more likely to agree that the United States has grown too soft and feminine, consistent with H2.

But the coefficients and standard errors were 0.57 and 0.26 for working class and 0.31 and 0.40 for lower class, so I'm not sure that the analysis in Table 1 contained enough evidence that the 0.57 estimate for working class differs from the 0.31 estimate for lower class.
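
A minimal Stata sketch of both options, with hypothetical variable names ("gendnat" for the gendered nationalism item and "class" coded 0 lower, 1 working, 2 middle, 3 upper) and without the other predictors in the Deckman and Cassese 2021 models:

* omit the hypothesized distinctive category (working class), so that each
* class coefficient is a comparison to working class
ologit gendnat ib1.class

* or keep middle class as the omitted category and directly test whether
* the working class coefficient differs from the lower class coefficient
ologit gendnat ib2.class
test 1.class = 0.class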

---

5.

I think that Deckman and Cassese 2021 might have also misdescribed the class results in the Conclusions section, in the passage below, which doesn't seem limited to Democrat participants. From p. 295:

In particular, the finding that working-class voters held distinctive views on gendered nationalism is compelling given that many accounts of voting behavior in 2016 emphasized support for Donald Trump among the (white) working class.

For that "distinctive" claim, Deckman and Cassese 2021 seemed to reference differences in statistical significance (p. 289, footnote omitted):

The upper- and lower-class respondents did not differ from middle-class respondents in their endorsement of gendered nationalism beliefs. However, people who identified as working class were significantly more likely to agree that the United States has grown too soft and feminine, though the effect was marginally significant (p = .09) in a two-tailed test. This finding supports the idea that working-class voters hold a distinctive set of beliefs about gender and responded to the gender dynamics in the campaign with heightened support for Donald Trump’s candidacy, consistent with H2.

In the Table 1 baseline model predicting gendered nationalism without interactions, ologit coefficients are 0.25 for working class and 0.26 for lower class, so I'm not sure that there is sufficient evidence that working class views on gendered nationalism were distinctive from lower class views on gendered nationalism, even though the evidence is stronger that the 0.25 working class coefficient differs from zero than the 0.26 lower class coefficient differs from zero.

It looks like the survey's pre-election wave had at least twice as many working class respondents as lower class respondents. If that ratio were similar for the post-election wave, that would explain the difference in statistical significance and explain why the standard error was smaller for the working class (0.15) than for the lower class (0.23). To check, search for "class" at the PRRI site and use the PRRI/The Atlantic 2016 White Working Class Survey.

---

6.

At least Deckman and Cassese 2021 interpreted the positive coefficient on the interaction of college and Republican as an estimate of how the association of college and the outcome among Republicans differed from the association of college and the outcome among the omitted category.

But I'm not sure of the justification for "largely" in Deckman and Cassese 2021 (p. 293):

Thus, in accordance with our mediation hypothesis (H5), gender differences in beliefs that the United States has grown too soft and feminine largely account for the gender gap in support for Donald Trump in 2016.

Inclusion of the predictor for gendered nationalism pretty much only halves the logit coefficient for "female", from 0.80 to 0.42, and, in Figure 3, the gender gap in predicted probability of a Trump vote is pretty much only cut in half, too. I wouldn't call about half "largely", especially without addressing the obvious confound of attitudes about men and women that have nothing to do with "gendered nationalism".

---

7.

Deckman and Cassese 2021 was selected for a best article award by the editorial board of Politics & Gender. From my prior posts on publications in Politics & Gender: p < .000, misinterpreted interaction terms, and an example of a difference in statistical significance being used to infer a difference in effect.

---

NOTES

1. Prior post mentioning Deckman and Cassese 2021.

2. Prior post on deviations from a preregistration plan, for Cassese and Barnes 2017.

3. "Gendered nationalism" is an example of use of a general term when a better approach would be specificity, such as a measure that separates "masculine nationalism" from "feminine nationalism". Another example is racial resentment, in which a general term is used to describe only the type of racial resentment directed at Blacks. Feel free to read through participant comments in the Kam and Burge survey, in which plenty of comments from respondents who score low on the racial resentment scale indicate resentment directed at Whites.


I reached ten new publications to comment on that I didn't think were worth a separate blog post, so here goes:

---

1.

The Twitter account for the journal Politics, Groups, and Identities retweeted R.G. Cravens linking to two of his articles in Politics, Groups, and Identities. I blogged about one of these articles, discussing, among other things, the article's erroneous interpretation of interaction terms. The other article that R.G. Cravens linked to in that tweet ("The view from the top: Social acceptance and ideological conservatism among sexual minorities") also misinterpreted an interaction term:

However, the coefficient estimate for the interaction term between racial minority identity and racial identity group consciousness (β = −.312, p = .000), showing the effect of racial identity group consciousness only among racial minority respondents, indicates a negative relationship between racial minority group consciousness and conservatism at the 99% confidence level.

The corresponding Table 1 coefficient for RI Consciousness is 0.117, indicating the estimated effect of racial identity consciousness when the "Racial minority" variable is set to zero. The -0.312 interaction term indicates how much the estimated effect of racial identity consciousness *differs* between non-racial minorities and racial minorities, so that the estimated effect of racial identity consciousness among racial minorities is 0.117 plus -0.312, which is -0.195.
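
In Stata terms, here is a minimal sketch of how to recover that -0.195 estimate and its uncertainty after an interaction model, using hypothetical variable names ("conservatism", "ri_consciousness", "racialminority") rather than the actual Cravens variables:

regress conservatism c.ri_consciousness##i.racialminority

* effect of racial identity consciousness within each group: among racial
* minorities it is the main effect plus the interaction (0.117 + -0.312 = -0.195)
margins, dydx(ri_consciousness) at(racialminority=(0 1))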

Two articles by one author in the same journal within three years, and each article misinterpreted an interaction term.

---

2.

PS: Political Science & Politics published another article about student evaluations of teaching: Foster 2022 "Instructor name preference and student evaluations of instruction". The key finding seems plausible, that "SEIs were higher for instructors who preferred going by their first name...than for instructors who preferred going by 'Dr. Moore'" (p. 4).

But there are a few shortcomings in the reporting on the experiment in Study 2, which manipulated the race of the instructor, the gender of the instructor, and the instructor's stated preference for using his/her first name versus using his/her title and last name:

* Hypothesis 5 is about conservative Republicans:

Moderated mediation: We predict that female instructors who express a preference for going by "Dr. Moore" will have lower teacher ratings through decreased perceived immediacy, but only for students who identify as conservative and Republican.

But, as far as I can tell, the article doesn't report any data about Hypothesis 5.

* Table 2 indicates a positive p<0.05 correlation between the race of the instructor and SEIs (student evaluations of instruction) and a positive p<0.05 correlation between the race of the instructor and course evaluations. But, as far as I can tell, the article doesn't report how the race variable was coded, so it's not clear whether the White instructors or the Black instructors had the higher SEIs and course evaluations.

* The abstract indicates that:

Study 2 found the highest SEIs for Black male instructors when instructors asked students to call them by their first name, but there was a decrease in SEI scores if they went by their professional title.

But, as far as I can tell, the article doesn't report sufficient evidence about whether the estimated influence of the name preference among the Black male instructor targets differed from the estimated influence of the name preference among any of the comparison instructors. The p-value being under p=0.05 for the Black male instructor targets and not being under p=0.05 for the other instructor targets isn't enough evidence to infer at p<0.05 that participants treated the Black male instructor targets differently than participants treated the comparison instructor targets, so that the article doesn't report sufficient evidence to permit an inference of racial discrimination.

---

5.

I wasn't the only person to notice this next one (see tweets from Tom Pepinsky and Brendan Nyhan), but Politics & Gender recently published Forman-Rabinovici and Mandel 2022 "The prevalence and implications of gender blindness in quantitative political science research", which indicated that:

Our findings show that gender-sensitive analysis yields more accurate and useful results. In two out of the three articles we tested, gender-sensitive analysis indeed led to different outcomes that changed the ramifications for theory building as a result.

But the inferential technique in the analysis reflected a common error.

For the first of the three aforementioned articles (Gandhi and Ong 2019), Table 1a of Forman-Rabinovici and Mandel 2022 reported results with a key coefficient that was -.308 across the sample, was -.294 (p=.003) among men in the sample, and was -.334 (p=.154) among women in the sample. These estimates are from a linear probability model predicting a dichotomous "Support PH" outcome, so the point estimates correspond to decreases of about 29 percentage points among men and 33 percentage points among women.

The estimate was more extreme among women than among men, but the estimate was less precise among women than among men, at least partly because the sample size among men (N=1902) was about three times the sample size among women (N=652).

Figure 1 of Forman-Rabinovici and Mandel 2022 described these results as:

Male voters leave PH coalition

Female voters continue to vote for PH coalition

But, in my analysis of the data, the ends of the 95% confidence interval for the estimate among women indicated an 82 percentage point decrease and a 15 percentage point increase [-0.82, +0.15], so that's not nearly enough evidence to infer a lack of an effect among women.

---

6.

Politics & Gender published another article that has at least a misleading interpretation of interaction terms: Kreutzer 2022 "Women's support shaken: A study of women's political trust after natural disasters".

Table 1 reports results for three multilevel mixed-effects linear regressions, with coefficients on a "Number of Disasters Present" predictor of 0.017, 0.009, and 0.022. The models have a predictor for "Female" and an interaction of "Female" and "Number of Disasters Present" with interaction coefficients of –0.001, –0.002, and –0.001. So the combination of coefficients indicates that the associations of "Number of Disasters Present" and the "trust" outcomes are positive among women, but not as positive as the associations are among men.

Kreutzer 2022 discusses this correctly in some places, such as indicating that the interaction term "allows a comparison of how disasters influence women's political trust compared with men's trust" (p. 15). But in other places the interpretation is, I think, incorrect or at least misleading, such as in the abstract (emphasis added):

I investigate women's trust in government institutions when natural disasters have recently occurred and argue that because of their unique experiences and typical government responses, women's political trust will decline when there is a natural disaster more than men's. I find that when there is a high number of disasters and when a larger percentage of the population is affected by disasters, women's political trust decreases significantly, especially institutional trust.

Or from page 23:

I have demonstrated that natural disasters create unique and vulnerable situations for women that cause their trust in government to decline.

And discussing Figure 5, referring to a different set of three regressions (reference to footnote 12 omitted):

The figure shows a small decline in women's trust (overall, institutional, organizational) as the percentage of the population affected by disasters in the country increases. The effect is significantly different from 0, but the percentage affected seems not to make a difference.

That seems to say that the percentage of the population affected has an effect that is simultaneously not zero and does not seem to make a difference. I think the Figure 5 marginal effects plots indicate that women have lower trust than men (which is why each point estimate line falls in the negative range), but that this gender difference in trust does not vary much by the percentage of the population affected (which is why each point estimate line is pretty much flat).

---

The "Women's Political Empowerment Index" coefficient and standard error are –0.017 and 0.108 in Model 4, so maybe the ** indicating a two-tailed p<0.01 is an error.

Tweet to the author (Oct 3). No reply yet.

---

7, 8.

Let's return to Politics, Groups, and Identities, for Ditonto 2019 "Direct and indirect effects of prejudice: sexism, information, and voting behavior in political campaigns". From the abstract:

I also find that subjects high in sexism search for less information about women candidates...

At least in the reported analyses, the comparison for "less" is to participants low in sexism instead of to male candidates. So we get this result discussing Table 2 (pp. 598-599):

Those who scored lowest in sexism are predicted to look at approximately 13 unique information boxes for the female candidate, while those who scored highest are predicted to access about 10 items, or almost 1/3 less.

It should be obvious to peer reviewers and any editors that a comparison to the male candidates in the experiment would be a more useful comparison for assessing the effect of sexism, because, for all we know, respondents high in sexism might search for less information than respondents low in sexism do, no matter the gender of the candidate.

Ditonto has another 2019 article in a different journal (Political Psychology) based on the same experiment: "The mediating role of information search in the relationship between prejudice and voting behavior". From that abstract:

I also find that subjects high in prejudice search for less information about minority candidates...

But, again, Table 2 in that article merely indicates that symbolic racism negatively associates with information search for a minority candidate, with no information provided about information search for a non-minority candidate.

---

And I think that the Ditonto 2019 abstracts include claims that aren't supported by results reported in the article. The PGI abstract claims that "I find that subjects with higher scores on items measuring modern sexism...rate female candidates more negatively than their male counterparts", and the PP abstract claims that "I find that subjects higher in symbolic racism...rate minority candidates more negatively than their white counterparts".

By the way, claims about respondents high in sexism or racism should be assessed using data only from respondents high in sexism or racism, because the association of a sexism or racism measure with an outcome might be completely due to respondents low in sexism or racism.

Tweet to the author (Oct 9). No reply yet.

---

9.

Below is a passage from "Lower test scores from wildfire smoke exposure", by Jeff Wen and Marshall Burke, published in 2022 in Nature Sustainability:

When we consider the cumulative losses over all study years and across subgroups (Fig. 4b), we estimate the net present value of lost future income to be roughly $544 million (95% CI: −$999 million to −$100 million) from smoke PM2.5 exposure in 2016 for districts with low economic disadvantage and low proportion of non-White students. For districts with high economic disadvantage and high proportion of non-White students, we estimate cumulative impacts to be $1.4 billion (95% CI: −$2.3 billion to −$477 million) from cumulative smoke PM2.5 exposure in 2016. Thus, of the roughly $1.7 billion in total costs during the smokiest year in our sample, 82% of the costs we estimate were borne by economically disadvantaged communities of colour.

So, in 2016, the lost future income was about $0.5 billion for low economic disadvantage / low non-White districts and $1.4 billion for high economic disadvantage / high non-White districts; that gets us to $1.9 billion, without even including the costs from low/high districts and high/low districts. But total costs were cited as roughly $1.7 billion.

From what I can tell from Figure 4b, the percentage of total costs attributed to economically disadvantaged communities of color (the high / high category) is 59%. It's not a large inferential difference from 82%, in that both estimates are a majority, but it's another example of an error that could have been caught by careful reading.

Tweet to the authors about this (Oct 17). No reply yet.

---

10.

Political Research Quarterly published "Opening the Attitudinal Black Box: Three Dimensions of Latin American Elites' Attitudes about Gender Equality", by Amy Alexander, Asbel Bohigues, and Jennifer M. Piscopo.

I was curious about the study's measurement of attitudes about gender equality, and, not unexpectedly, the measurement was not good, using items such as "In general, men make better political leaders than women". Respondents can agree that men make better political leaders, can disagree that men make better political leaders, or can be neutral about that claim, but respondents cannot report the belief that, in general, women make better political leaders than men do.

I checked the data, in case almost no respondent disagreed with the statement that "In general, men make better political leaders than women", in which case presumably no respondent would think that women make better political leaders than men do. But disagreement with the statement was pretty high, with 69% strongly disagreeing, another 15% disagreeing, and another 11% selecting neither agree nor disagree.

I tweeted a question about this to some of the authors (Oct 21). No reply yet.


Research involves a lot of decisions, which in turn provides a lot of opportunities for research to be incorrect or substandard, such as mistakes in recoding a variable, not using the proper statistical method, or not knowing unintuitive elements of statistical software such as how Stata treats missing values in logical expressions.
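
On that last point, a minimal illustration with a hypothetical "income" variable: Stata treats a missing numeric value as larger than any nonmissing number, so a bare logical expression can silently misclassify missing values.

* missing income (.) is treated as larger than 50000, so this codes
* respondents with missing income as high income
gen high_income = (income > 50000)

* safer: leave the new variable missing when income is missing
gen high_income_safe = (income > 50000) if !missing(income)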

Peer and editorial review provides opportunities to catch flaws in research, but some journals that publish political science don't seem to be consistently doing a good enough job at this. Below, I'll provide a few examples that I happened upon recently and then discuss potential ways to help address this.

---

Feinberg et al 2022

PS: Political Science & Politics published Feinberg et al 2022 "The Trump Effect: How 2016 campaign rallies explain spikes in hate", which claims that:

Specifically, we established that the words of Donald Trump, as measured by the occurrence and location of his campaign rallies, significantly increased the level of hateful actions directed toward marginalized groups in the counties where his rallies were held.

After Feinberg et al published a similar claim in the Monkey Cage in 2019, I asked the lead author about the results when the predictor of hosting a Trump rally is replaced with a predictor of hosting a Hillary Clinton rally.

I didn't get a response from Ayal Feinberg, but Lilley and Wheaton 2019 reported that the point estimate for the effect on the count of hate-motivated events is larger for hosting a Hillary Clinton rally than for hosting a Donald Trump rally. Remarkably, the Feinberg et al 2022 PS article does not address the Lilley and Wheaton 2019 claim about Clinton rallies, even though the supplemental file for the Feinberg et al 2022 PS article discusses a different criticism from Lilley and Wheaton 2019.

The Clinton rally counterfactual is an obvious way to assess the claim that something about Trump increased hate events. Even if the reviewers and editors for PS didn't think to ask about the Clinton rally counterfactual, that counterfactual analysis appears in the Reason magazine criticism that Feinberg et al 2022 discusses in its supplemental files, so the analysis was presumably available to the reviewers and editors.

Will May has published a PubPeer comment discussing other flaws of the Feinberg et al 2022 PS article.

---

Christley 2021

The impossible "p < .000" appears eight times in Christley 2021 "Traditional gender attitudes, nativism, and support for the Radical Right", published in Politics & Gender.

Moreover, Christley 2021 indicates that (emphasis added):

It is also worth mentioning that in these data, respondent sex does not moderate the relationship between gender attitudes and radical right support. In the full model (Appendix B, Table B1), respondent sex is correlated with a higher likelihood of supporting the radical right. However, this finding disappears when respondent sex is interacted with the gender attitudes scale (Table B2). Although the average marginal effect of gender attitudes on support is 1.4 percentage points higher for men (7.3) than it is for women (5.9), there is no significant difference between the two (Figure 5).

Table B2 of Christley 2021 has 0.64 and 0.250 for the logit coefficient and standard error for the "Male*Gender Scale" interaction term, with no statistical significance asterisks; the 0.64 is the only table estimate not reported to three decimal places, so it's not clear to me from the table whether the asterisks are missing or whether the estimate should be, say, 0.064 instead of 0.64. The sample size for the Table B2 regression is 19,587, so a statistically significant 1.4-percentage-point difference isn't obviously out of the question, from what I can tell.

---

Hua and Jamieson 2022

Politics, Groups, and Identities published Hua and Jamieson 2022 "Whose lives matter? Race, public opinion, and military conflict".

Participants were assigned to a control condition with no treatment, to a placebo condition with an article about baseball gloves, or to an article about a U.S. service member being killed in combat. The experimental manipulation was the name of the service member, intended to signal race: Connor Miller, Tyrone Washington, Javier Juarez, Duc Nguyen, and Misbah Ul-Haq.

Inferences from Hua and Jamieson 2022 include:

When faced with a decision about whether to escalate a conflict that would potentially risk even more US casualties, our findings suggest that participants are more supportive of escalation when the casualties are of Pakistani and African American soldiers than they are when the deaths are soldiers from other racial–ethnic groups.

But, from what I can tell, this inference of participants being "more supportive" depending on the race of the casualties is based on differences in statistical significance when each racial condition is compared to the control condition. Figure 5 indicates a large enough overlap between confidence intervals for the racial conditions for this escalation outcome to prevent a confident claim of "more supportive" when comparing racial conditions to each other.

Figure 5 seems to plot estimates from the first column in Table C.7. The largest racial gap in estimates is between the Duc Nguyen condition (0.196 estimate and 0.133 standard error) and the Tyrone Washington condition (0.348 estimate and 0.137 standard error). So this difference in means is 0.152, and I don't think that there is sufficient evidence to infer that these estimates differ from each other. 83.4% confidence intervals would be about [0.01, 0.38] and [0.15, 0.54].
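
A minimal Stata check of those intervals from the reported estimates and standard errors (83.4% intervals are useful here because non-overlap roughly corresponds to a p<0.05 difference between two independent estimates):

* 83.4% confidence interval for the Duc Nguyen condition
display 0.196 - invnormal(0.917)*0.133
display 0.196 + invnormal(0.917)*0.133

* 83.4% confidence interval for the Tyrone Washington condition
display 0.348 - invnormal(0.917)*0.137
display 0.348 + invnormal(0.917)*0.137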

---

Walker et al 2022

PS: Political Science & Politics published Walker et al 2022 "Choosing reviewers: Predictors of undergraduate manuscript evaluations", which, for the regression predicting reviewer ratings of manuscript originality, interpreted a statistically significant -0.288 OLS coefficient for "White" as indicating that "nonwhite reviewers gave significantly higher originality ratings than white reviewers". But the table note indicates that the "originality" outcome variable is coded 1 for yes, 2 for maybe, and 3 for no, so that the "higher" originality ratings actually indicate lower ratings of originality.

Moreover, Walker et al 2022 claims that:

There is no empirical linkage between reviewers' year in school and major and their assessment of originality.

But Table 2 indicates p<0.01 evidence that reviewer major associates with assessments of originality.

And the "a", "b", and "c" notes for Table 2 are incorrectly matched to the descriptions; for example, the "b" note about the coding of the originality outcome is attached to the other outcome.

The "higher originality ratings" error has been corrected, but not the other errors. I mentioned only the "higher" error in this tweet, so maybe that explains that. It'll be interesting to see if PS issues anything like a corrigendum about "Trump rally / hate" Feinberg et al 2022, given that the flaw in Feinberg et al 2022 seems a lot more important.

---

Fattore et al 2022

Social Science Quarterly published Fattore et al 2022 "'Post-election stress disorder?' Examining the increased stress of sexual harassment survivors after the 2016 election". For a sample of women participants, the analysis uses reported experience being sexually harassed to predict a dichotomous measure of stress due to the 2016 election, net of controls.

Fattore et al 2022 Table 1 reports the standard deviation for a presumably multilevel categorical race variable that ranges from 0 to 4 and for a presumably multilevel categorical marital status variable that ranges from 0 to 2. Fattore et al 2022 elsewhere indicates that the race variable was coded 0 for white and 1 for minority, but indicates that the marital status variable is coded 0 for single, 1 for married/coupled, and 2 for separated/divorced/widowed, so I'm not sure how to interpret regression results for the marital status predictor.

And Fattore et al 2022 has this passage:

With 95 percent confidence, the sample mean for women who experienced sexual harassment is between 0.554 and 0.559, based on 228 observations. Since the dependent variable is dichotomous, the probability of a survivor experiencing increased stress symptoms in the post-election period is almost certain.

I'm not sure how to interpret that passage: Is the 95% confidence interval that thin (0.554, 0.559) based on 228 observations? Is the mean estimate of about 0.554 to 0.559 being interpreted as almost certain? Here is the paragraph that that passage is from.

---

Hansen and Dolan 2022

Political Behavior published Hansen and Dolan 2022 "Cross‑pressures on political attitudes: Gender, party, and the #MeToo movement in the United States".

Table 1 of Hansen and Dolan 2022 reported results from a regression limited to 694 Republican respondents in a 2018 ANES survey, which indicated that the predicted feeling thermometer rating about the #MeToo movement was 5.44 units higher among women than among men, net of controls, with a corresponding standard error of 2.31 and a statistical significance asterisk. However, Hansen and Dolan 2022 interpreted this to not provide sufficient evidence of a gender gap:

In 2018, we see evidence that women Democrats are more supportive of #MeToo than their male co-partisans. However, there was no significant gender gap among Republicans, which could signal that both women and men Republican identifiers were moved to stand with their party on this issue in the aftermath of the Kavanaugh hearings.

Hansen and Dolan 2022 indicated that this inference of no significant gender gap is because, in Figure 1, the relevant 95% confidence interval for Republican men overlapped with the corresponding 95% confidence interval for Republican women.

Footnote 9 of Hansen and Dolan 2022 noted that assessing statistical significance using overlap of 95% confidence intervals is a "more rigorous standard" than using a p-value threshold of p=0.05 in a regression model. But Footnote 9 also claimed that "Research suggests that using non-overlapping 95% confidence intervals is equivalent to using a p < .06 standard in the regression model (Schenker & Gentleman, 2001)", and I don't think that this "p < .06" claim is correct, or at least I think that it is misleading.

My Stata analysis of the data for Hansen and Dolan 2022 indicated that the p-value for the gender gap among Republicans on this item is p=0.019, which is about what would be expected given data in Table 1 of a t-statistic of 5.44/2.31 and more than 600 degrees of freedom. From what I can tell, the key evidence from Schenker and Gentleman 2001 is Figure 3, which indicates that the probability of a Type 1 error using the overlap method is about equivalent to p=0.06 only when the ratio of the two standard errors is about 20 or higher.
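
A minimal Stata check of that p-value from the reported Table 1 numbers, using 600 as a round number for the degrees of freedom:

* two-tailed p-value for a t statistic of 5.44/2.31 with 600 degrees of
* freedom, which returns roughly 0.019
display 2*ttail(600, 5.44/2.31)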

This discrepancy in inferences might have been avoided if 83.4% confidence intervals were more commonly taught and recommended by editors and reviewers, for visualizations in which the key comparison is between two estimates.

---

Footnote 10 of Hansen and Dolan 2022 states:

While Fig. 1 appears to show that Republicans have become more positive towards #MeToo in 2020 when compared to 2018, the confidence bounds overlap when comparing the 2 years.

I'm not sure what that refers to. Figure 1 of Hansen and Dolan 2022 reports estimates for Republican men in 2018, Republican women in 2018, Republican men in 2020, and Republican women in 2020, with point estimates increasing in that order. Neither 95% confidence interval for Republicans in 2020 overlaps with either 95% confidence interval for Republicans in 2018.

---

Other potential errors in Hansen and Dolan 2022:

[1] The code for the 2020 analysis uses V200010a, which is a weight variable for the pre-election survey, even though the key outcome variable (V202183) was on the post-election survey.

[2] Appendix B Table 3 indicates that 47.3% of the 2018 sample was Republican and 35.3% was Democrat, but the sample sizes for the 2018 analysis in Table 1 are 694 for the Republican-only analysis and 1,001 for the Democrat-only analysis, so the group with the larger reported percentage has the smaller analysis sample.

[3] Hansen and Dolan 2022 refers multiple times to predictions of feeling thermometer ratings as predicted probabilities, and notes for Tables 1 and 2 indicate that the statistical significance asterisk is for "statistical significance at p > 0.05".

---

Conclusion

I sometimes make mistakes, such as misspelling an author's name in a prior post. In 2017, I preregistered an analysis that used overlap of 95% confidence intervals to assess evidence for the difference between estimates, instead of a preferable direct test for a difference. So some of the flaws discussed above are understandable. But I'm not sure why all of these flaws got past review at respectable journals.

Some of the flaws discussed above are, I think, substantial, such as the political bias in Feinberg et al 2022 not reporting a parallel analysis for Hillary Clinton rallies, especially with the Trump rally result being prominent enough to get a fact check from PolitiFact in 2019. Some of the flaws discussed above are trivial, such as "p < .000". But even trivial flaws might justifiably be interpreted as reflecting a review process that is less rigorous than it should be.

---

I think that peer review is valuable at least for its potential to correct errors in analyses and to get researchers to report results that they otherwise wouldn't report, such as a robustness check suggested by a reviewer that undercuts the manuscript's claims. But peer review as currently practiced doesn't seem to do that well enough.

Part of the problem might be that peer review at a lot of political science journals combines [1] assessment of the contribution of the manuscript and [2] assessment of the quality of the analyses, often for manuscripts that are likely to be rejected. Some journals might benefit from having a (or having another) "final boss" who carefully reads conditionally accepted manuscripts only for assessment [2], to catch minor "p < .000" types of flaws, to catch more important "no Clinton rally analysis" types of flaws, and to suggest robustness checks and additional analyses.

But even better might be opening peer review to volunteers, who collectively could plausibly do a better job than a final boss could do alone. I discussed the peer review volunteer idea in this symposium entry. The idea isn't original to me; for example, Meta-Psychology offers open peer review. The modal number of peer review volunteers for a publication might be zero, but there is a good chance that I would have raised the "no Clinton rally analysis" criticism had PS posted a conditionally accepted version of Feinberg et al 2022.

---

Another potentially good idea would be for journals or an organization such as APSA to post at least a small set of generally useful advice, such as reporting results for a test for differences between estimates if the manuscript suggests a difference between estimates. More specific advice could be posted by topic, such as, for count analyses, advice about predicting counts in which the opportunity varies by observation: Lilley and Wheaton 2019 discussed this page, but I think that this page has an explanation that is easier to understand.
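
On the count-analysis point, here is a minimal Stata sketch of the kind of specification those pages discuss, with hypothetical variable names ("hate_events", "trump_rally", "population"): an exposure term lets the model predict event rates rather than raw counts when opportunity varies across observations.

* Poisson model for the count of hate-motivated events, with county
* population as the exposure so that larger counties are not expected to
* have the same raw count as smaller counties
poisson hate_events trump_rally, exposure(population)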

---

NOTES

1. It might be debatable whether this is a flaw per se, but Long 2022 "White identity, Donald Trump, and the mobilization of extremism" reported correlational results from a survey experiment but, from what I can tell, didn't indicate whether any outcomes differed by treatment.

2. Data for Hansen and Dolan 2022. Stata code for my analysis:

* check the 2020 pre-election weight variable and post-election #MeToo item
desc V200010a V202183

* declare the survey weights
svyset [pw=weight]

* #MeToo rating regressed on gender and controls, among Republican respondents
svy: reg metoo education age Gender race income ideology2 interest media if partyid2=="Republican"

* mean #MeToo rating among Republican women
svy: mean metoo if partyid2=="Republican" & women==1

3. The journal Psychological Science is now publishing peer reviews. Peer reviews are also available for the journal Meta-Psychology.

4. Regarding the prior post about Lacina 2022 "Nearly all NFL head coaches are White. What are the odds?", Bethany Lacina discussed that with me on Twitter. I have published an update at that post.

5. I emailed or tweeted to at least some authors of the aforementioned publications discussing the planned comments or indicating at least some of the criticism. I received some feedback from one of the authors, but the author didn't indicate that I had permission to acknowledge the author.


Below are leftover comments on publications that I read in 2021.

---

ONO AND ZILIS 2021

Politics, Groups, and Identities published Ono and Zilis 2021, "Do Americans perceive diverse judges as inherently biased?". Ono and Zilis 2021 indicated that "We test whether Americans perceive diverse judges as inherently biased with a list experiment". The statements to test whether Americans perceive diverse judges to be "inherently biased" were:

When a court case concerns issues like #metoo, some women judges might give biased rulings.

When a court case concerns issues like immigration, some Hispanic judges might give biased rulings.

Ono and Zilis 2021 indicated that "...by endorsing that idea, without evidence, that 'some' members of a group are inclined to behave in an undesirable way, respondents are engaging in stereotyping" (pp. 3-4).

But statements about whether *some* Hispanic judges and *some* women judges *might* be biased can't measure stereotypes or the belief that Hispanic judges or women judges are *inherently* biased. For example, a belief that *some* women *might* commit violence doesn't require the belief that women are inherently violent and doesn't even require the belief that women are on average more violent than men are.

---

Ono and Zilis 2021 claimed that "Hispanics do not believe that Hispanic judges are biased" (p. 4, emphasis in the original), but, among Hispanic respondents, the 95% confidence interval for agreement with the claim that Hispanic judges might be biased in cases involving issues like immigration did not cross zero in the multivariate analyses in Figure 1.

For Table 2 analyses without controls, the corresponding point estimate indicated that 25 percent of Hispanics agreed with the claim about Hispanic judges, but the ratio of the relevant coefficient to standard error was 0.25/0.15, which is about 1.67, depending on how the 0.25 and 0.15 were rounded. The corresponding p-value isn't less than p=0.05, but that doesn't support the conclusion that the percentage of Hispanics that agreed with the statement is zero.
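
A minimal Stata check of the corresponding two-tailed p-value for that ratio:

* two-tailed p-value for a z statistic of about 0.25/0.15, which is roughly 0.10
display 2*(1 - normal(0.25/0.15))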

---

BERRY ET AL 2021

Politics, Groups, and Identities published Berry et al 2021, "White identity politics: linked fate and political participation". Berry et al 2021 claimed to have found "notable partisan differences in the relationship between racial linked fate and electoral participation for White Americans". But this claim is based on differences in the presence of statistical significance between estimates for White Republicans and estimates for White Democrats ("Linked fate is significantly and consistently associated with increased electoral participation for Republicans, but not Democrats", p. 528), instead of being based on statistical tests of whether estimates for White Republicans differ from estimates for White Democrats.

The estimates in the Berry et al 2021 appendix that I highlighted in yellow appear to be incorrect, in terms of plausibility and based on the positive estimate in the corresponding regression output.

---

ARCHER AND CLIFFORD FORTHCOMING

In "Improving the Measurement of Hostile Sexism" (reportedly forthcoming at Public Opinion Quarterly), Archer and Clifford proposed a modified version of the hostile sexism scale that is item specific. For example, instead of measuring responses about the statement "Women exaggerate problems they have at work", the corresponding item-specific item measures responses to the question of "How often do women exaggerate problems they have at work?". Thus, to get the lowest score on the hostile sexism scale, instead of merely strongly disagreeing that women exaggerate problems they have at work, respondents must report the belief that women *never* exaggerate problems they have at work.

---

Archer and Clifford indicated that responses to some of their revised items are measured on a bipolar scale. For example, respondents can indicate that women are offended "much too often", "a bit too often", "about the right amount", "not quite often enough", or "not nearly often enough". So to get the lowest hostile sexism score, respondents need to indicate that women are wrong about how often they are offended, by not being offended enough.

Scott Clifford, co-author of the Archer and Clifford article, engaged me in a discussion about the item specific scale (archived here). Scott suggested that the low end of the scale is more feminist, but dropped out of the conversation after I asked how much of an OLS coefficient for the proposed item-specific hostile sexism scale is due to hostile sexism and how much is due to feminism.

The portion of the hostile sexism measure that is sexism seems like something that should have been addressed in peer review, if the purpose of a hostile sexism scale is to estimate the effect of sexism and not to merely estimate the effect of moving from highly positive attitudes about women to highly negative attitudes about women.

---

VIDAL ET AL 2021

Social Science Quarterly published Vidal et al 2021, "Identity and the racialized politics of violence in gun regulation policy preferences". Appendix A indicates that, for the American National Election Studies 2016 Time Series Study, responses to the feeling thermometer about Black Lives Matter ranged from 0 to 999, with a standard deviation of 89.34, even though the ANES 2016 feeling thermometer for Black Lives Matter ran from 0 to 100, with 999 reserved for respondents who indicate that they don't know what Black Lives Matter is.

---

ARORA AND STOUT 2021

Research & Politics published Arora and Stout 2021 "After the ballot box: How explicit racist appeals damage constituents views of their representation in government", which noted that:

The results provide evidence for our hypothesis that exposure to an explicitly racist comment will decrease perceptions of interest representation among Black and liberal White respondents, but not among moderate and conservative Whites.

This is, as far as I can tell, a claim that the effect among Black and liberal White respondents will differ from the effect among moderate and conservative Whites, but Arora and Stout 2021 did not report a test of whether these effects differ, although Arora and Stout 2021 did discuss statistical significance for each of the four groups.

Moreover, Arora and Stout 2021 footnote 4 indicates that:

In the supplemental appendix, we confirm that explicit racial appeals have a unique effect on interest representation and are not tied to other candidate evaluations such as vote choice.

But the estimated effect on interest representation among liberal White respondents (Table 1) was -0.06 units, with a "+" indicator for statistical significance, which is the same reported number, with the same "+" indicator, as the estimated effect on vote choice among liberal White respondents (Table A5).

None of the other estimates in Table 1 or Table A5 have an indicator for statistical significance.

---

Arora and Stout 2021 repeatedly labeled as "explicitly racist" the statement that "If he invited me to a public hanging, I'd be on the front row", but it's not clear to me how that statement is explicitly racist. The Data and Methodology section indicates that "Though the comment does not explicitly mention the targeted group...". Moreover, the Conclusion of Arora and Stout 2021 indicates that...

In spite of Cindy Hyde-Smith's racist comments during the 2018 U.S. Senate election which appeared to show support for Mississippi's racist and violent history, she still prevailed in her bid for elected office.

... and "appeared to" isn't language that I would expect from an explicit statement.

---

CHRISTIANI ET AL 2021

The Journal of Race, Ethnicity, and Politics published Christiani et al 2021 "Masks and racial stereotypes in a pandemic: The case for surgical masks". The abstract indicates that:

...We find that non-black respondents perceive a black male model as more threatening and less trustworthy when he is wearing a bandana or a cloth mask than when he is not wearing his face covering—especially those respondents who score above average in racial resentment, a common measure of racial bias. When he is wearing a surgical mask, however, they do not perceive him as more threatening or less trustworthy. Further, it is not that non-black respondents find bandana and cloth masks problematic in general. In fact, the white model in our study is perceived more positively when he is wearing all types of face coverings.

Those are the within-model patterns, but it's interesting to compare ratings of the models in the control, pictured below:

Appendix Table B.1 indicates that, on average, non-Black respondents rated the White model as more threatening and more untrustworthy than the Black model: on a 0-to-1 scale, among non-Black respondents, the mean ratings of "threatening" were 0.159 for the Black model and 0.371 for the White model, and the mean ratings of "untrustworthy" were 0.128 for the Black model and 0.278 for the White model. These Black/White gaps were about five times as large as the corresponding standard errors.
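To make that comparison concrete, here is the form of the calculation as a short R sketch; the standard errors below are placeholders of mine, not values from the paper's appendix:

gap_threat <- 0.371 - 0.159                  # White minus Black mean "threatening" rating
se_black <- 0.03                             # placeholder standard error (assumption)
se_white <- 0.03                             # placeholder standard error (assumption)
gap_threat / sqrt(se_black^2 + se_white^2)   # roughly a z-statistic for the gap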

Christiani et al 2021 claimed that this baseline difference does not undermine their results:

Fortunately, the divergent evaluations of our two models without their masks on do not undermine either of the main thrusts of our analyses. First, we can still compare whether subjects perceive the black model differently depending on what type of mask he is wearing...Second, we can still assess whether people resolve the ambiguity associated with seeing a man in a mask based on the race of the wearer.

But I'm not sure that it is true that "divergent evaluations of our two models without their masks on do not undermine either of the main thrusts of our analyses".

I tweeted a question to one of the Christiani et al 2021 co-authors that included the handles of two other co-authors, asking whether it was plausible that masks increase the perceived threat of persons who look relatively nonthreatening without a mask but decrease the perceived threat of persons who look relatively more threatening without a mask. That phenomenon would explain the racial difference in patterns described in the abstract, given that the White model in the control was perceived to be more threatening than the Black model in the control.

No co-author has yet responded to defend their claim.

---

Below are the mean ratings on the 0-to-1 "threatening" scale for models in the "no mask" control group, among non-Black respondents by high and low racial resentment, based on Tables B.2 and B.3:

                          White  Black
                          Model  Model
High racial resentment    0.331  0.376
Low racial resentment     0.460  0.159

---

VICUÑA AND PÉREZ 2021

Politics, Groups, and Identities published Vicuña and Pérez 2021, "New label, different identity? Three experiments on the uniqueness of Latinx", which claimed that:

Proponents have asserted, with sparse empirical evidence, that Latinx entails greater gender-inclusivity than Latino and Hispanic. Our results suggest this inclusivity is real, as Latinx causes individuals to become more supportive of pro-LGBTQ policies.

The three studies discussed in Vicuña and Pérez 2021 had these prompts, with the bracketed text indicating the differences in treatments across the four conditions:

Using the spaces below, please write down three (3) attributes that make you [a unique person/Latinx/Latino/Hispanic]. These could be physical features, cultural practices, and/or political ideas that you hold [as a member of this group].

If the purpose is to assess whether "Latinx" differs from "Latino" and "Hispanic", I'm not sure of the value of the "a unique person" treatment.

Discussing their first study, Vicuña and Pérez 2021 reported the p-value for the effect of the "Latinx" treatment relative to the "unique person" treatment (p<.252) and reported the p-values for the effect of the "Latinx" treatment relative to the "Latino" treatment (p<.046) and the "Hispanic" treatment (p<.119). Vicuña and Pérez 2021 reported all three corresponding p-values when discussing their second study and their third study.

But, discussing their meta-analysis of the three studies, Vicuña and Pérez 2021 reported only one p-value, which is presumably for the effect of the "Latinx" treatment relative to the "unique person" treatment.

I tweeted a request to the authors on Dec 20 to post their data, but I haven't yet received a reply.

---

KIM AND PATTERSON JR. 2021

Political Science & Politics published Kim and Patterson Jr. 2021, "The Pandemic and Gender Inequality in Academia", which reported on tweets of tenure-track political scientists in the United States.

Kim and Patterson Jr. 2021 Figure 2 indicates that, in February 2020, the percentage of work-related tweets was about 11 percent for men and 11 percent for women, and that, shortly after Trump declared a national emergency, these percentages had dropped to about 8 percent and 7 percent respectively. Table 2 reports difference-in-difference results indicating that the pandemic-related decrease in the percentage of work-related tweets was 1.355 percentage points larger for women than for men.
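As a point of reference for how such an estimate is produced, here is a minimal difference-in-differences sketch in R, with toy data and hypothetical variable names (work_pct, female, pandemic); this is not the authors' data or exact specification.

# Toy data: each row is an academic-period observation.
set.seed(2)
tweets <- data.frame(female   = rbinom(1000, 1, 0.5),
                     pandemic = rbinom(1000, 1, 0.5))
tweets$work_pct <- 10 - 3 * tweets$pandemic + rnorm(1000)  # toy outcome

# The female:pandemic coefficient is the difference-in-differences estimate,
# analogous to the reported 1.355-percentage-point differential decline.
summary(lm(work_pct ~ female * pandemic, data = tweets))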

That seems like a gender inequality that is relatively small in size and importance, and I'm not sure that this gender inequality in the percentage of work-related tweets offsets the advantage of having the 31.5k-follower @womenalsoknow account tweet about one's research.

---

The abstract of Kim and Patterson Jr. 2021 refers to "tweets from approximately 3,000 political scientists". Table B1 in Appendix B has a sample size of 2,912, with a larger number of women than men at the rank of assistant professor, at the rank of associate professor, and at the rank of full professor. The APSA dashboard indicates that women were 37% of members of the American Political Science Association and that 79.5% of APSA members are in the United States, so I think that Table B1 suggests that a higher percentage of female political scientists than male political scientists might be on Twitter.

Oddly, though, when discussing the representativeness of this sample, Kim and Patterson Jr. 2021 indicated that (p. 3):

Yet, relevant to our design, we found no evidence that female academics are less likely to use Twitter than male colleagues conditional on academic rank.

That's true about not being *less* likely, but my analysis of the data for Kim and Patterson Jr. 2021 Table 1 indicated that, controlling for academic rank, the percentage of female political scientists from top 50 departments on Twitter was about 5 percentage points higher than the corresponding percentage for male political scientists from top 50 departments.

Table 1 of Kim and Patterson Jr. 2021 is limited to the 1,747 tenure-track political scientists in the United States from top 50 departments. I'm not sure why Kim and Patterson Jr. 2021 didn't use the full N=2,912 sample for the Table 1 analysis.

---

My analysis indicated that the female/male gaps in the sample were as follows: 2.3 percentage points (p=0.655) among assistant professors, 4.5 percentage points (p=0.341) among associate professors, and 6.7 percentage points (p=0.066) among full professors, with an overall 5 percentage point female/male gap (p=0.048) conditional on academic rank.
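Something like the following produces gaps of that form, shown here with a toy data frame and a linear probability model; this is an illustration, not my exact code or the actual sample.

# Toy stand-in for the sample: twitter (0/1), female (0/1), rank (factor).
set.seed(3)
profs <- data.frame(female = rbinom(1747, 1, 0.4),
                    rank   = factor(sample(c("assistant", "associate", "full"),
                                           1747, replace = TRUE)))
profs$twitter <- rbinom(1747, 1, 0.5)

# Per-rank female/male gaps in Twitter use:
lapply(split(profs, profs$rank),
       function(d) coef(summary(lm(twitter ~ female, data = d))))

# Overall gap conditional on academic rank:
summary(lm(twitter ~ female + rank, data = profs))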

---

Kim and Patterson Jr. 2021 suggested a difference in the effect by rank:

Disaggregating these results by academic rank reveals an effect most pronounced among assistants, with significant—albeit smaller—effects for associates. There is no differential effect on work-from-home at the rank of full professor, which is consistent with our hypothesis that these gaps are driven by the increased obligations placed on women who are parenting young children.

But I don't see a test for whether the coefficients differ from each other. For example, in Table 2 for work-related tweets, the "Female * Pandemic" coefficient is -1.188 for associate professors and is -0.891 for full professors, for a difference of 0.297, relative to the respective standard errors of 0.579 and 0.630.
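For what it's worth, a rough large-sample calculation, treating the two rank-specific estimates as independent (an assumption on my part), suggests that this difference is well within sampling error:

diff_coef <- -1.188 - (-0.891)        # difference between the coefficients: -0.297
se_diff   <- sqrt(0.579^2 + 0.630^2)  # standard error of the difference: about 0.856
z <- diff_coef / se_diff              # about -0.35
2 * pnorm(-abs(z))                    # two-sided p-value of about 0.73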

---

Table 1 of Kim and Patterson Jr. 2021 reported a regression predicting whether a political scientist in a top 50 department was a Twitter user, and the p-values were above 0.05 for all coefficients for "female" and for all interactions involving "female". That might be interpreted as a lack of evidence for a gender difference in Twitter use among these political scientists, but the interaction terms don't permit a clear inference about an overall gender difference.

For example, associate professor is the omitted category of rank in the regression, so the 0.045 non-statistically significant "female" coefficient indicates only that female associate professor political scientists from top 50 departments were 4.5 percentage points more likely to be Twitter users than male associate professor political scientists from top 50 departments.

And the non-statistically significant "Female X Assistant" coefficient doesn't indicate whether female assistant professors differ from male assistant professors: instead, the non-statistically significant "Female X Assistant" coefficient indicates only that the associate/assistant difference among men in the sample does not differ at p<0.05 from the associate/assistant difference among women in the sample.

Link to the data. R code for my analysis. R output from my analysis.

---

LEFTOVER PLOT

I had the plot below for a draft post that I hadn't yet published:

Item text: "For each of the following groups, how much discrimination is there in the United States today?" [Blacks/Hispanics/Asians/Whites]. Substantive response options were: A great deal, A lot, A moderate amount, A little, and None at all.

Data source: American National Election Studies. 2021. ANES 2020 Time Series Study Preliminary Release: Combined Pre-Election and Post-Election Data [dataset and documentation]. March 24, 2021 version. www.electionstudies.org.

Stata and R code. Dataset for the plot.
